pzivich / zEpid

Epidemiology analysis package
http://zepid.readthedocs.org
MIT License
141 stars, 33 forks

Add G-formula #10

Closed: pzivich closed this issue 6 years ago

pzivich commented 6 years ago

One lofty goal is to implement the G-formula. I would need to code two versions: time-fixed and time-varying. The chapter by Robins & Hernan is a good reference. I have code that implements the g-formula using pandas, and it is reasonably fast.

TODO: generalize to a class; allow users to input models and then predict; determine how to let users input custom treatment regimes (all/none/natural course are easy to do); compare results (https://www.ncbi.nlm.nih.gov/pubmed/25140837)

The time-fixed version will be relatively easy to write up.
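For readers unfamiliar with the time-fixed case, here is a minimal nonparametric sketch in pandas. It is not the zEpid implementation: the function name `time_fixed_gformula`, the column names, and the toy data are all made up for illustration, and stratum-specific means stand in for a fitted outcome model. It covers only the "all" and "none" regimes mentioned in the TODO.

```python
import pandas as pd

def time_fixed_gformula(df, treatment, outcome, confounders, treat_value):
    """Hypothetical helper: nonparametric time-fixed g-formula.

    Standardizes stratum-specific outcome means over the observed
    confounder distribution, with treatment set to treat_value for everyone.
    """
    # "Outcome model": mean outcome within each treatment-by-confounder stratum
    strata_means = (
        df.groupby([treatment] + confounders)[outcome]
        .mean()
        .rename("pred")
        .reset_index()
    )
    # Copy the confounders and set everyone's treatment to the chosen value
    g = df[confounders].copy()
    g[treatment] = treat_value
    # "Predict" by looking up each row's stratum mean under the intervention
    g = g.merge(strata_means, on=[treatment] + confounders, how="left")
    # Marginal risk under the intervention
    return g["pred"].mean()

# Toy data, invented for this example only
data = pd.DataFrame({
    "female": [0, 0, 0, 0, 1, 1, 1, 1],
    "art":    [0, 0, 1, 1, 0, 1, 1, 1],
    "dead":   [1, 0, 0, 0, 1, 1, 0, 0],
})
risk_all = time_fixed_gformula(data, "art", "dead", ["female"], 1)   # treat all
risk_none = time_fixed_gformula(data, "art", "dead", ["female"], 0)  # treat none
risk_difference = risk_all - risk_none
```

A parametric version would replace the stratum means with a fitted regression model and its predictions, which is what "allow input models then predict" in the TODO points toward.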

The time-varying version will need the ability to specify a large number of models and the order in which the models are fit.

Note: I am also considering reorganizing in v0.2.0 so that IPW/g-formula/doubly robust are all contained within a folder called causal, rather than adding to the current ipw folder.

pzivich commented 6 years ago

Thoughts on how to allow custom specification:

Allow the input to be a str object like "((gf['age']>=25) & (gf['female']==1))". This would select females aged 25 or older and apply the treatment to this group only. In the background, the program would take that string and evaluate it via eval(treatment).

Don't know how I feel about this solution. The user would HAVE to refer to the data as gf for the function to work properly, and I would need to check this somehow. It doesn't seem that elegant, but it should get the job done. However, it seems easy for the user to break...

pzivich commented 6 years ago

This is the solution I am going to use. It works well in my testing. The user will need to specify "((g['age']>=25) & (g['female']==1))" for custom treatment strategies. I am happy with how it works, and it seems to be the easiest/most user-friendly solution I have come across.

Still might want to add checks for whether the user specified g correctly; otherwise it might cause some trouble.
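A rough sketch of how the eval-based approach with a basic check could look. The helper name `apply_custom_treatment` and the column names are hypothetical, and the check is deliberately crude (it only looks for `g[` in the string). This is not the code that ended up in zEpid.

```python
import pandas as pd

def apply_custom_treatment(df, treatment_col, rule):
    """Hypothetical helper: treat only the rows selected by a rule string.

    The rule must refer to the data as g, e.g. "(g['age']>=25)".
    """
    # Crude check that the user referred to the data as g
    if "g[" not in rule:
        raise ValueError("treatment rule must reference the data as g, e.g. g['age']")
    g = df.copy()
    # Evaluate the rule with g as the only visible name; builtins are hidden.
    # This limits accidents but does not make eval safe for untrusted input.
    mask = eval(rule, {"__builtins__": {}}, {"g": g})
    # Treated (1) where the rule is true, untreated (0) elsewhere
    g[treatment_col] = mask.astype(int)
    return g

# Toy data, invented for this example only
people = pd.DataFrame({"age": [20, 30, 40], "female": [1, 1, 0]})
treated = apply_custom_treatment(
    people, "art", "((g['age']>=25) & (g['female']==1))"
)
```

With this check, a rule written against `gf` instead of `g` raises a `ValueError` up front rather than failing inside `eval` with a less helpful `NameError`.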

pzivich commented 6 years ago

For TimeVaryGFormula, lagged variables are causing me a bit of a problem.

Basically, I think the user will need to specify which variables are lagged and what they correspond to, via a dictionary where the keys are the variables and the values are the columns holding the lagged versions.
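To make the dictionary idea concrete, here is a small sketch. The helper `carry_forward_lags` and the column names (`cd4`, `lag_cd4`, and so on) are invented for illustration; how TimeVaryGFormula actually consumes such a dictionary is not described in this thread.

```python
import pandas as pd

# Keys are the current variables, values are the columns holding their lagged versions
lags = {"cd4": "lag_cd4", "art": "lag_art"}

def carry_forward_lags(g, lags):
    """Hypothetical helper: before moving to the next time step, copy each
    current value into its lag column so the next step's models see it."""
    g = g.copy()
    for current, lagged in lags.items():
        g[lagged] = g[current]
    return g

# Toy data for a single time step, invented for this example only
step = pd.DataFrame({
    "cd4": [500, 300], "lag_cd4": [600, 400],
    "art": [1, 0],     "lag_art": [0, 0],
})
next_step = carry_forward_lags(step, lags)
```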

With this addition, I should have some executable code for both g-formulas.

TODO: testing (compared to 722); see if there is any way to make it easier for the user; optimize run time (I am particularly concerned about the time-varying g-formula since it does a lot of operations).

pzivich commented 6 years ago

Also, functional forms for continuous covariates that are predicted are another complex issue... Not sure how to fix that yet.

pzivich commented 6 years ago

An ugly, and probably confusing, solution is to use exec() to do the variable recoding. Essentially, this will require the user to write a block of recoding code, like code_to_recode = "g['var_sq'] = g['var']**2; g['var_cu'] = g['var']**3"

Not happy with this as the solution; however, I am going to proceed since I don't know of another one yet. Handling functional forms for time-varying confounders is not great...

While not the most elegant solution, it should give the user a lot of control. It will get cumbersome for very complicated recodings, though. In that case it might be better to write a quick function and call it through this mechanism.
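A small sketch of both options: the recoding string run through exec(), and a plain function called through the same mechanism. The names `recode`, `g_str`, and `g_fn` are made up for illustration. The sketch assumes the data is exposed to the executed code under the name g, as in the comments above.

```python
import pandas as pd

# Option 1: the user supplies recoding statements as a string
code_to_recode = "g['var_sq'] = g['var']**2; g['var_cu'] = g['var']**3"

g_str = pd.DataFrame({"var": [1.0, 2.0, 3.0]})
# Run the statements with the data visible under the name g.
# The column assignments modify the DataFrame in place.
exec(code_to_recode, {}, {"g": g_str})

# Option 2: write the recoding as an ordinary function and call it
# through the same exec() mechanism
def recode(g):
    g["var_sq"] = g["var"] ** 2
    g["var_cu"] = g["var"] ** 3

g_fn = pd.DataFrame({"var": [1.0, 2.0, 3.0]})
exec("recode(g)", {"recode": recode}, {"g": g_fn})
```

The function version is easier to test and debug on its own, which matters once the recoding gets complicated.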

pzivich commented 6 years ago

As somewhat expected, the time-varying g-formula takes a while to run. A memory error occurs with the sort_values functionality.

I need to look into ways to minimize the memory draw (and speed everything up if possible).

pzivich commented 6 years ago

Time-varying G-formula works. It is a little slow and uses too much memory. It currently calls pandas DataFrame.sort_values() after each step, which takes up too much memory. I need to find an alternative for sorting by two columns (ID, time), or another way to append that preserves the order.
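One possible direction, sketched under assumptions: collect each time step's rows in a list, concatenate once, and sort by (ID, time) a single time at the end instead of after every step. The function `simulate_steps` and the column names are hypothetical, the prediction step is left as a placeholder, and whether this avoids the memory error in the real code is untested.

```python
import pandas as pd

def simulate_steps(baseline, n_steps):
    """Hypothetical sketch: build the long (id, time) data set with one
    concat and one sort, rather than sorting after each step."""
    pieces = []
    g = baseline.copy()
    for t in range(n_steps):
        g = g.copy()
        g["time"] = t
        # ... predicting covariates, treatment, and outcome at time t would go here ...
        pieces.append(g)
    # Single concatenation of all time steps
    out = pd.concat(pieces, ignore_index=True)
    # Single sort by the two columns
    return out.sort_values(["id", "time"]).reset_index(drop=True)

# Toy baseline data, invented for this example only
baseline = pd.DataFrame({"id": [2, 1], "cd4": [300, 500]})
long_data = simulate_steps(baseline, 3)
```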

pzivich commented 6 years ago

Have working versions of both the time-fixed and time-varying G-formula. Closing this issue since they have both been added to the GitHub page. The TODO list still lives in the TimeVary file itself.

At this stage, the implementation is still experimental and has not been verified.