pzivich / zEpid

Epidemiology analysis package
http://zepid.readthedocs.org
MIT License

Confusion about the time in MonteCarloGFormula #101

Closed Jnnocent closed 5 years ago

Jnnocent commented 5 years ago

Hi, I am confused about how to use the time and history data. It seems that we should build a model with data from questionnaires k-1 and k-2, but I couldn't find such a code implementation. I found that you used data from time k only while fitting your model.

- line 176: exposure_model
- line 207: outcome_model
- line 274: add_covariate_model

pzivich commented 5 years ago

Hi @Jnnocent MonteCarloGFormula fits the same model for all time points (t = {1, 2, 3, ... k}). To account for variable history, you can use lagged variables. In this notebook, I include the time-varying variables at t=k-1 and t=k in the outcome model ('art + lag_art + ...'). The variables from the previous time point are prefixed lag_ and take the value from the previous time point. As you will see in the fit() statement, there is a line to specify which variables are lagged in the Monte-Carlo procedure.

While MonteCarloGFormula does not use distinct models for k-1 and k-2, it allows you to put the history of variables into the regression model at time k. While I only put time-varying variables from the previous time point (k-1) in the example, there is theoretically no limit to the history you can include in the model, only practical constraints. For example, I could have created lag_lag_art, which would be art at t=k-2. I would need to add 'lag_art': 'lag_lag_art' to the dictionary in fit() if I wanted to implement this (and also include lag_lag_art in the corresponding regression models).
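As a minimal, hedged sketch of that dictionary, assuming g is a MonteCarloGFormula instance (the lags keyword follows the zEpid docs; the treatment argument shown is only illustrative):

```python
# Keys are the current variables, values are the lagged column names
# that the Monte-Carlo procedure updates at each step
g.fit(treatment="all",  # illustrative treatment plan
      lags={'art': 'lag_art',            # art at t=k-1
            'lag_art': 'lag_lag_art'})   # art at t=k-2
```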

MonteCarloGFormula follows the process described by Keil et al. 2014. I replicated the analysis described in the paper, so you can follow along with the paper's description in this notebook.

Can you tell me a little more about your data set and study question? It may help me answer your question more clearly. Depending on the data, there may be a better estimator to use. In my experience, MonteCarloGFormula works best when you are interested in a time-varying treatment/exposure with survival data.

Jnnocent commented 5 years ago

Thank you! Now I know more about this article and the code. I studied the MonteCarloGFormula using the following paper: Intervening on risk factors for coronary heart disease: an application of the parametric g-formula. On page 1601, it describes an algorithm that builds a model for each 2-year period, which is different from fitting the same model for all time points (t = {1, 2, 3, ... k}). [screenshot: algorithm description from page 1601]

On page 5 of Keil et al. 2014, it says:

the models for each covariate on day k were fit using only person-days for which the patient had not yet experienced each time-varying covariate on day k − 1

So, I feel confused about whether to fit the same model for all time points (t = {1, 2, 3, ... k}) or to fit different models for each time point.

pzivich commented 5 years ago

The Taubman paper and the Keil paper both use what is referred to as pooled logistic regression. This is the same approach that MonteCarloGFormula uses. This open-access paper has a further description of what pooled logistic regression is.

The pooled logistic regression process works by fitting a single model to the full data with a term for time (t_in). The pooled regression models are not estimated separately for each time period. In Table 2, Taubman et al. mention including time as a dummy variable in their pooled logistic models. The pooled models are fit once to the full data and include all the variables we are interested in, plus time. The Taubman data would look something like the following:

id t0 t1 Y  A L
 1  0  2 0  0 2
 1  2  4 0  1 3
 1  4  6 0  1 7
 1  6  8 1  1 1
 2  0  2 0  1 4
...

where A is the treatment, Y is the outcome, and L is a time-varying confounder. Following the Taubman paper, we would generate some lagged variables. For A, we might use something like

```python
import numpy as np

# Lag A by one row, then zero out the first row of each id so values
# are not carried over from the previous individual
df['lag_A'] = df['A'].shift(1)
df['lag_A'] = np.where(df.groupby('id').cumcount() == 0, 0, df['lag_A'])

# Repeat on lag_A to get A from two time points prior (t=k-2)
df['lag_lag_A'] = df['lag_A'].shift(1)
df['lag_lag_A'] = np.where(df.groupby('id').cumcount() == 0, 0, df['lag_lag_A'])
```

Our outcome model could then look something like 'A + lag_A + lag_lag_A + L + lag_L + lag_lag_L + C(t0)'. This would model time as a dummy variable, as described by Taubman. You can also add interaction terms with C(t0).

The pooled logistic models are different for each variable. Following Taubman, the exposure_model(), outcome_model(), and add_covariate_model() calls are all part of Step 1. Step 2 is all done by the fit() procedure, as sketched below.
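As a rough, hedged sketch of how those steps map onto the zEpid calls (formulas and column names are illustrative; check the MonteCarloGFormula docs for exact signatures):

```python
from zepid.causal.gformula import MonteCarloGFormula

# Step 1: fit pooled regression models to the observed long-format data
g = MonteCarloGFormula(df, idvar='id', exposure='A', outcome='Y',
                       time_in='t0', time_out='t1')
g.exposure_model('lag_A + L + lag_L + C(t0)')
g.add_covariate_model(label=1, covariate='L',
                      model='lag_A + lag_L + C(t0)',
                      var_type='continuous')
g.outcome_model('A + lag_A + lag_lag_A + L + lag_L + lag_lag_L + C(t0)')

# Step 2: the Monte-Carlo simulation, with the lag bookkeeping
g.fit(treatment="all",  # illustrative treatment plan
      lags={'A': 'lag_A', 'lag_A': 'lag_lag_A',
            'L': 'lag_L', 'lag_L': 'lag_lag_L'})
```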

However, MonteCarloGFormula does not currently allow multiple treatments, as described in Taubman. I may try to add that functionality in the future. Currently, only a single treatment can be assessed.

Jnnocent commented 5 years ago

I know a little about the pooled logistic regression model, and I would like to read the paper to learn more. May I ask you another question? When can we use the censoring model? I simulated a population with censored data and built a censoring model. Should we make a restriction, like restriction="g['censor']==0", in the exposure_model, the outcome_model, or the covariate_model? For estimating the covariates, we need C_k = 0, and so on. [screenshot: censoring equations from the paper]

pzivich commented 5 years ago

The censor_model() only needs to be used if you believe there is informative censoring (generally there is informative censoring). This process builds a pooled regression model for censoring based on the model specification given to the argument. In the background, it predicts censoring during the Monte-Carlo procedure, right before estimating the outcome model.
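For reference, a minimal sketch of specifying that model (recent zEpid versions name the method censoring_model(); the formula here is illustrative):

```python
# Pooled regression model for censoring; during the Monte-Carlo
# procedure, censoring is predicted right before the outcome model
g.censoring_model('A + lag_A + L + lag_L + C(t0)')
```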

As Taubman has it written, I believe there would be no restriction= arguments for censoring in any of the models. Their probabilities are conditional on not being censored at t=k for estimating the outcome at t=k+1. In how the data is set up for pooled logistic regression, if individual 1 was censored at t=3, they would not have a row in the data corresponding to t=4. I will read through the paper a little more carefully to make sure this is true.

Jnnocent commented 5 years ago

I get it! Thanks for answering in such detail! ^_^

pzivich commented 5 years ago

No problem @Jnnocent ! Got a chance to read through Taubman 2009 again. Below are some general thoughts / comments I had, mostly a discussion of what is and isn't possible using MonteCarloGFormula as of v0.7.0:

1) They talk about competing risks (Step 1c), but this is currently not available. I plan on adding it in a future update #77 but haven't settled on how to implement it. There are essentially two options: a) a pooled multinomial model, or b) a series of pooled logistic models. Taubman et al. use the second option. Currently, MonteCarloGFormula only estimates a singular outcome (i.e. no competing risks in the sample).

2) As far as I can tell, Taubman et al. use the same regression model fitting process as MonteCarloGFormula. They describe 4 different types of models: binary, history-binary, full-continuous, and zero-continuous. Binary is handled by exposure_model(), outcome_model(), and add_covariate_model(..., var_type='binary'). History-binary can be done by adding the optional restriction argument (see the sketch after this list). Continuous can be done by add_covariate_model(..., var_type='continuous'). Currently, there is no option that would cover how they fit their zero-continuous model. What I could do is add Poisson models, which would cover count data (like cigarettes per day) better.

3) Some of their treatment variables (like cigarettes per day) are continuous. Currently, MonteCarloGFormula does not support continuous treatments. I haven't figured out a clear API for how to specify continuous treatments. I am having the same problem with other g-formula implementations #49 The problem is that treatments can either shift everyone in the population (e.g., 10 fewer cigarettes per day) or set everyone to be below a threshold (everyone < 10 cigarettes per day).

4) To account for variables from two time points prior, as described in the paper, I would use a model like 'A + lag_A + lag_lag_A + L + lag_L + lag_lag_L + C(t0)' and the process I described above.

5) They mention in the simulations section of their methods that they restricted predicted values to be within the observed range. You can do this by using the recode optional argument (also in the sketch below). I use a similar process in my example: for predicted CD4 T-cell counts, I require that they be at least 1.
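To illustrate points 2 and 5, here is a hedged sketch of the restriction and recode arguments (variable names are hypothetical; the g['...'] syntax follows the convention used elsewhere in this thread):

```python
# History-binary: fit the model only among person-time where the
# covariate has not yet occurred (hypothetical variable names)
g.add_covariate_model(label=1, covariate='diabetes',
                      model='A + lag_A + C(t0)',
                      restriction="g['lag_diabetes']==0",
                      var_type='binary')

# Range restriction via recode: keep predicted CD4 counts at 1 or above
g.add_covariate_model(label=2, covariate='cd4',
                      model='A + lag_A + lag_cd4 + C(t0)',
                      recode="g['cd4'] = np.maximum(g['cd4'], 1)",
                      var_type='continuous')
```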

Hope this helps

Jnnocent commented 5 years ago

Very clear! I have a deeper understanding now, especially of the significance of the restriction argument!

Following the MonteCarloGFormula, I tested my data. There are 11 years of follow-up data, with follow-ups every 2-3 years. I then expanded the data to create person-month data. After specifying the models in Step 1, I fitted the models and generated a plot of the risk curves, as shown below. [plot: risk curves, natural course predictions vs. observed data]

Your notebook mentions:

From this we can see that our natural course predictions (green) follow the observed data pretty well (black). Note: this does not mean that our models are correctly specified. Rather it only means they may not be incorrectly specified. Sadly, there is no way to know that all our models are correctly specified… We may take some comfort that our curves largely overlap, but do not take this for granted

As you can see, my natural course predictions do not follow the observed data well. So, how do I decide whether the model is believable or not?

Also, the total incidence rates of the two populations were very close (original population: 0.07438, simulated population: 0.08934).

pzivich commented 5 years ago

Based on your risk function for the true data, it doesn't look like that many events occur. Is this correct? It looks like there are only 6 or so events. With so few events, it will be hard to estimate the Monte Carlo g-formula.

I wouldn't trust the results from the g-formula. You can try to use year data instead to see if that helps.

Also, the total incidence rates of the two populations were very close (original population: 0.07438, simulated population: 0.08934).

Can you explain this a little further? Based on the graph, it looks like original: 0.2 and simulation: 0.45.

Jnnocent commented 5 years ago

The cohort included 4208 people, 371 of whom had an event; 3837 were censored because they did not have an event during follow-up. There were about 10000 person-time records, which became 600000 after being expanded. Only 371 had an event.

I tested again using year data. [plot: risk curves using year data] It looked more normal than the plot using month data.

It seems that it was the way of expanding the data that produced such a wrong result, since the events were all assigned to the last month of the follow-up year.

pzivich commented 5 years ago

Ahh yeah, so the data for MonteCarloGFormula can only be divided down to the minimal data collection period (you can divide further, but it can be unstable, as in your data).

Something you may want to consider is using IterativeCondGFormula instead. This implementation works well for longitudinal data. Since your event times correspond to the end of the year, this may be a good option. Additionally, it only requires specification of the outcome model (you don't need treatment / time-varying confounder models). This paper describes what the iterative conditional g-formula is. I don't have a guide available for it yet, but it is on my to-do list. A rough sketch of the interface is below.
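As a rough, hedged sketch of the IterativeCondGFormula interface (it expects wide-format data; the column names and arguments here are illustrative, so check the zEpid docs for exact signatures):

```python
from zepid.causal.gformula import IterativeCondGFormula

# Wide-format data: one row per person, columns A1..A3 / L1..L3 / Y1..Y3
icgf = IterativeCondGFormula(df_wide,
                             exposures=['A1', 'A2', 'A3'],
                             outcomes=['Y1', 'Y2', 'Y3'])

# Only outcome models are required, one per time point
icgf.outcome_model(models=['A1 + L1',
                           'A2 + A1 + L2',
                           'A3 + A2 + L3'],
                   print_results=False)

# Estimated mean outcome under the always-treated plan
icgf.fit(treatments=[1, 1, 1])
```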

Jnnocent commented 5 years ago

Thank you! I would like to read the paper first! It seems simpler

Jnnocent commented 5 years ago

Hi, I read the paper on the iterative conditional g-formula and have some questions.

  1. It mentions in the Estimation: longitudinal TMLE section that

    For discrete-valued confounders, this can be written as follows: [formula screenshot from the paper]

So, in my opinion, the model can only be used when the confounders are discrete. When the confounders are continuous, we would prefer to use MonteCarloGFormula, which has an advantage in calculating the conditional probabilities of continuous covariates.

  2. Can I use another machine learning algorithm to replace the GLM? In the paper, the author introduces the Super Learner algorithm. I tested the sample data with an XGBoost model, like the following:

```python
if self.model_type == 0:
    fm = smf.glm(d + ' ~ ' + m, df, family=linkdist).fit()  # GLM
else:
    xgb = xgboost_model(self, df, d, m)  # custom XGBoost helper
```

where self.model_type is a way to indicate which model will be used. Compared with the GLM model, prior_predict will be defined as 0 or 1 by:

```python
df[prior_predict] = np.where(df[d].isna(), np.nan, xgb.predict(tf))
```

[screenshot: the resulting 0/1 predictions]

While in the GLM model, it predicts probabilities:

```python
df[prior_predict] = np.where(df[d].isna(), np.nan, fm.predict(tf))
```

[screenshot: the resulting predicted probabilities]

pzivich commented 5 years ago

When the confounders are continuous, we would prefer to use MonteCarloGFormula, which has an advantage in calculating the conditional probabilities of continuous covariates.

The iterative conditional g-formula allows for continuous confounders. I believe they say the covariates are discrete because of how they write the expression: the last cumulative product is written as Pr(...). For continuous covariates, that term would become a density f(...). In their application, they have several continuous variables they adjust for (weight, height, risk score).
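To make that concrete, here is the substitution in generic time-varying g-formula notation (a standard statement of the identity, not the paper's exact expression):

```latex
% Discrete covariates: sum over covariate histories, probability masses
E[Y_{\bar{a}}] = \sum_{\bar{l}} E[Y \mid \bar{A}=\bar{a}, \bar{L}=\bar{l}]
    \prod_{k} \Pr(L_k = l_k \mid \bar{A}_{k-1}=\bar{a}_{k-1}, \bar{L}_{k-1}=\bar{l}_{k-1})

% Continuous covariates: the sum becomes an integral and Pr(...) a density f(...)
E[Y_{\bar{a}}] = \int E[Y \mid \bar{A}=\bar{a}, \bar{L}=\bar{l}]
    \prod_{k} f(l_k \mid \bar{a}_{k-1}, \bar{l}_{k-1}) \, d\bar{l}
```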

Can I use another machine learning algorithm?

Not currently. The issue is that confidence intervals with ~~any~~ certain machine learning algorithms (like XGBoost) will be incorrect. More specifically, the confidence intervals will be overly narrow. LTMLE, which the paper focuses on, does allow for the use of machine learning algorithms. However, I have not implemented this yet. It is in the works.

For the iterative conditional g-formula, the model needs to output probabilities, not 0 / 1, for it to estimate correctly. Because of how the predict() function in sklearn works, it won't return the necessary values. For it to work correctly, I would need to add a custom model argument that uses predict_proba() instead. However, because you can't get valid confidence intervals for IterativeCondGFormula with machine learning algorithms, I don't plan on adding this option. The distinction is illustrated below.
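For reference, a small self-contained illustration of the predict() / predict_proba() distinction in scikit-learn (simulated data; nothing here is specific to zEpid):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Simulated data with a binary outcome
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + rng.normal(size=100) > 0).astype(int)

clf = LogisticRegression().fit(X, y)
print(clf.predict(X[:3]))              # class labels: 0 / 1
print(clf.predict_proba(X[:3])[:, 1])  # Pr(Y=1), what the g-formula needs
```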