pzivich / zEpid

Epidemiology analysis package
http://zepid.readthedocs.org
MIT License

SingleCrossFit `invalid value encountered in log` #169

Closed: miaow27 closed this issue 1 year ago

miaow27 commented 1 year ago

@pzivich, when using SingleCrossfitTMLE for a continuous outcome with the statsmodels Gaussian GLM family, I have encountered the following error:

xxx/lib/python3.7/site-packages/zepid/causal/doublyrobust/crossfit.py:1663: RuntimeWarning: invalid value encountered in log
  log = sm.GLM(ys, np.column_stack((h1ws, h0ws)), offset=np.log(probability_to_odds(py_os)),
xxx/lib/python3.7/site-packages/zepid/causal/doublyrobust/crossfit.py:1669: RuntimeWarning: invalid value encountered in log
  ystar0 = np.append(ystar0, logistic.cdf(np.log(probability_to_odds(py_ns)) - epsilon[1] / pa0s))

Here is how I defined the estimators for the super learner, along with the parameter inputs.

link_i = sm.genmod.families.links.identity()
SL_glm = GLMSL(family=sm.families.family.Gaussian(link=link_i))
SL_logit = GLMSL(family=sm.families.family.Binomial())

sctmle = SingleCrossfitTMLE(dataset=df, exposure='treatment', outcome='paid_amt', continuous_bound=0.01)
sctmle.exposure_model('gender_cd_F + prospective_risk + age_nbr', SL_logit, bound=[0.01, 0.99])
sctmle.outcome_model('gender_cd_F + prospective_risk + age_nbr', SL_glm)
sctmle.fit(n_splits=2, n_partitions=3, random_state=12345, method='median')
sctmle.summary()

If I use any other ML estimator for the outcome model, such as Lasso, GBM, or RandomForest from sklearn, it works fine. The error is only related to the use of the GLMSL family.

Could you share any ideas about the cause of this error and how I can fix this issue? Much appreciated!

pzivich commented 1 year ago

I'll have to look into what's happening here. As a simple way to sidestep the issue, have you tried scikit-learn's LinearRegression? That should be the same thing as GLMSL with the Gaussian family. If the rest of sklearn works fine, this should be the quickest fix until I figure out what's happening here.

The original motivation for GLMSL was to get around the automatic penalization that scikit-learn's LogisticRegression does (since I find tuning the C parameter to be inconsistent).
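A rough sketch of that swap, reusing the setup from your post (only the outcome model line changes):

from sklearn.linear_model import LinearRegression

sctmle = SingleCrossfitTMLE(dataset=df, exposure='treatment', outcome='paid_amt', continuous_bound=0.01)
sctmle.exposure_model('gender_cd_F + prospective_risk + age_nbr',
                      GLMSL(family=sm.families.family.Binomial()), bound=[0.01, 0.99])
# unpenalized sklearn regressor in place of GLMSL with the Gaussian family
sctmle.outcome_model('gender_cd_F + prospective_risk + age_nbr', LinearRegression())
sctmle.fit(n_splits=2, n_partitions=3, random_state=12345, method='median')
sctmle.summary()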

pzivich commented 1 year ago

It looks like a 'bad' value is being returned by GLMSL somehow, which then passes an np.nan to the targeting step. That might be some bad fitting on the GLM's part? I will see if I can replicate it in the next week.
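For reference, the warning itself is just np.log being handed a nan or non-positive value after the probability-to-odds conversion. A toy illustration (not the package code, with probability_to_odds written out as the usual p/(1-p)):

import numpy as np

def probability_to_odds(prob):
    # usual p / (1 - p) conversion, applied before np.log in crossfit.py
    return prob / (1 - prob)

# hypothetical outcome-model predictions: one valid, one nan, one outside (0, 1)
py_os = np.array([0.2, np.nan, -0.05])
offset = np.log(probability_to_odds(py_os))  # RuntimeWarning: invalid value encountered in log
# offset -> [-1.386..., nan, nan]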

miaow27 commented 1 year ago

Awesome! I dug further into different scenarios where I fit only one estimator at a time for both the exposure model and the outcome model using SingleCrossfit. The issue seems to come from the outcome model when using non-penalized regression (LinearRegression from sklearn or sm.families.Gaussian) as well as gradient boosting machines.

My outcome is highly skewed to the right (distribution shown in the attached image).

However, I have tried capping the outcome at the 90th percentile and bounding it to 0-1; it still throws the same error.
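Roughly what I tried, as a sketch on the same data frame:

# cap the outcome at its 90th percentile (winsorize the right tail)
cap = df['paid_amt'].quantile(0.90)
df['paid_amt_capped'] = df['paid_amt'].clip(upper=cap)

# or rescale the outcome to the 0-1 range before fitting
y = df['paid_amt']
df['paid_amt_01'] = (y - y.min()) / (y.max() - y.min())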

See detailed information below.

Detailed Estimator Definitions

if outcome_binary:
    # classifiers for a binary outcome
    SL_emp = EmpiricalMeanSL()
    SL_glm = LogisticRegression(penalty='none', tol=1e-3, solver='saga', max_iter=1000)
    SL_lasso = LogisticRegression(penalty='l1', C=1/lasso_alpha, tol=1e-3, solver='saga', max_iter=1000)
    SL_gmsl = GLMSL(family=sm.families.family.Binomial())
    SL_rf = RandomForestClassifier(n_estimators=100, min_samples_split=5, min_samples_leaf=0.2)
    SL_gbm = GradientBoostingClassifier(n_estimators=100, min_samples_split=5, min_samples_leaf=0.2)
else:
    # regressors for a continuous outcome
    SL_emp = EmpiricalMeanSL()
    SL_glm = LinearRegression()
    SL_lasso = Lasso(alpha=0.1)
    link_i = sm.genmod.families.links.identity()
    SL_gmsl = GLMSL(family=sm.families.family.Gaussian(link=link_i))
    SL_rf = RandomForestRegressor(n_estimators=100, min_samples_split=5, min_samples_leaf=0.2)
    SL_gbm = GradientBoostingRegressor(n_estimators=100, min_samples_split=5, min_samples_leaf=0.2)

Result and Error

Result

It seems that both the sklearn and the statsmodels versions of (non-penalized) regression fail, as does GBM. However, Lasso is okay.

ps_model: SL_emp, outcome_model: SL_emp, ATE: 299.7524716151026
ps_model: SL_emp, outcome_model: SL_glm, ATE: nan -> showing the below error
ps_model: SL_emp, outcome_model: SL_lasso, ATE: 299.1976197069547
ps_model: SL_emp, outcome_model: SL_gmsl, ATE: nan -> showing the below error
ps_model: SL_emp, outcome_model: SL_rf, ATE: 94.01894562905046
ps_model: SL_emp, outcome_model: SL_gbm, ATE: nan -> showing the below error

ps_model: SL_glm, outcome_model: SL_emp, ATE: 147.88143010969998
ps_model: SL_glm, outcome_model: SL_glm, ATE: nan -> showing the below error
ps_model: SL_glm, outcome_model: SL_lasso, ATE: 147.93800377478118
ps_model: SL_glm, outcome_model: SL_gmsl, ATE: nan -> showing the below error
ps_model: SL_glm, outcome_model: SL_rf, ATE: 76.54026004386105
ps_model: SL_glm, outcome_model: SL_gbm, ATE: nan -> showing the below error

ps_model: SL_lasso, outcome_model: SL_emp, ATE: 148.62888266058008
ps_model: SL_lasso, outcome_model: SL_glm, ATE: nan -> showing the below error
ps_model: SL_lasso, outcome_model: SL_lasso, ATE: 148.65439623543404
ps_model: SL_lasso, outcome_model: SL_gmsl, ATE: nan -> showing the below error
ps_model: SL_lasso, outcome_model: SL_rf, ATE: 76.49970421812958
ps_model: SL_lasso, outcome_model: SL_gbm, ATE: nan -> showing the below error

ps_model: SL_gmsl, outcome_model: SL_emp, ATE: 104.65699066369935
ps_model: SL_gmsl, outcome_model: SL_glm, ATE: nan -> showing the below error
ps_model: SL_gmsl, outcome_model: SL_lasso, ATE: 104.65699066369935
ps_model: SL_gmsl, outcome_model: SL_gmsl, ATE: nan -> showing the below error
ps_model: SL_gmsl, outcome_model: SL_rf, ATE: 80.77866133286848
ps_model: SL_gmsl, outcome_model: SL_gbm, ATE: nan -> showing the below error

ps_model: SL_rf, outcome_model: SL_emp, ATE: 91.54591473217262
ps_model: SL_rf, outcome_model: SL_glm, ATE: nan -> showing the below error
ps_model: SL_rf, outcome_model: SL_lasso, ATE: 91.54591473217262
ps_model: SL_rf, outcome_model: SL_gmsl, ATE: nan -> showing the below error
ps_model: SL_rf, outcome_model: SL_rf, ATE: 76.42024750089442
ps_model: SL_rf, outcome_model: SL_gbm, ATE: nan -> showing the below error

ps_model: SL_gbm, outcome_model: SL_emp, ATE: 50.585009347615674
ps_model: SL_gbm, outcome_model: SL_glm, ATE: nan -> showing the below error
ps_model: SL_gbm, outcome_model: SL_lasso, ATE: 50.585009347615674
ps_model: SL_gbm, outcome_model: SL_gmsl, ATE: nan -> showing the below error
ps_model: SL_gbm, outcome_model: SL_rf, ATE: 58.8199678774994
ps_model: SL_gbm, outcome_model: SL_gbm, ATE: nan -> showing the below error

Error Message

Each line represents a test case for getting the ATE for the continuous outcome (everything else is the same). When the ATE is nan, it comes with the error messages below.

The warnings repeat three times because I set n_partitions = 3.

I also notice the errors occur at lines 1663, 1669, 1671, and 1668 in crossfit.py, within the DoubleCrossfitTMLE class. However, I only called SingleCrossfitTMLE.

/users/.conda/envs/cobra_dev/lib/python3.7/site-packages/zepid/causal/doublyrobust/crossfit.py:1663: RuntimeWarning: invalid value encountered in log
  log = sm.GLM(ys, np.column_stack((h1ws, h0ws)), offset=np.log(probability_to_odds(py_os)),
/users/.conda/envs/cobra_dev/lib/python3.7/site-packages/zepid/causal/doublyrobust/crossfit.py:1669: RuntimeWarning: invalid value encountered in log
  ystar0 = np.append(ystar0, logistic.cdf(np.log(probability_to_odds(py_ns)) - epsilon[1] / pa0s))
/users/.conda/envs/cobra_dev/lib/python3.7/site-packages/zepid/causal/doublyrobust/crossfit.py:1671: RuntimeWarning: invalid value encountered in log
  offset=np.log(probability_to_odds(py_os))))
/users/.conda/envs/cobra_dev/lib/python3.7/site-packages/zepid/causal/doublyrobust/crossfit.py:1668: RuntimeWarning: invalid value encountered in log
  ystar1 = np.append(ystar1, logistic.cdf(np.log(probability_to_odds(py_as)) + epsilon[0] / pa1s))
/users/.conda/envs/cobra_dev/lib/python3.7/site-packages/zepid/causal/doublyrobust/crossfit.py:1663: RuntimeWarning: invalid value encountered in log
  log = sm.GLM(ys, np.column_stack((h1ws, h0ws)), offset=np.log(probability_to_odds(py_os)),
/users/.conda/envs/cobra_dev/lib/python3.7/site-packages/zepid/causal/doublyrobust/crossfit.py:1669: RuntimeWarning: invalid value encountered in log
  ystar0 = np.append(ystar0, logistic.cdf(np.log(probability_to_odds(py_ns)) - epsilon[1] / pa0s))
/users/a474369/.conda/envs/cobra_dev/lib/python3.7/site-packages/zepid/causal/doublyrobust/crossfit.py:1671: RuntimeWarning: invalid value encountered in log
  offset=np.log(probability_to_odds(py_os))))
/users/.conda/envs/cobra_dev/lib/python3.7/site-packages/zepid/causal/doublyrobust/crossfit.py:1668: RuntimeWarning: invalid value encountered in log
  ystar1 = np.append(ystar1, logistic.cdf(np.log(probability_to_odds(py_as)) + epsilon[0] / pa1s))
/users/.conda/envs/cobra_dev/lib/python3.7/site-packages/zepid/causal/doublyrobust/crossfit.py:1663: RuntimeWarning: invalid value encountered in log
  log = sm.GLM(ys, np.column_stack((h1ws, h0ws)), offset=np.log(probability_to_odds(py_os)),
/users/.conda/envs/cobra_dev/lib/python3.7/site-packages/zepid/causal/doublyrobust/crossfit.py:1668: RuntimeWarning: invalid value encountered in log
  ystar1 = np.append(ystar1, logistic.cdf(np.log(probability_to_odds(py_as)) + epsilon[0] / pa1s))
/users/.conda/envs/cobra_dev/lib/python3.7/site-packages/zepid/causal/doublyrobust/crossfit.py:1669: RuntimeWarning: invalid value encountered in log
  ystar0 = np.append(ystar0, logistic.cdf(np.log(probability_to_odds(py_ns)) - epsilon[1] / pa0s))
/users/.conda/envs/cobra_dev/lib/python3.7/site-packages/zepid/causal/doublyrobust/crossfit.py:1671: RuntimeWarning: invalid value encountered in log
  offset=np.log(probability_to_odds(py_os))))

I hope this information provides some insight into why SingleCrossfit failed!

miaow27 commented 1 year ago

> I'll have to look into what's happening here. As a simple way to sidestep the issue, have you tried scikit-learn's LinearRegression? That should be the same thing as GLMSL with the Gaussian family. If the rest of sklearn works fine, this should be the quickest fix until I figure out what's happening here.
>
> The original motivation for GLMSL was to get around the automatic penalization that scikit-learn's LogisticRegression does (since I find tuning the C parameter to be inconsistent).

I think this particular error tends to occur for GLMSL, or even for LASSO from sklearn when alpha is not large enough.

pzivich commented 1 year ago

Yeah, it sounds like you are running into sparse data, so GLMSL returns nan because that's what statsmodels GLM returns. Those nans get fed to the targeting model (another GLM, the one called out in the warning).

So, it sounds like a specific sparsity issue with the data (not with something in GLMSL itself). A penalized model (like you used) is probably the way to fix it. Random forests implicitly do something like that, so that also explains why they work here.
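If you want to confirm that, a quick check outside of zEpid (just a sketch, using the covariates and outcome from your post) is to fit the same statsmodels GLM directly and look for nan in the coefficients or predictions:

import numpy as np
import statsmodels.api as sm

X = sm.add_constant(df[['gender_cd_F', 'prospective_risk', 'age_nbr']])
fit = sm.GLM(df['paid_amt'], X, family=sm.families.Gaussian()).fit()

print(np.isnan(fit.params).any())      # True would mean the GLM itself produced nan coefficients
print(np.isnan(fit.predict(X)).any())  # nan predictions are what reach the targeting step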

miaow27 commented 1 year ago

This is a very helpful insight! Log-transforming the outcome would be ideal, but then I would not get the ATE, only a ratio. A random forest with parameters that prevent overfitting, or heavier penalization, might be the way to go...

Thank you!

pzivich commented 1 year ago

Personally, I would recommend using SuperLearner with a LASSO, Ridge, and Random Forest (and maybe some other algorithms that regularize). That should keep you on the ATE, and all of those are regularized, so they should avoid this issue.

My recommendation to include LASSO + Ridge is based on some observations that random forests can be a little unpredictable when used alone in finite samples (also see "Demystifying Statistical Inference When Using Machine Learning in Causal Research"). Essentially, you want to prevent the random forest from overfitting if you can.
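A sketch of what I mean for the continuous outcome (I'm assuming the zepid.superlearner SuperLearner constructor takes a list of estimators plus labels; double-check the exact signature against the docs):

from sklearn.linear_model import Lasso, Ridge
from sklearn.ensemble import RandomForestRegressor
from zepid.superlearner import SuperLearner

# candidate learners that all regularize in some way
candidates = [Lasso(alpha=0.1),
              Ridge(alpha=1.0),
              RandomForestRegressor(n_estimators=500, min_samples_leaf=20)]
labels = ["lasso", "ridge", "random_forest"]

sl = SuperLearner(candidates, labels, folds=10)  # assumed constructor arguments
sctmle.outcome_model('gender_cd_F + prospective_risk + age_nbr', sl)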

miaow27 commented 1 year ago

This is an extremely helpful suggestion! Thank you so much!