Closed miaow27 closed 2 years ago
I'll have to look into what's happening here. As a simple way to sidestep the issue, have you tried scikit-learn's LinearRegression? That should be the same thing as GLMSL with the Gaussian family. If the rest of sklearn works fine, this should be the quickest fix until I figure out what's happening here.
The original motivation for GLMSL was to get around the automatic penalization that scikit-learn's LogisticRegression does (since I find tuning the C parameter to be inconsistent).
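To illustrate why ordinary least squares and a Gaussian GLM with an identity link should give the same fit, here is a minimal NumPy-only sketch. The functions `ols` and `gaussian_glm_irls` are hypothetical helpers written for this illustration; they are not the code used by GLMSL, statsmodels, or scikit-learn. With the Gaussian family and identity link, the IRLS working response is just the outcome and the working weights are constant, so each IRLS step reduces to the least-squares solution.

```python
import numpy as np


def ols(X, y):
    """Ordinary least squares coefficients via a least-squares solve."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


def gaussian_glm_irls(X, y, n_iter=5):
    """IRLS for a GLM with Gaussian family and identity link (illustrative only)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta          # linear predictor
        mu = eta                # identity link: mean equals linear predictor
        z = eta + (y - mu)      # working response (link derivative is 1), so z == y
        w = np.ones_like(y)     # working weights are constant for this family/link
        XtW = X.T * w
        beta = np.linalg.solve(XtW @ X, XtW @ z)
    return beta


# Simulated data with a right-skewed error term (purely illustrative)
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
y = X @ np.array([1.0, 2.0, -3.0]) + rng.exponential(size=200)

beta_ols = ols(X, y)
beta_glm = gaussian_glm_irls(X, y)
same_fit = np.allclose(beta_ols, beta_glm)
```

If the two estimators agree on a dataset but one of them returns nan inside the cross-fit procedure, that points toward the data or the downstream step rather than the estimator itself.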
From the warning, it looks like a 'bad' value is somehow being returned by GLMSL, which then passes an np.nan to the targeting step. That might be some bad fitting on the GLM's part? I will see if I can replicate it in the next week.
Awesome! I dug further into different scenarios, fitting only one estimator at a time for both the exposure model and the outcome model using SingleCrossfitTMLE. The error seems to come from the outcome model when it is a non-penalized regression (LinearRegression from sklearn or sm.families.Gaussian), and also when it is a gradient boosting machine.
My outcome is highly skewed to the right as below:
However, I have tried capping the outcome at 90% and bounding it to 0-1; it still throws the same error.
See the detailed information below.
import statsmodels.api as sm
from sklearn.ensemble import (GradientBoostingClassifier, GradientBoostingRegressor,
                              RandomForestClassifier, RandomForestRegressor)
from sklearn.linear_model import Lasso, LinearRegression, LogisticRegression
from zepid.superlearner import EmpiricalMeanSL, GLMSL

# outcome_binary and lasso_alpha are set earlier (not shown)
if outcome_binary:
    # Candidate estimators for a binary outcome
    SL_emp = EmpiricalMeanSL()
    SL_glm = LogisticRegression(penalty='none', tol=1e-3, solver='saga', max_iter=1000)
    SL_lasso = LogisticRegression(penalty='l1', C=1/lasso_alpha, tol=1e-3, solver='saga', max_iter=1000)
    SL_gmsl = GLMSL(family=sm.families.family.Binomial())
    SL_rf = RandomForestClassifier(n_estimators=100, min_samples_split=5, min_samples_leaf=0.2)
    SL_gbm = GradientBoostingClassifier(n_estimators=100, min_samples_split=5, min_samples_leaf=0.2)
else:
    # Candidate estimators for a continuous outcome
    SL_emp = EmpiricalMeanSL()
    SL_glm = LinearRegression()
    SL_lasso = Lasso(alpha=0.1)
    link_i = sm.genmod.families.links.identity()
    SL_gmsl = GLMSL(family=sm.families.family.Gaussian(link=link_i))
    SL_rf = RandomForestRegressor(n_estimators=100, min_samples_split=5, min_samples_leaf=0.2)
    SL_gbm = GradientBoostingRegressor(n_estimators=100, min_samples_split=5, min_samples_leaf=0.2)
It seems that both the sklearn version and the statsmodels version of unpenalized regression fail, as does GBM. However, Lasso is okay.
ps_model: SL_emp, outcome_model: SL_emp, ATE: 299.7524716151026
ps_model: SL_emp, outcome_model: SL_glm, ATE: nan -> showing the below error
ps_model: SL_emp, outcome_model: SL_lasso, ATE: 299.1976197069547
ps_model: SL_emp, outcome_model: SL_gmsl, ATE: nan -> showing the below error
ps_model: SL_emp, outcome_model: SL_rf, ATE: 94.01894562905046
ps_model: SL_emp, outcome_model: SL_gbm, ATE: nan -> showing the below error
ps_model: SL_glm, outcome_model: SL_emp, ATE: 147.88143010969998
ps_model: SL_glm, outcome_model: SL_glm, ATE: nan -> showing the below error
ps_model: SL_glm, outcome_model: SL_lasso, ATE: 147.93800377478118
ps_model: SL_glm, outcome_model: SL_gmsl, ATE: nan -> showing the below error
ps_model: SL_glm, outcome_model: SL_rf, ATE: 76.54026004386105
ps_model: SL_glm, outcome_model: SL_gbm, ATE: nan -> showing the above error
ps_model: SL_lasso, outcome_model: SL_emp, ATE: 148.62888266058008
ps_model: SL_lasso, outcome_model: SL_glm, ATE: nan -> showing the below error
ps_model: SL_lasso, outcome_model: SL_lasso, ATE: 148.65439623543404
ps_model: SL_lasso, outcome_model: SL_gmsl, ATE: nan -> showing the below error
ps_model: SL_lasso, outcome_model: SL_rf, ATE: 76.49970421812958
ps_model: SL_lasso, outcome_model: SL_gbm, ATE: nan -> showing the below error
ps_model: SL_gmsl, outcome_model: SL_emp, ATE: 104.65699066369935
ps_model: SL_gmsl, outcome_model: SL_glm, ATE: nan -> showing the below error
ps_model: SL_gmsl, outcome_model: SL_lasso, ATE: 104.65699066369935
ps_model: SL_gmsl, outcome_model: SL_gmsl, ATE: nan -> showing the below error
ps_model: SL_gmsl, outcome_model: SL_rf, ATE: 80.77866133286848
ps_model: SL_gmsl, outcome_model: SL_gbm, ATE: nan -> showing the below error
ps_model: SL_rf, outcome_model: SL_emp, ATE: 91.54591473217262
ps_model: SL_rf, outcome_model: SL_glm, ATE: nan -> showing the below error
ps_model: SL_rf, outcome_model: SL_lasso, ATE: 91.54591473217262
ps_model: SL_rf, outcome_model: SL_gmsl, ATE: nan -> showing the below error
ps_model: SL_rf, outcome_model: SL_rf, ATE: 76.42024750089442
ps_model: SL_rf, outcome_model: SL_gbm, ATE: nan -> showing the below error
ps_model: SL_gbm, outcome_model: SL_emp, ATE: 50.585009347615674
ps_model: SL_gbm, outcome_model: SL_glm, ATE: nan -> showing the below error
ps_model: SL_gbm, outcome_model: SL_lasso, ATE: 50.585009347615674
ps_model: SL_gbm, outcome_model: SL_gmsl, ATE: nan -> showing the below error
ps_model: SL_gbm, outcome_model: SL_rf, ATE: 58.8199678774994
ps_model: SL_gbm, outcome_model: SL_gbm, ATE: nan -> showing the below error
Each line represents a test case for estimating the ATE for a continuous outcome (everything else is the same).
When the ATE is nan, it comes with the error message below.
The warnings repeat 3 times because I defined partition = 3.
I also noticed that the errors occur at lines 1663, 1668, 1669, and 1671 in crossfit.py, within the DoubleCrossfitTMLE class.
However, I only called SingleCrossfitTMLE.
/users/.conda/envs/cobra_dev/lib/python3.7/site-packages/zepid/causal/doublyrobust/crossfit.py:1663: RuntimeWarning: invalid value encountered in log
log = sm.GLM(ys, np.column_stack((h1ws, h0ws)), offset=np.log(probability_to_odds(py_os)),
/users/.conda/envs/cobra_dev/lib/python3.7/site-packages/zepid/causal/doublyrobust/crossfit.py:1669: RuntimeWarning: invalid value encountered in log
ystar0 = np.append(ystar0, logistic.cdf(np.log(probability_to_odds(py_ns)) - epsilon[1] / pa0s))
/users/.conda/envs/cobra_dev/lib/python3.7/site-packages/zepid/causal/doublyrobust/crossfit.py:1671: RuntimeWarning: invalid value encountered in log
offset=np.log(probability_to_odds(py_os))))
/users/.conda/envs/cobra_dev/lib/python3.7/site-packages/zepid/causal/doublyrobust/crossfit.py:1668: RuntimeWarning: invalid value encountered in log
ystar1 = np.append(ystar1, logistic.cdf(np.log(probability_to_odds(py_as)) + epsilon[0] / pa1s))
/users/.conda/envs/cobra_dev/lib/python3.7/site-packages/zepid/causal/doublyrobust/crossfit.py:1663: RuntimeWarning: invalid value encountered in log
log = sm.GLM(ys, np.column_stack((h1ws, h0ws)), offset=np.log(probability_to_odds(py_os)),
/users/.conda/envs/cobra_dev/lib/python3.7/site-packages/zepid/causal/doublyrobust/crossfit.py:1669: RuntimeWarning: invalid value encountered in log
ystar0 = np.append(ystar0, logistic.cdf(np.log(probability_to_odds(py_ns)) - epsilon[1] / pa0s))
/users/a474369/.conda/envs/cobra_dev/lib/python3.7/site-packages/zepid/causal/doublyrobust/crossfit.py:1671: RuntimeWarning: invalid value encountered in log
offset=np.log(probability_to_odds(py_os))))
/users/.conda/envs/cobra_dev/lib/python3.7/site-packages/zepid/causal/doublyrobust/crossfit.py:1668: RuntimeWarning: invalid value encountered in log
ystar1 = np.append(ystar1, logistic.cdf(np.log(probability_to_odds(py_as)) + epsilon[0] / pa1s))
/users/.conda/envs/cobra_dev/lib/python3.7/site-packages/zepid/causal/doublyrobust/crossfit.py:1663: RuntimeWarning: invalid value encountered in log
log = sm.GLM(ys, np.column_stack((h1ws, h0ws)), offset=np.log(probability_to_odds(py_os)),
/users/.conda/envs/cobra_dev/lib/python3.7/site-packages/zepid/causal/doublyrobust/crossfit.py:1668: RuntimeWarning: invalid value encountered in log
ystar1 = np.append(ystar1, logistic.cdf(np.log(probability_to_odds(py_as)) + epsilon[0] / pa1s))
/users/.conda/envs/cobra_dev/lib/python3.7/site-packages/zepid/causal/doublyrobust/crossfit.py:1669: RuntimeWarning: invalid value encountered in log
ystar0 = np.append(ystar0, logistic.cdf(np.log(probability_to_odds(py_ns)) - epsilon[1] / pa0s))
/users/.conda/envs/cobra_dev/lib/python3.7/site-packages/zepid/causal/doublyrobust/crossfit.py:1671: RuntimeWarning: invalid value encountered in log
offset=np.log(probability_to_odds(py_os))))
I hope this information provides some insight into why SingleCrossfitTMLE failed!
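For readers following the warnings above: every flagged line takes `np.log` of an odds value. The sketch below shows one way that expression can yield nan, namely when a predicted value falls outside the interval (0, 1), so that the odds become negative. It also shows that nan inputs simply propagate. The helper `probability_to_odds` here is a local stand-in defined for this example (assumed to compute p / (1 - p)), and the numbers are made up. Whether this mechanism is what happens with the data in this thread is not established here.

```python
import numpy as np


def probability_to_odds(p):
    """Local stand-in for the helper named in the warnings: p / (1 - p)."""
    return p / (1 - p)


# Hypothetical outcome-model predictions on a 0-1 bounded scale.
# The first and last values fall outside (0, 1), as an unbounded
# regression could produce.
py = np.array([-0.05, 0.2, 0.9, 1.1])

# Suppress the RuntimeWarning so the example runs quietly
with np.errstate(invalid="ignore", divide="ignore"):
    log_odds = np.log(probability_to_odds(py))

# Negative odds have no real logarithm, so those entries are nan
nan_mask = np.isnan(log_odds)

# nan predictions (for example from a failed fit) propagate the same way
with np.errstate(invalid="ignore"):
    log_odds_from_nan = np.log(probability_to_odds(np.array([np.nan, 0.5])))

# Clipping predictions into (0, 1) before the transform avoids the nan
# for out-of-range values (it does not repair predictions that are already nan)
eps = 1e-6
clipped_log_odds = np.log(probability_to_odds(np.clip(py, eps, 1 - eps)))
```

A quick check along these lines is to inspect the outcome model's predictions for nan values or values outside the bounded range before they reach the targeting step.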
I think GLMSL, or even LASSO from sklearn when alpha is not large enough, tends to produce this particular error.
Yeah, it sounds like you are running into sparse data, so GLMSL returns nan because that's what statsmodels.GLM returns. Those nan values get fed to the targeting model (another GLM, the one called out in the warning).
So, it sounds like a specific sparsity issue coming up with the data (not with something in GLMSL itself). A penalized model (like you used) is probably the way to fix it. Random forests implicitly do something like that, which also explains why they work here.
This is a very helpful insight! Log-transforming the outcome would be ideal, but then I would not be able to get the ATE, only a ratio. A random forest with parameters that prevent overfitting, or a model with a large penalty, might be the way to go...
Thank you!
Personally, I would recommend using SuperLearner with a LASSO, Ridge, and Random Forest (and maybe some other algorithms that regularize). That should keep you on the ATE, and all of those are regularized, so they should avoid this issue.
My recommendation to include LASSO + Ridge is based on some observations that random forests can be a little unpredictable when used alone in finite samples (also see "Demystifying Statistical Inference When Using Machine Learning in Causal Research"). Essentially, you want to prevent overfitting by the random forest if you can.
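As a rough sketch of the kind of ensemble described above, the example below stacks a LASSO, a Ridge, and a random forest on a simulated right-skewed outcome. Note the substitution: it uses scikit-learn's `StackingRegressor` as a stand-in for zEpid's SuperLearner, so it shows the idea of combining regularized learners rather than the exact estimator that would be passed to SingleCrossfitTMLE. The data, learner labels, and tuning values are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LassoCV, RidgeCV

# Simulated covariates and a right-skewed continuous outcome (illustrative only)
rng = np.random.default_rng(0)
n = 300
X = rng.normal(size=(n, 4))
y = np.exp(1.0 + 0.5 * X[:, 0] + rng.normal(scale=0.5, size=n))

# Regularized candidate learners; penalties are chosen by cross-validation,
# and the forest is constrained through min_samples_leaf to limit overfitting
learners = [
    ("lasso", LassoCV(cv=3)),
    ("ridge", RidgeCV(alphas=np.logspace(-3, 3, 13))),
    ("rf", RandomForestRegressor(n_estimators=50, min_samples_leaf=10, random_state=0)),
]

# Stacking combines the learners' cross-validated predictions through a
# final estimator (scikit-learn's default final estimator is RidgeCV)
stack = StackingRegressor(estimators=learners, cv=3)
stack.fit(X, y)
preds = stack.predict(X)
```

Choosing the penalties by cross-validation, rather than fixing a single alpha, sidesteps the situation mentioned earlier in the thread where a LASSO with too small an alpha still runs into the error.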
This is an extremely helpful suggestion! Thank you so much!
@pzivich, when using SingleCrossfitTMLE for a continuous outcome with the statsmodels Gaussian GLM (via GLMSL), I encountered the following error:
Here is how I defined the estimators for the SuperLearner, as well as the parameter inputs.
If I use any other ML estimator from sklearn, such as Lasso, GBM, or RandomForest, as the outcome model estimator, it works fine. The error only relates to the use of the GLMSL family.
Could you share any ideas about the cause of this error and how I can fix it? Much appreciated!