py-why / EconML

ALICE (Automated Learning and Intelligence for Causation and Economics) is a Microsoft Research project aimed at applying Artificial Intelligence concepts to economic decision making. One of its goals is to build a toolkit that combines state-of-the-art machine learning techniques with econometrics in order to bring automation to complex causal inference problems. To date, the ALICE Python SDK (econml) implements orthogonal machine learning algorithms such as the double machine learning work of Chernozhukov et al. This toolkit is designed to measure the causal effect of some treatment variable(s) t on an outcome variable y, controlling for a set of features x.
https://www.microsoft.com/en-us/research/project/alice/

Results differ between versions '0.8.1' and '0.9.0b1' #387

Open gcasamat opened 3 years ago

gcasamat commented 3 years ago

I have upgraded econml from '0.8.1' to '0.9.0b1' (pip install -e git+https://github.com/microsoft/EconML.git@master#egg=econml) and changed nothing in my code. However, the results I obtain are very different when fitting the following estimator:

est_forest_gb = ForestDML(
                            model_y = GradientBoostingRegressor(n_estimators = 200, random_state = 42), 
                            model_t = GradientBoostingRegressor(n_estimators = 200, random_state = 42), 
                            n_estimators = 200,
                            n_crossfit_splits = [(fold0, fold1), (fold1, fold0)],
                            random_state = 123)

Do you have an idea of what could explain the difference? Thanks.

vsyrgkanis commented 3 years ago

The base code for this estimator has changed significantly and is now based on a new native generalized random forest module.

Results should still be qualitatively the same.

If they differ, then most probably you have small unexplained variation in the treatment.

I would check the reported confidence intervals.

If you give us a synthetic dataset where the behavior is very different we can see if this is due to some bug in the new code.

also feel free to paste some examples of this difference.

vsyrgkanis commented 3 years ago

Check out this notebook https://github.com/microsoft/EconML/blob/master/notebooks/Generalized%20Random%20Forests.ipynb

ForestDML is essentially deprecated in favor of CausalForestDML; when you call ForestDML you’ll see a warning and a CausalForestDML will be initialized internally.
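For reference, a minimal sketch of the import change (assuming the 0.9 module layout, where both estimators live in econml.dml):

# in 0.9+, import and use CausalForestDML directly instead of the deprecated ForestDML
from econml.dml import CausalForestDML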

vsyrgkanis commented 3 years ago

Off the top of my head, one constraint that the new CausalForestDML imposes that was not imposed by ForestDML is balancedness. By default the new code requires that at least 5% of the samples in each node belong to each child, so it ignores imbalanced splits.

You could try calling CausalForestDML directly with your parameters, but setting min_balancedness_tol=0.5.

This would not impose any balancedness constraint.

The other thing that has changed is the default setting of subsample_fr when you leave it at "auto".

You can check the previous choice of subsample_fr on your data by accessing the subsample_fr_ attribute after you fit the model using the 0.8.1 release.

Then in 0.9 pass that value to CausalForestDML as max_samples=subsample_fr_ / 2.

vsyrgkanis commented 3 years ago

Even easier: the previous version was using the following formula for subsample_fr:

subsample_fr = (n_samples / 2) ** (1 - 1 / (2 * n_features + 2)) / (n_samples / 2)

So you can calculate this yourself and call the 0.9 CausalForestDML with max_samples=subsample_fr / 2 and min_balancedness_tol=0.5,

and the remainder of the parameters you were already passing.

this should have almost identical behavior.

If that's the case, we'll consider making some changes to better align the old and new behavior as we transition.
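For concreteness, a minimal sketch of that mapping (model_y and model_t stand in for the first-stage models you are already using, and X is the feature matrix passed to the estimator):

n_samples, n_features = X.shape
subsample_fr = (n_samples / 2) ** (1 - 1 / (2 * n_features + 2)) / (n_samples / 2)

est = CausalForestDML(model_y=model_y,
                      model_t=model_t,
                      n_estimators=200,
                      max_samples=subsample_fr / 2,   # reproduce the old subsampling fraction
                      min_balancedness_tol=0.5,       # relax the new balancedness constraint
                      random_state=123)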

gcasamat commented 3 years ago

Thanks a lot. I will check the results with these parameter adjustments.

gcasamat commented 3 years ago

Just a clarification: n_features means the number of columns of the matrix X, right?

vsyrgkanis commented 3 years ago

Yeap

gcasamat commented 3 years ago

I ran some tests with the parameters you indicated yesterday. I first fitted my estimator with the "old" code (econml '0.8.1'):

best_algo = ForestDML(
                    model_y = GradientBoostingRegressor(n_estimators = 200, random_state = 42), 
                    model_t = GradientBoostingRegressor(n_estimators = 200, random_state = 42), 
                    n_estimators = 200,
                    n_crossfit_splits = [(fold0, fold1), (fold1, fold0)],
                    random_state = 123)

best_algo.fit(Y.values.ravel(), T.values.ravel(), X = X[product_id_varlist+['year','week']], W = W, inference = 'auto')
results = best_algo.const_marginal_effect_inference(X[product_id_varlist+['year','week']])

print(results.population_summary(decimals = 4))

Final estimate at the product, year, week level
                Uncertainty of Mean Point Estimate               
=================================================================
mean_point stderr_mean  zstat  pvalue ci_mean_lower ci_mean_upper
-----------------------------------------------------------------
   -0.0251      0.0136 -1.8405 0.0657       -0.0475       -0.0027
      Distribution of Point Estimate     
=========================================
std_point pct_point_lower pct_point_upper
-----------------------------------------
   0.0088         -0.0353           -0.01
     Total Variance of Point Estimate     
==========================================
stderr_point ci_point_lower ci_point_upper
------------------------------------------
      0.0162        -0.0475        -0.0003
------------------------------------------

Note: The stderr_mean is a conservative upper bound.

The histogram of the point estimates is attached as an image.

I then repeated the same operations, but with the "new" code (econml '0.9.0b1'). The only changes are: subsample_fr_ = (X.shape[0]/2)**(1-1/(2*X.shape[1]+2))/(X.shape[0]/2) (which returns 0.9830025027844762) and

best_algo = CausalForestDML(
                      model_y = GradientBoostingRegressor(n_estimators = 200, random_state = 42), 
                      model_t = GradientBoostingRegressor(n_estimators = 200, random_state = 42), 
                      n_estimators = 200,
                      n_crossfit_splits = [(fold0, fold1), (fold1, fold0)],
                      min_balancedness_tol=0.5, max_samples=subsample_fr_ / 2, random_state = 123)

The population summary and histogram are now:

Final estimate at the product, year, week level
               Uncertainty of Mean Point Estimate               
=================================================================
mean_point stderr_mean  zstat  pvalue ci_mean_lower ci_mean_upper
-----------------------------------------------------------------
   -0.0228      0.0185 -1.2333 0.2175       -0.0532        0.0076
      Distribution of Point Estimate     
=========================================
std_point pct_point_lower pct_point_upper
-----------------------------------------
   0.0209         -0.0565          0.0123
     Total Variance of Point Estimate     
==========================================
stderr_point ci_point_lower ci_point_upper
------------------------------------------
      0.0279        -0.0654          0.027
------------------------------------------

(histogram image attached)

As you can see, the results are pretty different between the "old" and "new" econml code.

vsyrgkanis commented 3 years ago

Thanks @gcasamat! That's interesting behavior, and it would be great if you could help us understand where this qualitative difference is coming from.

To me the results seem to be within statistical error, so I wouldn't describe them as very different. The stderr of your heterogeneous estimates is roughly 0.02 and your point estimate is roughly -0.02, so anything in the range -0.06 to 0.02 could be the truth for most predictions. This is consistent across both estimators.

Also, the mean estimate (ATE) and its uncertainty seem almost the same to me. The random seed is used differently in the two codebases, which could be the reason for the small changes.

The main qualitative difference is that the new CausalForestDML produces point estimates that are more dispersed and more centered. Let's investigate why this happens.

could you give us any synthetic data that reproduce this behavior?

Also, can you try the following: the new forest DML has a different default subforest size for the bootstrap of little bags, which could be the difference.

Can you try running the new code with subforest_size=40? This would be closer to what the old code was doing.

gcasamat commented 3 years ago

As suggested, I have fitted:

est_forest_gb = CausalForestDML(
                            model_y = GradientBoostingRegressor(n_estimators = 200, random_state = 42), 
                            model_t = GradientBoostingRegressor(n_estimators = 200, random_state = 42), 
                            n_estimators = 200,
                            n_crossfit_splits = [(fold0, fold1), (fold1, fold0)],
                            min_balancedness_tol = 0.5, max_samples = subsample_fr_ / 2,
                            subforest_size = 40, random_state = 123)

and obtained :


                Uncertainty of Mean Point Estimate               
=================================================================
mean_point stderr_mean  zstat  pvalue ci_mean_lower ci_mean_upper
-----------------------------------------------------------------
    -0.021      0.0176 -1.1936 0.2326       -0.0498        0.0079
      Distribution of Point Estimate     
=========================================
std_point pct_point_lower pct_point_upper
-----------------------------------------
   0.0223         -0.0571           0.018
     Total Variance of Point Estimate     
==========================================
stderr_point ci_point_lower ci_point_upper
------------------------------------------
      0.0284        -0.0634         0.0319
------------------------------------------

(histogram image attached)

gcasamat commented 3 years ago

These are real data (not synthetic). I could, however, send them to you for testing. How should I proceed?

Let me explain my concern with the new code results a bit more. I estimate elasticities with the model Y = theta(X)*T + g(X, W). I then fit g, based on the estimated elasticities, in order to get predictions for Y. It turns out that some predictions with the new code completely "miss the target" (which was not the case with the previous code).

vsyrgkanis commented 3 years ago

If these are public data then that would be great. If not, then I am afraid we would need to make sure that privacy is not compromised; that would take a while and doesn't seem viable.

A better route would be if you could generate semi-synthetic data with similar marginal distributions where you observe the same qualitative behavior.

Can you also share the post-processing code you mentioned, where you fit the g function? What do you mean by "miss"?

One more reason why the new code could differ from the old code is if scikit-learn performs tree pruning after fitting. The new code does not do tree pruning. I will investigate.

vsyrgkanis commented 3 years ago

By default no pruning is performed by sklearn. So that is not the source of discrepancy.

gcasamat commented 3 years ago

Here is the code for fitting the g function:

coef_price = best_algo.const_marginal_effect(data_for_dml[product_id_varlist+['year','week']])
temp = data_for_dml[treatment_varlist].multiply(coef_price,axis='index')
data_for_dml['temp'] = temp
Z = data_for_dml[dep_var] - data_for_dml.temp    
rf = RandomForestRegressor(n_estimators = 100, oob_score = True, random_state = 40, max_features = 0.33)           
rf.fit(W, Z)

By "miss", I simply mean large residuals between the predicted and actual values of Y. Finally, concerning the data, these are private, but anonymized, and I could send you a sample for testing. I unfortunately do not have enough time right now to generate synthetic data.

gcasamat commented 3 years ago

It seems that there are pretty easy ways to generate synthetic data from real data in Python (PySynth). I could try this? Or maybe you have some advice on better alternatives. Thanks.

vsyrgkanis commented 3 years ago

This looks very interesting; I didn't know of this library. Yes, if you could create such synthetic data and then publish it here (e.g. upload a CSV with code that analyzes it), that would be great.

Regarding your code: do you have a single treatment? Otherwise the "multiply" operation produces a vector of numbers theta_i(X) * T_i, one for each treatment T_i, and what you want is the sum of all these numbers. You could also get this by calling best_algo.effect(X, T1=T), where T is the treatment level of each sample; this produces the inner product directly. Also, does W contain X too? I would include the concatenation of X and W in the final regression.
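To make that concrete, here is a minimal sketch of the two equivalent computations (X_hetero stands in for your heterogeneity features and assumes several treatments, so the effect matrix has shape (n, d_t); with a single treatment both reduce to theta(X) * T):

import numpy as np

theta = best_algo.const_marginal_effect(X_hetero)   # shape (n, d_t): one elasticity per treatment
manual = np.sum(theta * T.values, axis=1)           # sum_i theta_i(X) * T_i for each sample

# the library computes the same inner product directly (T0 defaults to 0)
direct = best_algo.effect(X_hetero, T1=T.values)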

Another point on differences between the old and new code:

1) Try setting min_samples_leaf=5 on both versions of the code. The 0.8 code has a default of min_samples_leaf=1 and 0.9 has min_samples_leaf=5.

2) There is another discrepancy between the two codes that you cannot do anything about: the old code first built the tree on half the data and then populated the leaf values with the other half, removing nodes that happened to have no samples on the second half, i.e. a sort of pruning. The new code combines this: when it trains the tree, it checks that each split keeps at least min_samples_leaf samples on both half-samples, and then chooses the best split among these valid splits. The 0.8 code could therefore be inducing shallower trees, because it prunes nodes and hence regularizes more towards the mean, which is what you observe in the histogram, while the new code has more chances of creating deep trees.

One thing you could do to avoid this is to tune the max_depth and the min_samples_leaf parameters.

Our 0.9 version has two cool new additions that allow you to do that: the score.RScorer and the refit_final method.

You can call CausalForestDML().fit(....., cache_values=True).

Then you can change any of the forest parameters and call refit_final(); this refits just the final stage, which is faster. Then instantiate an RScorer object and fit it on a validation set, score each parameterization of the CausalForestDML on this validation set, and pick the hyperparameters with the highest score. Check out our two notebooks on how to use these: https://github.com/microsoft/EconML/blob/master/notebooks/Double%20Machine%20Learning%20Examples.ipynb https://github.com/microsoft/EconML/blob/master/notebooks/Causal%20Model%20Selection%20with%20the%20RScorer.ipynb

e.g. in your case:

# constructor arguments (first-stage models, etc.) omitted for brevity
est = CausalForestDML()
scorer = RScorer()
y_train, y_val, T_train, T_val, X_train, X_val, W_train, W_val = train_test_split(y, T, X, W, train_size=.7)

est.fit(y_train, T_train, X=X_train, W=W_train, cache_values=True)
scorer.fit(y_val, T_val, X=X_val, W=W_val)
scorer.score(est)

# change a forest hyperparameter, refit only the final stage, and re-score
est.min_samples_leaf = 20
est.refit_final()
scorer.score(est)

est.min_samples_leaf = 50
est.refit_final()
scorer.score(est)

est.min_samples_leaf = 5
est.max_depth = 3
est.refit_final()
scorer.score(est)

Then choose the hyperparameters with the highest score and refit the CausalForestDML on all the data with those hyperparameters.

vsyrgkanis commented 3 years ago

Also, could you print here one tree from the old code and one from the new code? You can do that in the 0.8 version with:

from sklearn.tree import plot_tree
plot_tree(est.model_cate[0])

and in the new version with:

plot_tree(est[0][0])

gcasamat commented 3 years ago

Great! I will try to generate the synthetic data and send it by tomorrow.

Regarding your questions: yes, I have a single treatment (the code was written to accommodate multiple treatments) and yes, W does contain X.

vsyrgkanis commented 3 years ago

So definitely concatenate W and X in the last regression.

And also try setting a larger minimum leaf size, along with the hyperparameter tuning proposals.

Not sure if/when we'll be able to do anything with the synthetic data on our side, but it would be great to have when we get to it. Let's continue figuring out the source of the qualitative difference on this issue.

gcasamat commented 3 years ago

This is fine with me.

Just one thing I don't understand: why should I concatenate W and X, given that W already contains X? Wouldn't that duplicate the columns of X in the matrix of predictors of the random forest?

vsyrgkanis commented 3 years ago

Oh OK, I misread. If W contains both then it's fine. Just one note: our library assumes that W and X are disjoint and internally concatenates them when estimating the first-stage models, so if you pass W and X to DML estimators and W contains X, then internally we will be estimating models where X is duplicated. So I'd recommend that W and X be disjoint when passed to DML.

Finally: I tried some experiments and I do think that your main source of qualitative difference is this discrepancy between whether we check the min_samples_leaf constraint on the validation half-sample during training, or whether we retrospectively prune leaves that are empty on the validation set. The latter induces much shallower trees than the former, so your 0.8 model should have much shallower trees.

I would then definitely either try hyperparameter tuning as described, or increase min_samples_leaf in CausalForestDML, or decrease max_depth (e.g. min_samples_leaf=30, max_depth=3 or 5).

That should make the model much more stable and closer to the 0.8 variant.
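If it helps, here is a minimal sketch combining those two recommendations (X_cols as a name for the heterogeneity columns is hypothetical; model_y and model_t are your existing first-stage models):

W_disjoint = W.drop(columns=X_cols, errors='ignore')   # keep X and W disjoint

est = CausalForestDML(model_y=model_y,
                      model_t=model_t,
                      n_estimators=200,
                      min_samples_leaf=30,   # stronger regularization, closer to the 0.8 behavior
                      max_depth=5,
                      random_state=123)
est.fit(Y, T, X=X[X_cols], W=W_disjoint)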

gcasamat commented 3 years ago

Many thanks for all this advice!

I just have one last question: are the default settings in CausalForestDML the same as in the R grf package?

vsyrgkanis commented 3 years ago

Yes, they try to emulate the same defaults for uniformity with the R package.

vsyrgkanis commented 3 years ago

Here is a complete code sample for hyperparameter tuning:

import numpy as np
from econml.score import RScorer
from econml.dml import CausalForestDML
from sklearn.model_selection import train_test_split

ytrain, yval, Ttrain, Tval, Xtrain, Xval, Wtrain, Wval = train_test_split(Y, T, X, W, train_size=.7)
est = CausalForestDML(model_y=model_y,
                      model_t=model_t,
                      discrete_treatment=True,
                      cv=3,
                      n_estimators=100,
                      min_samples_split=2,
                      random_state=123)
est.fit(ytrain, Ttrain, X=Xtrain, W=Wtrain, cache_values=True)
scorer = RScorer(model_y=model_y, model_t=model_t, discrete_treatment=True, cv=3)
scorer.fit(yval, Tval, X=Xval, W=Wval)

scores = []
est.min_var_leaf_on_val = True
est.fit_intercept = False
for max_samples in [.3, .5]:
    est.max_samples = max_samples
    for min_balancedness_tol in [.5]:
        est.min_balancedness_tol = min_balancedness_tol
        for min_samples_leaf in [10, 50]:
            est.min_samples_leaf = min_samples_leaf
            for max_depth in [3, 5]:
                est.max_depth = max_depth
                for min_var_fraction_leaf in [None, .01]:
                    est.min_var_fraction_leaf = min_var_fraction_leaf
                    est.refit_final()
                    scores.append((scorer.score(est), max_samples,
                                   min_balancedness_tol, min_samples_leaf, 
                                   max_depth, min_var_fraction_leaf))

bestind = np.argmax(np.array(scores)[:, 0])
scores[bestind]

best_score, max_samples, min_balancedness_tol, min_samples_leaf, max_depth, min_var_fraction_leaf = scores[bestind]
est.max_samples = max_samples
est.min_balancedness_tol = min_balancedness_tol
est.min_samples_leaf = min_samples_leaf
est.max_depth = max_depth
est.min_var_fraction_leaf = min_var_fraction_leaf
est.n_estimators = 4000
est.fit(Y, T, X=X, W=W, cache_values=True)

gcasamat commented 3 years ago

Thanks a lot!

vsyrgkanis commented 3 years ago

And here is a more generic version to which you can easily add hyperparameters to tune:

import numpy as np
from econml.score import RScorer
from econml.dml import CausalForestDML
from sklearn.model_selection import train_test_split
from itertools import product

ytrain, yval, Ttrain, Tval, Xtrain, Xval, Wtrain, Wval = train_test_split(Y, T, X, W, train_size=.7)
est = CausalForestDML(model_y=model_y,
                      model_t=model_t,
                      discrete_treatment=True,
                      cv=3,
                      n_estimators=100,
                      min_samples_split=2,
                      random_state=123)
est.fit(ytrain, Ttrain, X=Xtrain, W=Wtrain, cache_values=True)

scorer = RScorer(model_y=model_y, model_t=model_t, discrete_treatment=True, cv=3)
scorer.fit(yval, Tval, X=Xval, W=Wval)

grid = {'max_samples': [.3, .5],
        'min_balancedness_tol': [.3, .5],
        'min_samples_leaf': [10, 50],
        'max_depth': [5, None],
        'min_var_fraction_leaf': [None, .01]}

names = grid.keys()
scores = []
for values in product(*grid.values()):
    for key, value in zip(names, values):
        setattr(est, key, value)
    est.refit_final()
    scores.append((scorer.score(est), tuple(zip(names, values))))

bestind = np.argmax([s[0] for s in scores])
best_score, params = scores[bestind]
for key, value in params:
    setattr(est, key, value)
est.n_estimators = 4000
est.fit(Y, T, X=X, W=W, cache_values=True)

pred = est.effect(X_test)
lb, ub = est.effect_interval(X_test, alpha=0.01)