py-why / EconML

ALICE (Automated Learning and Intelligence for Causation and Economics) is a Microsoft Research project aimed at applying Artificial Intelligence concepts to economic decision making. One of its goals is to build a toolkit that combines state-of-the-art machine learning techniques with econometrics in order to bring automation to complex causal inference problems. To date, the ALICE Python SDK (econml) implements orthogonal machine learning algorithms such as the double machine learning work of Chernozhukov et al. This toolkit is designed to measure the causal effect of some treatment variable(s) t on an outcome variable y, controlling for a set of features x.
https://www.microsoft.com/en-us/research/project/alice/

v0.15.0 runs hours longer than v0.14.0 #895

Open Jiaqi-ads opened 1 month ago

Jiaqi-ads commented 1 month ago

Hi EconML team,

I've just upgraded my EconML package to v0.15.0, and it seems like the new version runs much slower than v0.14.0, even with one of the simplest CATE estimators. For example, I trained a linear DR model with v0.14.0 in under 5 minutes, yet it took hours to train the same linear DR model with v0.15.0 (i.e. all variables and datasets used remain unchanged). I wonder what has changed in v0.15.0 that might lead to this problem?
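For context, here is a minimal sketch of the kind of call involved (synthetic data; the shapes and names are illustrative, not my actual dataset):

```python
import time

import numpy as np
from econml.dr import LinearDRLearner

rng = np.random.default_rng(0)
n = 50_000  # illustrative; my real dataset is larger

X = rng.normal(size=(n, 30))                       # heterogeneity features
W = (rng.random(size=(n, 20)) < 0.3).astype(int)   # one-hot-style controls
T = rng.integers(0, 2, size=n)                     # binary treatment
Y = rng.normal(size=n)                             # outcome

est = LinearDRLearner()  # default first-stage models
start = time.time()
est.fit(Y, T, X=X, W=W)
print(f"fit took {time.time() - start:.1f}s")
```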

kbattocchi commented 1 month ago

The only change that I can think of is that we have changed the default first-stage propensity and regression models to do model selection between linear and forest models instead of always just using a linear model.

We made this change because the accuracy of the CATE estimate depends strongly on having good models, and for many datasets we'd expect forest models to fit the data much better. In general, this has not resulted in large slowdowns in our own internal testing, but perhaps you have a much larger number of rows or columns than we've been testing on - what are the shapes of your Y, T, X, and W inputs?

If fitting forest models is the cause of the slowdown, you can explicitly pass first-stage models of your choice instead. However, as I mentioned, it is important to use models that can actually fit your data well if you want accurate CATE estimates, so I would only fall back on linear models if you are confident that they have good predictive power in your setting.
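For example, here is a sketch of pinning the first-stage models so that no model selection runs, assuming plain cross-validated sklearn linear models are adequate for your data:

```python
from econml.dr import LinearDRLearner
from sklearn.linear_model import LassoCV, LogisticRegressionCV

est = LinearDRLearner(
    model_propensity=LogisticRegressionCV(max_iter=1000),  # propensity model for P(T | X, W)
    model_regression=LassoCV(),                            # outcome model for E[Y | T, X, W]
)
# est.fit(Y, T, X=X, W=W)  # then fit as before, with your own Y, T, X, W
```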

As a side note, we released v0.15.1 yesterday, which contains some bugfixes, so you may want to upgrade to that, but I don't expect it to affect your performance issues if the cause is what I've outlined above.

Jiaqi-ads commented 1 month ago

Thanks for your prompt response! @kbattocchi

The dataset I was testing on contains about 500,000 rows and about 50 columns in X and W combined, mostly one-hot encoded categorical variables. So maybe it is due to the change in the default first-stage models?

On the accuracy of the first-stage models, though: while I agree that forest models tend to be more accurate, and that more accurate first-stage models lead to better CATE estimation, I'm aware of arguments that forest models tend to produce more extreme probability scores in classification tasks. This could affect the output of the propensity model, and of the regression model as well if the outcome variable is binary, which would ultimately affect the performance of the final CATE model. May I ask what your thoughts are on this? Thanks in advance.
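For instance, one generic mitigation I've seen (not specific to EconML) is to wrap the forest classifier in cross-validated probability calibration before handing it to the estimator; a rough sketch, assuming the extra cross-fitting cost is acceptable:

```python
from econml.dr import LinearDRLearner
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Isotonic calibration tempers the extreme probability scores a raw
# forest can produce, before they enter the doubly robust correction.
calibrated_propensity = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=100, min_samples_leaf=20),
    method="isotonic",
    cv=3,
)

est = LinearDRLearner(
    model_propensity=calibrated_propensity,
    model_regression=RandomForestRegressor(n_estimators=100, min_samples_leaf=20),
)
# est.fit(Y, T, X=X, W=W) as usual
```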

Jiaqi-ads commented 1 month ago

Hi, just wanted to follow up on the speed issue. I've upgraded to v0.15.1 and tried setting both model_propensity and model_regression to 'linear'. It still took hours to finish training on the dataset, whereas it took only four minutes with v0.14.0. Moreover, the execution time was the same when setting those parameters to 'auto', and changing them to 'forest' didn't affect the execution time much either. So I wonder if there could be some other issue?
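Roughly what I timed on v0.15.1 (assuming the same Y, T, X, and W arrays described above; the string options are as I understand them from the docs):

```python
import time

from econml.dr import LinearDRLearner

for models in ('linear', 'auto', 'forest'):
    est = LinearDRLearner(model_propensity=models, model_regression=models)
    start = time.time()
    est.fit(Y, T, X=X, W=W)
    print(models, f"{time.time() - start:.0f}s")  # all three took hours
```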