py-why / EconML

ALICE (Automated Learning and Intelligence for Causation and Economics) is a Microsoft Research project aimed at applying Artificial Intelligence concepts to economic decision making. One of its goals is to build a toolkit that combines state-of-the-art machine learning techniques with econometrics in order to bring automation to complex causal inference problems. To date, the ALICE Python SDK (econml) implements orthogonal machine learning algorithms such as the double machine learning work of Chernozhukov et al. This toolkit is designed to measure the causal effect of some treatment variable(s) t on an outcome variable y, controlling for a set of features x.
https://www.microsoft.com/en-us/research/project/alice/

EconML package failing the latency in Production #497

Open Akshay1006 opened 3 years ago

Akshay1006 commented 3 years ago

Hi Team - We have built a CausalForestDML model with the EconML package and want to take it to production. However, we are failing to meet the latency requirement when running the .effect function to get the heterogeneous treatment effect. Current latency is around ~20 ms, whereas the same number for XGBoost is around 2-3 ms. Our hypothesis is that the .effect function is optimized for batch prediction, so the per-row time increases when we do a single-row prediction. Is there any way to reduce the latency of the current model?
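The per-call overhead described above can be illustrated with a plain sklearn forest, used here as a stand-in for CausalForestDML since both build on sklearn's tree code. This is a sketch with made-up data, not the reporter's benchmark; the absolute numbers will differ, but the single-row vs. batched gap is the point.

```python
import time
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = X[:, 0] + rng.normal(size=1000)

rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
X_new = rng.normal(size=(100, 10))

# One call per row: Python/validation overhead is paid 100 times.
t0 = time.perf_counter()
single = np.array([rf.predict(row.reshape(1, -1))[0] for row in X_new])
t_single = time.perf_counter() - t0

# One batched call: the same overhead is paid once and amortized.
t0 = time.perf_counter()
batched = rf.predict(X_new)
t_batch = time.perf_counter() - t0

print(f"single-row: {1000 * t_single / 100:.3f} ms/row, "
      f"batched: {1000 * t_batch / 100:.3f} ms/row")
```

The two approaches return identical predictions; only the per-row cost differs, which is why scoring one record at a time looks much slower than an offline batch job.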

vsyrgkanis commented 3 years ago

Why do you need to do one prediction at a time?

vsyrgkanis commented 3 years ago

Also, if you care mostly about point estimates and not confidence intervals, you can substantially reduce the number of trees from the default 4000 to something like 400.
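Prediction cost scales roughly linearly with the number of trees, since every tree is traversed per call. A small sklearn sketch (tree counts scaled down from the 4000/400 in the comment to keep it quick; synthetic data):

```python
import time
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = X[:, 0] + rng.normal(size=1000)
x0 = X[:1]  # a single row, shape (1, 10)

timings, forests = {}, {}
for n_trees in (400, 40):
    rf = RandomForestRegressor(n_estimators=n_trees, random_state=0).fit(X, y)
    t0 = time.perf_counter()
    for _ in range(20):  # repeat to smooth out timer noise
        rf.predict(x0)
    timings[n_trees] = time.perf_counter() - t0
    forests[n_trees] = rf

print({k: f"{1000 * v / 20:.3f} ms/call" for k, v in timings.items()})
```

Fewer trees means a noisier point estimate and wider (or unreliable) confidence intervals, which is why this trade-off only makes sense when point estimates are all that is needed.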

Akshay1006 commented 3 years ago

I need a prediction for a single record for the real-time implementation of this model. In real time, we will get a call and we want to score on that basis. Also, I am running CausalForestDML with 50 trees; I am not sure whether I can optimize it further.

vsyrgkanis commented 3 years ago

Oh sorry, I did not see "milliseconds"! I read seconds...

Hmm, what is the latency for sklearn on a similar problem? Our code is based on sklearn's Cython tree code, so we won't be able to improve beyond that.

Another alternative to improve latency would be to reduce max_depth, e.g. max_depth=3 or 5. This can introduce some bias into the model, but if latency matters more, it would help.
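Capping max_depth shrinks every tree, so each prediction traverses fewer nodes. A hedged sklearn illustration of how much smaller the forest gets (synthetic data, stand-in model):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))
y = X @ rng.normal(size=10) + rng.normal(size=2000)

deep = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
shallow = RandomForestRegressor(n_estimators=50, max_depth=3,
                                random_state=0).fit(X, y)

def total_nodes(forest):
    """Total node count across all trees; a proxy for per-prediction work."""
    return sum(t.tree_.node_count for t in forest.estimators_)

print(f"unconstrained: {total_nodes(deep)} nodes, "
      f"max_depth=3: {total_nodes(shallow)} nodes")
```

The bias this introduces comes from the shallow trees being unable to represent fine-grained effect heterogeneity, so it is a deliberate accuracy-for-latency trade.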

Akshay1006 commented 3 years ago

Got it. We will check the latency for the sklearn Cython base. We were previously working with XGBoost, and the latency there is close to 2-3 ms. We have already tried reducing max_depth, but it did not change things much. Meanwhile, is there any way to extract the trees from the trained model and write a custom prediction function on top of them to reduce latency?

vsyrgkanis commented 3 years ago

That seems risky in terms of introducing logic bugs. For instance, in a causal forest the prediction is not the average of the individual tree predictions but something different. If you could paste the training code, that would be helpful.
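To make the warning concrete: for a plain sklearn regression forest, hand-rolling the prediction as a mean over trees does reproduce `.predict`, which is exactly why a custom exporter is tempting. The sketch below shows that equivalence for sklearn; per the comment above, a causal forest aggregates differently, so porting this shortcut to CausalForestDML would silently give wrong effects.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X[:, 0] ** 2 + rng.normal(size=500)

rf = RandomForestRegressor(n_estimators=30, random_state=0).fit(X, y)
X_test = rng.normal(size=(10, 5))

# For a plain regression forest, .predict IS the per-tree mean ...
manual = np.mean([tree.predict(X_test) for tree in rf.estimators_], axis=0)
print(np.allclose(manual, rf.predict(X_test)))
# ... but a causal forest's effect estimate is not this simple average,
# so a custom scorer built on this assumption would be incorrect there.
```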

Akshay1006 commented 3 years ago

This is the training code for CausalForestDML:

```python
est = CausalForestDML(
    model_y=LGBMRegressor(max_depth=9, importance_type='gain', n_estimators=50,
                          objective='regression', min_child_samples=200,
                          colsample_bytree=0.8),
    model_t=LGBMClassifier(max_depth=9, importance_type='gain', n_estimators=50,
                           objective='binary', min_child_samples=200,
                           colsample_bytree=0.8),
    n_jobs=-1,
    max_depth=7,
    discrete_treatment=True,
    n_estimators=60,
    min_samples_leaf=200,
    min_impurity_decrease=0.001,
    verbose=0,
)
```

Akshay1006 commented 3 years ago

Also, one naive question on the scoring logic. The trees are split so that treatment-effect heterogeneity is maximized, and then at scoring time, aren't we just averaging the treatment effect across trees, depending on which leaf the input falls in? If yes, how is that different from scoring in a random forest?