Open Akshay1006 opened 3 years ago
Why do you need to do one prediction at a time?
Also, if you care mostly about point estimates and not confidence intervals, you can substantially reduce the number of trees from the default 4000 to something like 400.
I need a prediction for a single record for the real-time implementation of this model. In real time, we will get a call and we want to score based on that. Also, I am running Causal Forest DML with 50 trees. Not sure if I can optimize it further or not.
Oh sorry, I did not see "milliseconds"!! I read seconds...
Hm, what is the latency for sklearn on a similar problem? Our code is based on sklearn's Cython base code, so we won't be able to improve beyond that.
Another alternative to improve latency would be to reduce max_depth, e.g. max_depth=3 or 5. This can introduce some bias into the model, but if latency is more important, it would help.
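A minimal sketch combining the two suggestions above (fewer trees than the 4000 default, plus a shallower max_depth); the exact values here are illustrative, not tuned for any particular dataset:

```python
from econml.dml import CausalForestDML

# Illustrative values only: fewer trees and a shallower max_depth trade
# some accuracy / confidence-interval quality for lower prediction latency.
est = CausalForestDML(
    n_estimators=400,  # down from the default 4000
    max_depth=5,       # e.g. 3 or 5
)
```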
Got it. We will check the latency for the sklearn Cython base. We were previously working with XGBoost, and the latency there is close to 2-3 ms. We have already tried reducing max depth, but it did not change things much. Meanwhile, is there any way to extract the trees from the trained model and then write a custom prediction function on top of them to reduce the latency?
That seems scary; it is easy to introduce logic bugs that way. For instance, in a causal forest the prediction is not the average of the tree predictions but something different. If you could paste the training code, that would be helpful.
This is the training code for Causal Forest DML:

```python
est = CausalForestDML(
    model_y=LGBMRegressor(
        max_depth=9, importance_type='gain', n_estimators=50,
        objective='regression', min_child_samples=200, colsample_bytree=0.8),
    model_t=LGBMClassifier(
        max_depth=9, importance_type='gain', n_estimators=50,
        objective='binary', min_child_samples=200, colsample_bytree=0.8),
    n_jobs=-1,
    max_depth=7,
    discrete_treatment=True,
    n_estimators=60,
    min_samples_leaf=200,
    min_impurity_decrease=0.001,
    verbose=0,
)
```
Also, one naive question on the scoring logic. The trees are split so that treatment-effect heterogeneity is maximized, and then at scoring time, aren't we just taking an average of the treatment effect across those trees, depending on which leaf the input falls in? If yes, how is that different from scoring in a random forest?
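To make the comparison in the question concrete, here is a small illustration of what scoring in a *standard* random forest looks like: the prediction for a record is exactly the mean of the per-tree predictions. (As noted above, a causal forest's prediction is not this simple average, which is why hand-rolling a custom scorer is risky.) This uses sklearn as a stand-in; the dataset is synthetic.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, purely for illustration.
X, y = make_regression(n_samples=500, n_features=5, random_state=0)
rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

x = X[:1]  # a single record
# Standard random-forest scoring: average the per-tree predictions.
per_tree = np.array([tree.predict(x)[0] for tree in rf.estimators_])
assert np.isclose(rf.predict(x)[0], per_tree.mean())
```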
Hi Team - We have built a CausalForestDML model with the EconML package and want to take it to production. However, we are failing to meet the latency requirement when running the .effect function to get the heterogeneous treatment effect. Current latency is around ~20 ms, whereas the same number for XGBoost is around 2-3 ms. Our hypothesis is that the .effect function is optimized for batch prediction, whereas when we do a single-row prediction, the time increases. Is there any way to reduce the latency of the current model?
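The batch-vs-single-row hypothesis can be sketched with a quick timing comparison. This uses a plain sklearn regression forest as a stand-in (EconML's forests are built on sklearn's Cython tree code, as mentioned above); the forest size mirrors the training config (60 trees, max_depth=7), the data is synthetic, and the absolute numbers are machine-dependent.

```python
import time
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=2000, n_features=10, random_state=0)
rf = RandomForestRegressor(n_estimators=60, max_depth=7, random_state=0).fit(X, y)

n = 200
t0 = time.perf_counter()
for i in range(n):          # one predict call per record (real-time scoring)
    rf.predict(X[i:i + 1])
per_row = (time.perf_counter() - t0) / n

t0 = time.perf_counter()
rf.predict(X[:n])           # one batched call over the same records
batch = (time.perf_counter() - t0) / n

# Per-call overhead (input validation, dispatch) is amortized in the batch,
# so the per-record cost of single-row calls is typically much higher.
print(f"per-row: {per_row * 1e3:.3f} ms/record, batched: {batch * 1e3:.3f} ms/record")
```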