miranov25 / RootInteractive


Adding prediction intervals for GradientBoostingRegressor and review RandomForest prediction intervals - https://alice.its.cern.ch/jira/browse/ATO-459 #67

Open miranov25 opened 3 years ago

miranov25 commented 3 years ago

A GradientBoostingRegressor wrapper should be added to the list of wrappers in RootInteractive:

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html
https://medium.com/@qucit/a-simple-technique-to-estimate-prediction-intervals-for-any-regression-model-2dd73f630bcb

A QuantileRegressionForest clone should be integrated in a similar way (see the sketch below):
https://scikit-garden.github.io/examples/QuantileRegressionForests/
https://jmlr.org/papers/volume7/meinshausen06a/meinshausen06a.pdf

> Finally, quantile regression is not available for all types of regression models. In scikit-learn, the only model that implements it is the Gradient Boosted Regressor. Sometimes, such as in the case of XGBoost, you can customize the model’s cost function to obtain a quantile regressor. You can read the details of how to do it here.
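
A minimal sketch of the scikit-garden quantile forest mentioned above, assuming its historical API (the package is no longer maintained); the data here is synthetic and all parameters are illustrative:

```python
import numpy as np
from skgarden import RandomForestQuantileRegressor

# Synthetic toy data for illustration
X = np.random.rand(1000, 4)
y = X.sum(axis=1) + np.random.normal(scale=0.1, size=1000)

qrf = RandomForestQuantileRegressor(n_estimators=100, min_samples_leaf=10)
qrf.fit(X, y)

# In scikit-garden the quantile (in percent) is chosen at prediction
# time, not at fit time - see the interface discussion below
lower  = qrf.predict(X, quantile=5)
median = qrf.predict(X, quantile=50)
upper  = qrf.predict(X, quantile=95)
```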
miranov25 commented 3 years ago

In order to include new regressors and classifiers, the MLpipeline code has to be restructured: https://alice.its.cern.ch/jira/browse/ATO-459

Current version (TO BE deprecated)

miranov25 commented 3 years ago

Reference - GradientBoostingRegressor https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html

Quantiles are obtained at training time, using an appropriate cost function (see the sketch after the parameter description below):

loss : {‘ls’, ‘lad’, ‘huber’, ‘quantile’}, default=’ls’

> Loss function to be optimized. ‘ls’ refers to least squares regression. ‘lad’ (least absolute deviation) is a highly robust loss function solely based on order information of the input variables. ‘huber’ is a combination of the two. ‘quantile’ allows quantile regression (use alpha to specify the quantile).
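
For illustration, a minimal sketch of prediction intervals with loss="quantile" (synthetic toy data, parameters chosen arbitrarily):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic toy data for illustration
X = np.random.rand(1000, 4)
y = X.sum(axis=1) + np.random.normal(scale=0.1, size=1000)

# One model per quantile: loss="quantile" with alpha selecting the quantile
models = {}
for alpha in (0.05, 0.5, 0.95):
    gbr = GradientBoostingRegressor(loss="quantile", alpha=alpha,
                                    n_estimators=200)
    models[alpha] = gbr.fit(X, y)

lower  = models[0.05].predict(X)  # 5% quantile
median = models[0.50].predict(X)  # median
upper  = models[0.95].predict(X)  # 95% quantile -> [lower, upper] is a 90% interval
```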

Other References

https://towardsdatascience.com/regression-prediction-intervals-with-xgboost-428e0a018b
https://heartbeat.fritz.ai/5-regression-loss-functions-all-machine-learners-should-know-4fb140e9d4b0
https://www.evergreeninnovations.co/blog-quantile-loss-function-for-machine-learning/
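
Along the lines of the XGBoost references above, a hedged sketch of a custom quantile (pinball) objective; the constant hessian is an assumption (the true second derivative is zero), and all parameters are illustrative:

```python
import numpy as np
import xgboost as xgb

def quantile_objective(q):
    """Custom objective implementing the pinball-loss gradient for quantile q."""
    def objective(y_pred, dtrain):
        e = dtrain.get_label() - y_pred
        # d(pinball)/d(pred): -q where under-predicting, (1 - q) where over-predicting
        grad = np.where(e > 0, -q, 1.0 - q)
        # True hessian is zero; a constant keeps the Newton step well defined
        hess = np.ones_like(e)
        return grad, hess
    return objective

# Usage sketch (X, y are numpy arrays):
# dtrain = xgb.DMatrix(X, label=y)
# booster = xgb.train({"max_depth": 4, "eta": 0.1}, dtrain,
#                     num_boost_round=200, obj=quantile_objective(0.95))
```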

miranov25 commented 3 years ago

Deep quantile regression:

Based on the cost function discussed in:

https://alice.its.cern.ch/jira/browse/ATO-459?focusedCommentId=253647&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-253647

which describes the scalar version - one quantile per neural network (see the sketch below).

Quantile vector implementation in Jupyter notebook: https://github.com/strongio/quantile-regression-tensorflow/blob/master/Quantile%20Loss.ipynb
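
A minimal Keras sketch of the scalar variant, loosely following the notebook linked above; the layer sizes and function names are illustrative:

```python
import tensorflow as tf

def make_quantile_loss(q):
    """Pinball loss for a single quantile q in (0, 1)."""
    def loss(y_true, y_pred):
        e = y_true - y_pred
        # Under-prediction penalized with weight q, over-prediction with (1 - q)
        return tf.reduce_mean(tf.maximum(q * e, (q - 1.0) * e))
    return loss

def build_quantile_model(q, n_features):
    # One network per quantile - the scalar version described above
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(32, activation="relu", input_shape=(n_features,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss=make_quantile_loss(q))
    return model

# model_q95 = build_quantile_model(0.95, n_features=4)
# model_q95.fit(X, y, epochs=100, verbose=0)
```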

miranov25 commented 3 years ago

Quantile regression interface:

In general, quantiles should be defined before fitting (this is not needed in scikit-garden, but scikit-garden is not supported anymore).

BDTs and neural nets should be constructed knowing which quantiles are needed.

Proposed interface:
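
A hypothetical sketch of what such an interface could look like; all names (QuantileRegressionWrapper, model_factory, predictQuantiles) are illustrative, not the actual RootInteractive API:

```python
class QuantileRegressionWrapper:
    """Hypothetical wrapper: quantiles are fixed at construction time,
    so BDT and neural-net backends can be built knowing which quantiles
    are needed."""

    def __init__(self, model_factory, quantiles):
        # model_factory(q) returns a regressor configured for quantile q
        self.quantiles = quantiles
        self.models = {q: model_factory(q) for q in quantiles}

    def fit(self, X, y):
        for model in self.models.values():
            model.fit(X, y)
        return self

    def predictQuantiles(self, X):
        # Returns {quantile: predictions}
        return {q: model.predict(X) for q, model in self.models.items()}

# Usage sketch, e.g. with the GradientBoostingRegressor factory above:
# wrapper = QuantileRegressionWrapper(
#     lambda q: GradientBoostingRegressor(loss="quantile", alpha=q),
#     quantiles=(0.05, 0.5, 0.95))
# intervals = wrapper.fit(X, y).predictQuantiles(X)
```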