sb-ai-lab / Py-Boost

Python based GBDT implementation on GPU. Efficient multioutput (multiclass/multilabel/multitask) training
Apache License 2.0

Py-BoostLSS: An extension of Py-Boost to probabilistic modelling #1

Open StatMixedML opened 1 year ago

StatMixedML commented 1 year ago

Dear Py-Boost developers,

Thanks for the very interesting paper and for making the code publicly available.

I am the author of XGBoostLSS and LightGBMLSS that extend the base implementations to a probabilistic setting, where all moments of a parametric univariate and multivariate distribution are modeled as functions of covariates. This allows one to create probabilistic predictions from which intervals and quantiles of interest can be derived.

However, as outlined in my latest paper, März, Alexander (2022), Multi-Target XGBoostLSS Regression, XGBoost does not scale well in the multivariate setting, since a separate tree is grown for each parameter individually. As an example, consider modelling a multivariate Gaussian distribution with D=100 target variables, where the covariance matrix is approximated using the Cholesky decomposition. Modelling all conditional moments (i.e., means, standard deviations and all pairwise correlations) requires estimating D(D + 3)/2 = 5,150 parameters.
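To make the count concrete: a D-dimensional Gaussian has D means, and a lower-triangular Cholesky factor has D(D + 1)/2 free entries, which together give D(D + 3)/2. A minimal sketch of that arithmetic follows; the function name is hypothetical and not part of XGBoostLSS or Py-BoostLSS.

```python
def mvn_param_count(d: int) -> dict:
    """Count the distributional parameters of a d-dimensional Gaussian
    whose covariance is parameterised by a lower-triangular Cholesky factor.

    Hypothetical helper for illustration only.
    """
    n_means = d
    # Diagonal plus strictly lower-triangular entries of the Cholesky factor.
    n_cholesky = d * (d + 1) // 2
    return {"means": n_means, "cholesky": n_cholesky, "total": n_means + n_cholesky}


counts = mvn_param_count(100)
print(counts)  # {'means': 100, 'cholesky': 5050, 'total': 5150}
```

The same total results from counting D means, D standard deviations and D(D - 1)/2 pairwise correlations.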

I came across your approach just recently and spent the last few days extending your base model to Py-BoostLSS: An extension of Py-Boost to probabilistic modelling. Because it is very runtime efficient, SketchBoost is a good candidate for estimating high-dimensional target variables. The package is in a very early stage and I need to evaluate the runtime efficiency against XGBoostLSS.

@btbpanda I was wondering if you would be interested in a scientific collaboration to further extend the functionality of Py-BoostLSS. Looking forward to your reply.

valeman commented 1 year ago

Colleagues, these are all very old methods that do not work, produce biased predictions, and have no probabilistic guarantees.

Right now the number one method for predicting uncertainty is conformal prediction. It is already implemented in many libraries in the West, including MAPIE, NeuralProphet, SKTime

https://github.com/valeman/awesome-conformal-prediction

StatMixedML commented 1 year ago

@valeman Thanks for your comment.

However, I have to disagree, in particular with this statement:

“these are all very old methods that do not work”

Unfortunately, this is a very one-sided view and things are a little more nuanced. If calibrated prediction intervals are all you need, then conformal inference is a valid tool. Yet creating prediction intervals is not the same thing as conditional density modelling and prediction, which is what the Py-BoostLSS model does. In some sense, you get prediction intervals from conditional density estimation for free, by calculating quantiles from the predicted distribution. Also, if you want to draw conclusions about what drives the conditional moments of a distribution, say variance or skewness, then conformal prediction can't help you.
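As a rough illustration of the point about quantiles: once a model has predicted the parameters of a conditional distribution, an interval is just a pair of quantiles of that distribution. The sketch below assumes a univariate Gaussian and uses only the Python standard library; the function name and the example means and standard deviations are made up, and this is not the Py-BoostLSS API.

```python
from statistics import NormalDist


def prediction_interval(mean: float, std: float, coverage: float = 0.9) -> tuple[float, float]:
    """Central prediction interval from a predicted Gaussian N(mean, std^2).

    Hypothetical helper: the mean and std would come from a fitted
    distributional model, one pair per observation.
    """
    alpha = 1.0 - coverage
    dist = NormalDist(mu=mean, sigma=std)
    # Lower and upper quantiles of the predicted distribution.
    return dist.inv_cdf(alpha / 2), dist.inv_cdf(1 - alpha / 2)


# Two observations with the same predicted mean but different predicted spread.
predicted_params = [(10.0, 1.0), (10.0, 3.0)]
for mu, sigma in predicted_params:
    lower, upper = prediction_interval(mu, sigma)
    print(f"mean={mu}, std={sigma}: 90% interval = ({lower:.2f}, {upper:.2f})")
```

Because the spread is itself a function of the covariates, the interval width varies from one observation to the next.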

I don’t think that your general statement is true, since GAMLSS models are well established in the statistical literature. Also, I would like to know why you believe that “these are all very old methods that do not work”. I know you have posted an evaluation of NGBoost, but I am not sure that it allows a general conclusion to be drawn. Also, what kind of uncertainty does conformal prediction model: epistemic, aleatoric or both? In other words, does conformal prediction model parameter uncertainty (like a posterior distribution of the parameter) or the true underlying data uncertainty that is inherent in any data-generating process, where we sample from an underlying population?

btbpanda commented 1 year ago

Hey, @StatMixedML @valeman !

@StatMixedML Thanks for your feedback. I am happy you like our work and find it useful for your research. I believe that what you propose is one of the features we should implement, and I have probabilistic modelling on my TODO list. So we definitely should set up a research collaboration. If you need any help or ideas on how to implement some features inside Py-Boost, just let me know. I am going to study your repo and paper soon.

There is, however, a limitation that you need to keep in mind when working with multioutput targets in Py-Boost. It is single GPU only. Maybe later I will add multi-GPU support, but for now your dataset should fit in GPU memory. For 5k dimensions that could be a problem if you have many rows. So far I have tried at most 2.5k targets on real-world datasets. But yes, if it fits, it will be an order of magnitude faster than the others.
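A back-of-the-envelope check of whether such a problem fits on one GPU could look like the sketch below. It assumes float32 values and a single dense array of shape (rows, outputs); how many such arrays Py-Boost actually keeps on the device (targets, predictions, gradients and so on) is not stated here, and the row count is a made-up example.

```python
def dense_array_gib(n_rows: int, n_outputs: int, bytes_per_value: int = 4) -> float:
    """Size in GiB of one dense (n_rows, n_outputs) array.

    Hypothetical helper; bytes_per_value=4 assumes float32 storage.
    """
    return n_rows * n_outputs * bytes_per_value / 2**30


# Assumed example: 1,000,000 rows and the 5,150 outputs from the Gaussian case.
n_rows, n_outputs = 1_000_000, 5_150
per_array = dense_array_gib(n_rows, n_outputs)
print(f"one float32 array of shape ({n_rows}, {n_outputs}): {per_array:.1f} GiB")
```

Under these assumptions one such array alone takes roughly 19 GiB, so the number of rows quickly becomes the limiting factor at this output dimension.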

@valeman Thanks for your reply. I think Py-Boost is flexible enough to implement multiple approaches, so we should check in practice which performs better in terms of both speed and accuracy.

StatMixedML commented 1 year ago

Thanks for your feedback. Sounds great! Let me know if you have any comments or suggestions on the repo. Please note that it is at a very early stage and subject to improvements. I suggest you start with the Multi-Target XGBoostLSS paper.