stanfordmlgroup / ngboost

Natural Gradient Boosting for Probabilistic Prediction
Apache License 2.0

Does NGBoost need iid in training set? #294

Closed CadePGCM closed 1 year ago

CadePGCM commented 1 year ago

I know that for typical gradient boosting algorithms like xgboost or lgbm this is often the case.

Is it also true for ngboost? I'm seeing significant improvement on non-iid data using NGBoost over the above in classification (which NGBoost was not really designed to improve), so I'm curious about the theory.

alejandroschuler commented 1 year ago

No method for supervised learning (probabilistic or otherwise) strictly requires IID data. In general the prediction target is a functional minimizer of a given loss, which is well-defined even if there is inter-observation dependence. For example, when doing point prediction with MSE loss the thing you end up estimating is the conditional mean of $Y$ given $X$. It may be the case that $Y_i$ actually depends on $X_j$ as well as on $X_i$, but the prediction function $f(X_i) = E[Y_i|X_i]$ is still perfectly well-defined (just marginalize over $X_j$).
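A minimal simulation sketch of the marginalization argument, under assumptions invented purely for illustration (the pairwise data-generating process and the constants `MU` and `N_PAIRS` are not from NGBoost or from this thread). Each outcome depends on its partner's covariate, yet a regression on the unit's own covariate alone should still recover $E[Y_i|X_i] = X_i + 0.5\,\mu$:

```python
import random

random.seed(0)
MU = 2.0          # mean of every covariate (illustrative value)
N_PAIRS = 20_000  # number of dependent pairs (i, j)

xs, ys = [], []
for _ in range(N_PAIRS):
    x_i = random.gauss(MU, 1.0)
    x_j = random.gauss(MU, 1.0)
    # Each outcome depends on its own covariate AND on its partner's covariate.
    y_i = x_i + 0.5 * x_j + random.gauss(0.0, 0.5)
    y_j = x_j + 0.5 * x_i + random.gauss(0.0, 0.5)
    xs += [x_i, x_j]
    ys += [y_i, y_j]

# Ordinary least squares of Y on its own X only (closed form for one feature).
n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
slope = sxy / sxx
intercept = y_bar - slope * x_bar

# Marginalizing over the partner gives E[Y_i | X_i] = X_i + 0.5 * MU.
print(f"slope={slope:.3f} (theory 1.0)")
print(f"intercept={intercept:.3f} (theory {0.5 * MU:.1f})")
```

The fitted line targets the marginal conditional mean even though the rows are not independent; the partner's influence shows up only as an intercept shift and extra residual variance.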

The thing you are trying to unbiasedly estimate at the end of the day is the prediction error. As long as a) your test set is drawn fairly from the same distribution as the data that you plan to eventually deploy the model to predict on and b) the statistical dependence between your training and test sets is the same as between the training data and the future deployment data, then your test-set error will be an unbiased estimate of the future generalization error. So you might want to do a training/test split by cluster instead of by individual, for example. Or you may need to do the splitting in a way that respects a time ordering.
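The two splitting strategies just mentioned could be sketched roughly as follows. Both helper functions are hypothetical and written here only for illustration, using the standard library; scikit-learn's `GroupShuffleSplit` and `TimeSeriesSplit` cover the same ground in practice.

```python
import random


def group_train_test_split(groups, test_fraction=0.25, seed=0):
    """Hypothetical helper: split row indices so no group spans both sets."""
    unique_groups = sorted(set(groups))
    rng = random.Random(seed)
    rng.shuffle(unique_groups)
    n_test = max(1, round(test_fraction * len(unique_groups)))
    test_groups = set(unique_groups[:n_test])
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx


def time_ordered_split(timestamps, test_fraction=0.25):
    """Hypothetical helper: train on the earliest rows, test on the latest."""
    order = sorted(range(len(timestamps)), key=lambda i: timestamps[i])
    n_test = max(1, round(test_fraction * len(order)))
    return order[:-n_test], order[-n_test:]


# Toy data: 4 subjects with 3 repeated measurements each.
subject_ids = [s for s in ("a", "b", "c", "d") for _ in range(3)]
timestamps = list(range(len(subject_ids)))

# Split by cluster: every subject lands entirely in train or entirely in test.
train_idx, test_idx = group_train_test_split(subject_ids)
train_subjects = {subject_ids[i] for i in train_idx}
test_subjects = {subject_ids[i] for i in test_idx}

# Split by time: every test row comes after every training row.
early_idx, late_idx = time_ordered_split(timestamps)
```

Which of the two is appropriate depends on how the deployment data will relate to the training data: new subjects call for the group split, future time points for the time-ordered one.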

NGBoost is no different. Here our target of inference is the full conditional distribution $P_{Y_i|X_i}$, which again is always well-defined. We're just averaging over all the additional variation that comes from inter-unit dependence.

alejandroschuler commented 1 year ago

(closing but feel free to continue discussion if you have more questions!)

osorensen commented 1 year ago

Hello @alejandroschuler and @CadePGCM, I'd like to follow up a bit on this question. While the case you explain above, @alejandroschuler, makes perfect sense, there are many applications where the observed outcomes can be correlated.

That is, what if $E[Y_i | X_i, Y_j] \neq E[Y_i | X_i]$ for some $i \neq j$. Typical cases can be when $Y_i$ and $Y_j$ are observations of the same person, taken at two different points in time, or when there is spatial dependency. This would lead to poorly calibrated confidence bands and biased estimates. Agree?

There have been some attempts at defining gradient boosting methods that handle this situation by introducing additional variance parameters, so that within-subject and between-subject variation is separated, e.g., https://www.degruyter.com/document/doi/10.1515/ijb-2020-0136/html. However, these methods do not seem to scale well, and I haven't seen regression trees used as base learners in this case. There is, however, a very nice paper on random forests that seems to tackle many of these issues: https://www.tandfonline.com/doi/full/10.1080/01621459.2021.1950003.

To me, this seems like a very open research problem, but perhaps NGBoost provides the right framework? I'd love to hear if you have any thoughts on this.

alejandroschuler commented 1 year ago

I think my argument still applies: just marginalize over $Y_j$. So nothing is necessarily biased if what you're defining "bias" relative to is the conditional (marginal) density $P(Y_i|X_i)$. What is true is that you're leaving money on the table by not using all the available information. So your prediction bands are going to be wider than they otherwise could be, and your point estimates more variable than they otherwise could be. So this is certainly an area for methodological innovation, and it could for sure be done in the NGBoost framework.
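One way to make both points precise is with two standard identities (a worked sketch, assuming the relevant second moments exist). The tower property shows that averaging over $Y_j$ leaves the target unchanged:

$$
E\big[\,E[Y_i \mid X_i, Y_j] \,\big|\, X_i\big] = E[Y_i \mid X_i].
$$

The law of total variance shows where the lost precision goes:

$$
\operatorname{Var}(Y_i \mid X_i) = E\big[\operatorname{Var}(Y_i \mid X_i, Y_j) \,\big|\, X_i\big] + \operatorname{Var}\big(E[Y_i \mid X_i, Y_j] \,\big|\, X_i\big).
$$

The second term is non-negative, so on average the spread of $Y_i$ given both $X_i$ and $Y_j$ is no larger than the spread given $X_i$ alone. A model that ignores $Y_j$ stays valid for $P(Y_i|X_i)$ but gives up exactly that second term.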