stanfordmlgroup / ngboost

Natural Gradient Boosting for Probabilistic Prediction
Apache License 2.0

Multioutput chained regression #137

Closed astrogilda closed 4 years ago

astrogilda commented 4 years ago

Hello,

My problem calls for predicting two output labels. While I can do this by creating two independent models, my outputs are co-dependent, and I'd like to use the RegressorChain wrapper from sklearn--it uses the union of the 1st model's prediction and the input features to predict the output of the 2nd model.

For point predictions using other models, this exercise is trivial. However, since ngboost outputs both mu and sigma (say, assuming a normal output for both labels), does this mean that to get predictions for the second label I should append both pred_mu_label1 AND pred_sigma_label1 to the input features (to use as the training set for the second model), or ONLY pred_mu_label1? If the former, it throws an error, since sklearn's wrapper does not support the 'pred_dist' method required to output the full distribution; do you have any suggestions for getting around this?

alejandroschuler commented 4 years ago

Interesting... I'd say it depends. If you think the estimated variance of the first outcome is useful in predicting the distribution of the second outcome, then you should append both to the input features for the second model. If you think the distribution of the second outcome depends only on the expected value of the first outcome, then you can safely use the predict() method to return just that. While in theory you can only gain from including an extra predictor (the variance of the first outcome), my intuition is that in 99% of cases it won't significantly help prediction.

That raises another question, though: since both of these are ngboost models and you haven't mentioned that the first is fit with a larger dataset or anything like that, I'd be very surprised if doing this two-step prediction measurably improves prediction of the second outcome over predicting it directly from the original input features alone. The first model stage cannot add any information beyond what is already in the data, which the second model stage can already exploit efficiently.

astrogilda commented 4 years ago

Thanks for the quick and helpful response!

So the real reason I want to do this is for a downstream task: getting feature importances using SHAP. Right now, a few of the top features for label2 are present only because they are also top features for label1, and the two labels are correlated. Other than that, these features shouldn't physically have much impact on label2. My hope is that if I include pred_label1 as a feature in the input dataset for predicting label2, I can calculate SHAP values for all features, then throw out the Shapley values corresponding to pred_label1, thus breaking the correlation. Does this make sense? Here is the issue I raised there--https://github.com/slundberg/shap/issues/1289.

alejandroschuler commented 4 years ago

Makes sense. May or may not work. You might also try including the actual labels for the first outcome as predictors for the second if all you're trying to do is understand the correlation structure. (Also remember that at the end of the day shapley values don't mean anything: https://arxiv.org/abs/1606.03490)

astrogilda commented 4 years ago

It actually ended up working :) That paper is certainly interesting, and given that it has over 1200 citations, I bet it's quite important. I'll give it a read, thanks for linking!
