stanfordmlgroup / ngboost

Natural Gradient Boosting for Probabilistic Prediction

Is it possible to have features only for certain parameters? #244

Open z-feldman opened 3 years ago

z-feldman commented 3 years ago

I've really enjoyed being able to look at feature importance and SHAP values for the different parameters; it can be really insightful. To take it a step further, I've been wondering whether it's possible to have certain features be specific to some parameters and excluded from the estimation of others.

I was using an automated feature selection tool (I subbed in CatBoost, since the tool doesn't support NGBoost) and it dropped some features from the point-estimate prediction that were at the top of the feature importance for the variance parameter when using NGBoost. That got me thinking about whether there could be a way to tell the model "I want features [a, b, c] to predict parameter_1 but [c, d, e] to predict parameter_2".
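Something like this hypothetical config (not an existing NGBoost option, just to show what I mean) is what I have in mind:

```python
# Hypothetical, not an existing NGBoost option: map each distribution
# parameter to the set of features allowed to predict it.
param_features = {
    "parameter_1": ["a", "b", "c"],
    "parameter_2": ["c", "d", "e"],
}
```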

I'm not sure how the estimation of each parameter works under the hood, so I don't know whether this is feasible. Either way, love the package, thanks for the great work!

alejandroschuler commented 3 years ago

That's definitely feasible. It would be very similar to the method currently used to (randomly) subsample columns at each boosting iteration. You'd just need to keep track of which columns were used per parameter in each iteration, instead of globally per iteration. So it's a little more "paperwork", so to speak, but doable.

see:
https://github.com/stanfordmlgroup/ngboost/blob/master/ngboost/ngboost.py#L134
https://github.com/stanfordmlgroup/ngboost/blob/master/ngboost/ngboost.py#L260
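A rough sketch of what that per-parameter bookkeeping could look like (hypothetical names, not the actual ngboost internals):

```python
# Minimal sketch of per-parameter column subsetting in a boosting iteration.
# All function and variable names here are hypothetical, not ngboost's API.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_one_iteration(X, natural_grads, allowed_cols):
    """X: (n, p) features; natural_grads: (n, k), one column per distribution
    parameter; allowed_cols: dict {param_index: array of column indices}."""
    learners, cols_used = [], []
    for j in range(natural_grads.shape[1]):
        cols = allowed_cols[j]                     # per-parameter column mask
        tree = DecisionTreeRegressor(max_depth=3)
        tree.fit(X[:, cols], natural_grads[:, j])  # fit only on allowed columns
        learners.append(tree)
        cols_used.append(cols)                     # the extra "paperwork"
    return learners, cols_used

def predict_one_iteration(X, learners, cols_used):
    # At predict time, slice the same columns each learner saw when fitting.
    return np.column_stack(
        [t.predict(X[:, c]) for t, c in zip(learners, cols_used)]
    )
```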

Ultimately I'm not sure how much it would change the final predictions. Boosting models basically do their own feature selection, so there's usually not much point to doing it a priori unless you have a strong inductive bias you want to provide (but even then, it's usually easier to let the model figure it out for itself). You're already seeing evidence of this in the feature importances: NGBoost is choosing different features to predict each of the parameters because different features turn out to be more or less useful. Interpreting what that means (or, more likely, doesn't) is a whole different story, of course.
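For example, something like this (a sketch assuming the default Normal distribution, whose parameters are loc and scale; the dataset is just for illustration) pulls the per-parameter importances apart:

```python
# Sketch: inspecting per-parameter feature importances with NGBoost.
from ngboost import NGBRegressor
from sklearn.datasets import fetch_california_housing

X, y = fetch_california_housing(return_X_y=True)
ngb = NGBRegressor(n_estimators=100, verbose=False).fit(X, y)

# One row of importances per distribution parameter:
# row 0 -> loc (the point estimate), row 1 -> scale (the spread).
loc_importances, scale_importances = ngb.feature_importances_
```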

z-feldman commented 3 years ago

Thanks! I'll check that out. My specific problem is a time-series one where I started out using rolling means and standard deviations, since my time-series signal isn't super strong lol. So I planned on keeping each rolling statistic for its respective parameter. I agree that this usually wouldn't be necessary; the other, non-rolling variables I'm going to keep for both parameters, since there's no strong prior on those. Thanks again!
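For concreteness, a sketch of the rolling features I mean (window size, names, and the synthetic series are just illustrative):

```python
# Illustrative rolling features for a time series; window size is arbitrary.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
y = pd.Series(rng.normal(size=200).cumsum())  # stand-in series

features = pd.DataFrame({
    "roll_mean": y.rolling(8).mean(),  # intended for the loc parameter
    "roll_std":  y.rolling(8).std(),   # intended for the scale parameter
}).dropna()
```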