stanfordmlgroup / ngboost

Natural Gradient Boosting for Probabilistic Prediction
Apache License 2.0
1.64k stars 214 forks source link

incorrect feature importance shape due to col_sample #181

Closed samshipengs closed 4 years ago

samshipengs commented 4 years ago

ngboost-0.3.3

I'm trying to get feature importance from ngb, which was fitted with e.g. col_sample=0.8, then when I tried to look at the feature importance I got shape mismatch error (comparing to my input feature length), the returned feature importance length looks like was the length of features after multiplying by col_sample, (when i fitted with col_sample=1, I get all features in feature importance), but I imagine this is not supposed to happen. col_sample should be only used when constructing trees, the feature importance should still be the full feature length.

samshipengs commented 4 years ago

I just started using ngboost and im not familiar with the code base, maybe in the feature importance calculation, when we grab the importance value from each tree, we should map it back to what feature columns were chosen to fit that tree.

alejandroschuler commented 4 years ago

Interesting... thanks for reporting this! I'll look into it as soon as I can, but unfortunately it's a busy time at work for me. I had a quick look and I think the thing to do is to use self.col_idxs (the column indices of the features subsampled at each iteration) to calculate the feature importances, which is basically what you're suggesting. I think with your understanding (and the knowledge that self.col_idxs is a list containing the subsample index for each iteration) you may be able to do this fix yourself faster than I will get to it. You'll just have to set the importance to 0 for the features not present in each tree before averaging so that the arrays are all of the same size.

Note for myself: I also just noticed col_sample isn't in the docstring for the NGBoost class

alejandroschuler commented 4 years ago

@samshipengs if you are willing to give this a try, test it out, and make a pull request it would be super appreciated by the whole community, I am sure!

samshipengs commented 4 years ago

@alejandroschuler thanks for the suggestion. I will give it a shot and try to make a PR.

samshipengs commented 4 years ago

@alejandroschuler I just created a PR, this is my first time making PR to a public project, so if there is something I'm doing wrong or there is a better practice please let me know.

regarding the fix and code i have a few things want to discuss:

  1. Not sure if a inner function is okay here or a class method is preferred.
  2. You'll just have to set the importance to 0, this is what the PR currently has but I'm not sure this is correct, since for cases where some columns did not get sampled or get sampled less than others (this probably won't be an issue for large n_estimators and large enough col_sample) then their importance would be dragged down by the zeros (we are doing an average after), so I'm just thinking would using np.nan make more sense (and exclude nans in averaging)?
  3. Any thoughts on using type annotations?
alejandroschuler commented 4 years ago

@alejandroschuler I just created a PR, this is my first time making PR to a public project, so if there is something I'm doing wrong or there is a better practice please let me know.

regarding the fix and code i have a few things want to discuss:

  1. Not sure if a inner function is okay here or a class method is preferred.
  2. You'll just have to set the importance to 0, this is what the PR currently has but I'm not sure this is correct, since for cases where some columns did not get sampled or get sampled less than others (this probably won't be an issue for large n_estimators and large enough col_sample) then their importance would be dragged down by the zeros (we are doing an average after), so I'm just thinking would using np.nan make more sense (and exclude nans in averaging)?
  3. Any thoughts on using type annotations?

Awesome! Just had a look, replied to these questions in the PR.

alejandroschuler commented 4 years ago

Thanks again @samshipengs !

samshipengs commented 4 years ago

Thanks for the review!