scikit-learn / scikit-learn

scikit-learn: machine learning in Python
https://scikit-learn.org
BSD 3-Clause "New" or "Revised" License

Add random forest row-subsampling without replacement #20953

Open markusloecher opened 2 years ago

markusloecher commented 2 years ago

Describe the workflow you want to enable

I do appreciate the current option of disabling bootstrapping via the Boolean argument bootstrap. Currently, however, there is only one alternative: if False, the whole (identical) dataset is used to build each tree. Yet there is well-known research (starting with the paper by Strobl et al. from 2007) showing that in certain situations subsamples drawn without replacement lead to better performance. Many well-known random forest implementations (such as ranger) offer this subsampling as an alternative to the bootstrap.

I would greatly appreciate it if sklearn would offer the same.

Describe your proposed solution

For both RandomForestRegressor and RandomForestClassifier, allow the user to draw subsamples without replacement for each tree instead of bootstrapping. The user can choose the fraction of rows drawn for each tree (default: 0.632). Ideally the functions _generate_unsampled_indices and _generate_sampled_indices would still work.
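To make the request concrete, here is a minimal NumPy sketch (not scikit-learn code; the function name and signature are made up for illustration) contrasting the current bootstrap with the requested without-replacement subsampling. The 0.632 default mirrors the expected fraction of unique rows in a bootstrap sample (1 - 1/e):

```python
import numpy as np

def draw_tree_indices(n_samples, bootstrap=True, sample_fraction=0.632, rng=None):
    """Illustrative sketch: row indices used to fit one tree.

    bootstrap=True draws n_samples indices with replacement (current
    behaviour); bootstrap=False here draws a subsample without
    replacement, which is the behaviour this issue requests.
    """
    rng = np.random.default_rng(rng)
    if bootstrap:
        # bootstrap: same size as the data, rows can repeat
        return rng.integers(0, n_samples, size=n_samples)
    # subsample without replacement: smaller, no repeated rows
    size = int(round(sample_fraction * n_samples))
    return rng.choice(n_samples, size=size, replace=False)

boot = draw_tree_indices(1000, bootstrap=True, rng=0)
sub = draw_tree_indices(1000, bootstrap=False, rng=0)
print(len(np.unique(boot)) / len(boot))  # ~0.63: bootstrap repeats rows
print(len(np.unique(sub)) / len(sub))    # exactly 1.0: no repeats
```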

Describe alternatives you've considered, if relevant

No response

Additional context

No response

API considerations

Currently, random forests have the option bootstrap=True or False. If this new feature of tree-wise row-subsampling without replacement is added, there are several options:

  1. Add a new option with_replacement=True (default), or False, that only takes effect if bootstrap=True. Disadvantage: The term bootstrap explicitly means sampling with replacement.
  2. Add a new option row_subsampling=True (default) or False, which samples with replacement if bootstrap=True and without replacement if bootstrap=False. Disadvantage: It would change current behaviour for bootstrap=False, which currently means no sampling at all.
  3. Add new option sampling="bootstrap" (default), and allow callable / splitter to be passed. Deprecate option bootstrap (proposed in https://github.com/scikit-learn/scikit-learn/issues/20953#issuecomment-923957749).
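Option 3 could look roughly like the following self-contained sketch. Everything here is hypothetical, not existing scikit-learn API: the sampling parameter, the assumed callable signature (number of rows plus a Generator, returning row indices), and the helper names are invented for illustration only.

```python
import numpy as np

# Hypothetical user-supplied callable for a future ``sampling`` parameter.
def subsample_632(n_samples, rng):
    """Draw 63.2% of the rows without replacement for one tree."""
    size = int(round(0.632 * n_samples))
    return rng.choice(n_samples, size=size, replace=False)

def fit_one_tree_indices(sampling, n_samples, seed):
    """Sketch of what the forest could do internally per tree."""
    rng = np.random.default_rng(seed)
    if sampling == "bootstrap":  # proposed default, replaces bootstrap=True
        return rng.integers(0, n_samples, size=n_samples)
    return sampling(n_samples, rng)  # callable supplied by the user

idx = fit_one_tree_indices(subsample_632, n_samples=100, seed=42)
```

A splitter-like object could be supported the same way by calling a method on it instead of the bare callable.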
venkyyuvy commented 2 years ago

Has there been any feedback for this?

markusloecher commented 2 years ago

No response yet. Allow me to point out other well-known random forest implementations which offer this subsampling as an alternative to the bootstrap:

Thanks

venkyyuvy commented 2 years ago

ping @jnothman @NicolasHug

lorentzenchr commented 2 years ago

What is your concrete proposal? Adding an option replace = True/False to random forests? A related issue seems to be #20177.

venkyyuvy commented 2 years ago

I think the proposal is to have a way to do subsampling without bootstrapping (bootstrap=False). Currently, when we set bootstrap=False, the whole dataset is used for each tree and the max_samples parameter has no effect.

I guess the referred issue is about a custom function for sampling, hence bootstrap=True.

markusloecher commented 2 years ago

Yes, exactly, my proposal was to allow sub-sampling instead of the bootstrap, which is not currently possible.
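As an aside, until the forest classes support this directly, sampling without replacement can already be approximated with the existing BaggingClassifier, which accepts bootstrap=False together with max_samples; giving the base tree per-split feature subsetting via max_features makes it behave much like a random forest. A sketch of this workaround:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# bootstrap=False makes BaggingClassifier draw max_samples rows
# *without* replacement for each estimator -- the sampling scheme
# requested in this issue.
clf = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt"),  # RF-style feature subsetting
    n_estimators=100,
    bootstrap=False,
    max_samples=0.632,
    random_state=0,
)
clf.fit(X, y)
print(clf.score(X, y))
```

The main differences from a native random forest option are in out-of-bag handling and in helpers like _generate_unsampled_indices, which this workaround does not cover.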

lorentzenchr commented 2 years ago

@markusloecher I extended the PR description at the top. I'm +1 for this feature, but would very much like to hear other opinions. Then, there is the question of API. I would favor option 2 in the description: A new option row_subsampling (naming to be discussed) which would slightly change current behaviour, but seems semantically clean.

@amueller @adrinjalali @glemaitre @NicolasHug You seem to care about random forests :smirk:

adrinjalali commented 2 years ago

an alternative API would be to deprecate bootstrap, and add a sampling parameter, with "bootstrap" as default, and accept a callable, or a splitter which would give the subsamples.

venkyyuvy commented 2 years ago

Does max_samples become invalid when sampling is a callable or splitter?

glemaitre commented 2 years ago

In terms of API, I like the proposal of @adrinjalali

In terms of including this new feature, I would be reluctant. I would only consider this option if a clean benchmark shows that we can sometimes get clear/significant improvements. My reasoning is that our tree-based algorithms already come with a large number of parameters and options. Increasing the number of options and parameters makes it difficult to document the important things regarding the algorithm.

So my reasoning here is: if the option does not make a clear difference in generalization performance for at least a certain percentage of datasets (maybe a minimum of ~20%), then we might be able to live without it. In this regard, I like the proposal of @adrinjalali because it opens the possibility for people to implement their own strategy and avoids complicating our documentation with the available options implemented in scikit-learn.

lorentzenchr commented 2 years ago

In general, I would say that we should base our decision on well-conducted research and not benchmark everything ourselves. In this case, Strobl et al.'s focus was on variable selection and importance, not predictive performance. They conclude:

> From our simulation results we can see, however, that the effect of bootstrap sampling is mostly superposed by the much stronger effect of variable selection bias when comparing the conditions of sampling with and without replacement for the randomForest function only.

@markusloecher Do you have references that investigated the impact of the row subsampling scheme w.r.t. predictive performance?

On the other side, the ranger package has an option replace = TRUE or FALSE. With @adrinjalali's proposal (3), we would not increase the number of parameters. So I'm still +1.

zeileis commented 2 years ago

I'm not aware of benchmark experiments that found systematic differences in terms of predictive performance between random forest variants with vs. without resampling. I'll ping Carolin (Strobl) and Anne-Laure (Boulesteix) who may know more. In my opinion that is not the point, though. The idea would be to use subsampling because then you may get unbiased variable selection - while preserving the predictive performance.

markusloecher commented 2 years ago

I also do not know of references that investigated the impact of the row subsampling scheme w.r.t. predictive performance. However, in light of the growing importance of interpretable machine learning, the sole focus on prediction loss is maybe less justifiable than in the past?

ogrisel commented 2 years ago

I had a quick glance at https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-8-25 and it seems that, when considering variable importance, the ability to do subsampling without replacement is mostly useful for cforest, which is not based on Gini-based splits as done in scikit-learn's RF and R's randomForest.

That being said, I am not opposed to having an option to control the subsampling strategy but I am not sure it will bring any practical benefit to our users on its own. I still need to read the reference more carefully to make sure I did not overlook anything important in the context of the scikit-learn implementation.

zeileis commented 2 years ago

You are right that the variable importance measures will only be unbiased if subsampling is combined with unbiased variable selection in the individual trees. However, the unbiased selection does not necessarily need to be based on conditional inference (as in cforest). It could also be combined with Gini-based splits, provided that these are adjusted appropriately for the number of possible splits etc.; see Strobl, Boulesteix, Augustin (2007), Computational Statistics & Data Analysis, 52(1), 483-501. doi:10.1016/j.csda.2006.12.030