markusloecher opened 2 years ago
Has there been any feedback for this?
No response yet. Allow me to point out other well-known random forest implementations that offer this subsampling as an alternative to the bootstrap:
Thanks
ping @jnothman @NicolasHug
What is your concrete proposal?
Adding an option `replace=True/False` to random forests?
A related issue seems to be #20177.
I think the proposal is to have a way to do subsampling without bootstrapping (`bootstrap=False`). Currently, when we set `bootstrap=False`, the whole dataset is used for each tree and the `max_samples` parameter has no effect. I guess the referred issue is about a custom function for sampling, hence `bootstrap=True`.
Yes, exactly, my proposal was to allow sub-sampling instead of the bootstrap, which is not currently possible.
@markusloecher I extended the PR description at the top. I'm +1 for this feature, but would very much like to hear other opinions.
Then there is the question of API. I would favor option 2 in the description: a new option `row_subsampling` (naming to be discussed), which would slightly change current behaviour but seems semantically clean.
@amueller @adrinjalali @glemaitre @NicolasHug You seem to care about random forests :smirk:
An alternative API would be to deprecate `bootstrap` and add a `sampling` parameter, with `"bootstrap"` as default, and accept a callable or a splitter which would give the subsamples. `max_samples` would become invalid when `sampling` is a callable or splitter?
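To make that contract concrete, here is a minimal NumPy sketch of what such a `sampling` callable could look like. All names below are hypothetical; nothing in this interface exists in scikit-learn today.

```python
import numpy as np

# Hypothetical contract: the forest would call `sampling(n_samples, seed)`
# once per tree and train that tree on the returned row indices.
# Neither function below is part of scikit-learn; this is only a sketch.
def bootstrap_sampling(n_samples, seed):
    rng = np.random.default_rng(seed)
    return rng.choice(n_samples, size=n_samples, replace=True)

def subsample_without_replacement(n_samples, seed, fraction=0.632):
    rng = np.random.default_rng(seed)
    size = int(round(fraction * n_samples))
    return rng.choice(n_samples, size=size, replace=False)

indices = subsample_without_replacement(100, seed=0)
assert len(indices) == 63
assert len(np.unique(indices)) == 63  # no duplicates: drawn without replacement
```

Users could then pass either builtin by name or their own callable, which is why this variant does not grow the parameter list for every new strategy.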
In terms of API, I like the proposal of @adrinjalali
In terms of including this new feature, I would be reticent. I would only consider this option if a clean benchmark shows that we can sometimes get clear, significant improvements. My reasoning is that our tree-based algorithms already come with a large number of parameters and options. Increasing the number of options and parameters makes it difficult to document the important things regarding the algorithm.
So my reasoning here is: if the option does not make a clear-cut improvement in generalization performance for at least a certain percentage of datasets (maybe a minimum of ~20%), then we can probably live without it. In this regard, I like the proposal of @adrinjalali because it opens the possibility for people to implement their own strategy and avoids complicating our documentation with all the options implemented in scikit-learn.
In general, I would say that we should base our decision on well-conducted research and not benchmark everything ourselves. In this case, Strobl et al.'s focus was on variable selection and importance, not predictive performance. They conclude:
From our simulation results we can see, however, that the effect of bootstrap sampling is mostly superposed by the much stronger effect of variable selection bias when comparing the conditions of sampling with and without replacement for the randomForest function only.
@markusloecher Do you have references that investigated the impact of the row subsampling scheme w.r.t. predictive performance?
On the other side, the ranger package has an option `replace = TRUE/FALSE`. With @adrinjalali's proposal (3), we would not increase the number of parameters. So I'm still +1.
I'm not aware of benchmark experiments that found systematic differences in terms of predictive performance between random forest variants with vs. without resampling. I'll ping Carolin (Strobl) and Anne-Laure (Boulesteix) who may know more. In my opinion that is not the point, though. The idea would be to use subsampling because then you may get unbiased variable selection - while preserving the predictive performance.
I also do not know references that investigated the impact of the row subsampling scheme w.r.t. predictive performance. However, in the light of the growing importance of interpretable machine learning, the sole focus on prediction loss is maybe less justifiable than in the past?
I had a quick glance at https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-8-25 and it seems that, when considering variable importance, the ability to do subsampling without replacement is mostly useful for cforest, which is not based on Gini-based splits as done in scikit-learn's RF and R's `randomForest`.
That being said, I am not opposed to having an option to control the subsampling strategy but I am not sure it will bring any practical benefit to our users on its own. I still need to read the reference more carefully to make sure I did not overlook anything important in the context of the scikit-learn implementation.
You are right in that the variable importance measures will only be unbiased if subsampling is combined with unbiased variable selection in the individual trees. However, the unbiased selection does not necessarily need to be based on conditional inference (as in cforest). It could also be combined with Gini-based splits provided that these are adjusted appropriately for the number of possible splits etc., see Strobl, Boulesteix, Augustin (2007), Computational Statistics & Data Analysis, 52(1), 483-501. doi:10.1016/j.csda.2006.12.030
Describe the workflow you want to enable
I do appreciate the current option of disabling bootstrapping via the Boolean argument `bootstrap`. However, there is currently only one alternative: if `False`, the whole (identical) dataset is used to build each tree. There is well-known research, though (starting with the paper by Strobl et al. from 2007), showing that in certain situations subsamples drawn without replacement lead to better performance. Many well-known random forest implementations (such as ranger) offer this subsampling as an alternative to the bootstrap. I would greatly appreciate it if sklearn would offer the same.
Describe your proposed solution
For both `RandomForestRegressor` and `RandomForestClassifier`, allow the user to draw subsamples without replacement for each tree instead of the bootstrap. The user can choose the fraction of samples drawn for each tree (default: 0.632). Ideally the functions `_generate_unsampled_indices` and `_generate_sampled_indices` would still work.

Describe alternatives you've considered, if relevant
No response
Additional context
No response
API considerations
Currently, random forests have the option `bootstrap=True` or `False`. If this new feature of tree-wise row-subsampling without replacement is added, there are several options:

1. `with_replacement=True` (default) or `False`, which only takes effect if `bootstrap=True`. Disadvantage: the term bootstrap explicitly means sampling with replacement.
2. `row_subsampling=True` (default) or `False`, which samples with replacement if `bootstrap=True` and without replacement if `bootstrap=False`. Disadvantage: it would change current behaviour for `bootstrap=False`, which currently means no sampling at all.
3. `sampling="bootstrap"` (default), and allow a callable / splitter to be passed. Deprecate the option `bootstrap` (proposed in https://github.com/scikit-learn/scikit-learn/issues/20953#issuecomment-923957749).