neurodata / ProgLearn

NeuroData's package for exploring and using progressive learning algorithms
https://proglearn.neurodata.io

Introduce honest sampling into sklearn RandomForestClassifier #387

Closed EYezerets closed 3 years ago

EYezerets commented 3 years ago

Is your feature request related to a problem? Please describe.

UncertaintyForest in ProgLearn implements both honest sampling (partitioning the feature space on one subset of samples and estimating posteriors on a disjoint subset, i.e. the structure/estimation points of Denil et al.) and finite-sample correction, neither of which is currently available in sklearn RandomForestClassifier.

Implementing these features in sklearn would streamline ProgLearn (removing the need to define a separate UncertaintyForest class) and let other users control these aspects of their random forests, rather than only toggling bootstrapping on and off, which is all RandomForestClassifier currently exposes.
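For concreteness, here is a minimal sketch of honest posterior estimation with a single stock sklearn tree (illustrative only, not ProgLearn's UncertaintyForest implementation): the splits are chosen on a "structure" subset, and the leaf posteriors are then re-estimated on a disjoint "estimation" subset via tree.apply().

```python
# Minimal sketch of honest posterior estimation with one scikit-learn tree
# (illustrative only, not the UncertaintyForest implementation):
# splits come from the structure subset, leaf posteriors from the estimation subset.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Disjoint subsets: structure points choose the splits,
# estimation points supply the leaf posteriors.
X_struct, X_est, y_struct, y_est = train_test_split(X, y, test_size=0.5, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_struct, y_struct)
n_classes = tree.n_classes_

# Class frequencies per leaf, computed only from the estimation points.
est_leaves = tree.apply(X_est)
honest_posterior = {
    leaf: np.bincount(y_est[est_leaves == leaf], minlength=n_classes)
    / np.sum(est_leaves == leaf)
    for leaf in np.unique(est_leaves)
}

# Honest prediction for new points: look up the leaf each point lands in;
# leaves with no estimation points fall back to a uniform prior.
uniform = np.full(n_classes, 1.0 / n_classes)
X_new = X_est[:5]
proba = np.vstack([honest_posterior.get(leaf, uniform) for leaf in tree.apply(X_new)])
print(proba)
```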

From Wager and Athey 2018 (section 2.4: Honest Trees and Forests):

In our discussion so far, we have emphasized the flexible nature of our results: for a wide variety of causal forests that can be tailored to the application area, we achieve both consistency and centered asymptotic normality, provided the subsample size s scales at an appropriate rate. Our results do, however, require the individual trees to satisfy a fairly strong condition, which we call honesty: a tree is honest if, for each training example i, it only uses the response Yi to estimate the within-leaf treatment effect τ using (5) or to decide where to place the splits, but not both.

In addition, Wager and Athey note that subsampling does not "waste" training data:

However, in our case, the forest subsampling mechanism enables us to achieve honesty without wasting any data in this sense, because we rerandomize the I/J-data splits over each subsample. Thus, although no data point can be used for split selection and leaf estimation in a single tree, each data point will participate in both I and J samples of some trees, and so will be used for both specifying the structure and treatment effect estimates of the forest. Although our original motivation for considering double-sample trees was to eliminate bias and thus enable centered confidence intervals, we find that in practice, double-sample trees can improve upon standard random forests in terms of mean-squared error as well.

Key references:

Guo R, Mehta R, Arroyo J, Helm H, Shen C, Vogelstein JT. Estimating Information-Theoretic Quantities with Uncertainty Forests. 2019; 1–19. Available: http://arxiv.org/abs/1907.00325

Wager S, Athey S. Estimation and Inference of Heterogeneous Treatment Effects using Random Forests. J Am Stat Assoc. 2018;113: 1228–1242. doi:10.1080/01621459.2017.1319839

Denil M, Matheson D, De Freitas N. Narrowing the Gap: Random Forests In Theory and In Practice. Proc 31st Int Conf Mach Learn. 2014; 665–673. Available: http://jmlr.org/proceedings/papers/v32/denil14.html

Describe the solution you'd like

After talking with the EconML team and Randal Burns, we plan to analyze how EconML implements its honest regressors and adapt that Cython tree implementation into honest classification trees for ProgLearn and sklearn.

Update (2/23): I made a fork of the sklearn repository and will update DecisionTreeClassifier in _classes.py, as well as _tree.pyx and _splitter.pyx. Then I will figure out how to call them from forest.py when building an UncertaintyForest in ProgLearn. If this is successful, I will draft an issue in sklearn.
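Purely as a strawman for what the user-facing knob could look like once those files are modified: the parameter names below (honest, honest_fraction) are hypothetical and do not exist in scikit-learn or in the fork today.

```python
# Strawman only: `honest` and `honest_fraction` are hypothetical parameter
# names for discussion; they are NOT part of scikit-learn's current API.
# The snippet runs today only because the proposed arguments are commented out.
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(
    n_estimators=100,
    # honest=True,           # proposed: split each tree's sample into
    # honest_fraction=0.5,   # structure/estimation halves of this proportion
)
```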

Describe alternatives you've considered

One possibility is to set bootstrap = False in sklearn RandomForestClassifier so that every tree is fit on the whole dataset, and then specify which (disjoint) subsets of that dataset are used for partitioning vs. estimating posteriors.

Another possibility is to set bootstrap = True and modify the implementation of max_samples so that each tree's sample is drawn without replacement and split, in specified proportions, into disjoint partitioning and estimation subsets. A rough sketch of this style of workaround follows.
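Since RandomForestClassifier offers no hook for a second, disjoint per-tree subset today, the sketch below (illustrative only; the helper names are ours, and this is neither ProgLearn nor scikit-learn code) builds the forest by hand: each tree draws its own subsample without replacement, splits it into structure/estimation halves, and re-estimates its leaf posteriors on the estimation half. It assumes integer class labels 0..K-1.

```python
# Rough sketch of the workaround: build the forest manually so each tree gets
# its own subsample, drawn without replacement and split into disjoint
# structure/estimation halves. Illustrative only, not an existing implementation.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_honest_forest(X, y, n_estimators=100, subsample=0.6, struct_frac=0.5, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    n_sub = int(subsample * n)
    n_struct = int(struct_frac * n_sub)
    n_classes = len(np.unique(y))          # assumes labels are 0..K-1
    forest = []
    for _ in range(n_estimators):
        idx = rng.choice(n, size=n_sub, replace=False)  # sample without replacement
        struct, est = idx[:n_struct], idx[n_struct:]    # disjoint halves
        tree = DecisionTreeClassifier(
            max_features="sqrt", random_state=int(rng.integers(2**31))
        ).fit(X[struct], y[struct])                     # splits from structure points only
        # Re-estimate every leaf's posterior from the estimation points only.
        leaves = tree.apply(X[est])
        post = {
            leaf: np.bincount(y[est][leaves == leaf], minlength=n_classes)
            / np.sum(leaves == leaf)
            for leaf in np.unique(leaves)
        }
        forest.append((tree, post))
    return forest, n_classes

def predict_proba_honest(forest, n_classes, X_new):
    # Average the per-tree honest posteriors; empty leaves fall back to uniform.
    uniform = np.full(n_classes, 1.0 / n_classes)
    probs = np.zeros((X_new.shape[0], n_classes))
    for tree, post in forest:
        leaves = tree.apply(X_new)
        probs += np.vstack([post.get(leaf, uniform) for leaf in leaves])
    return probs / len(forest)
```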

Additional context (e.g. screenshots)

PSSF23 commented 3 years ago

Is it an extension or milestone towards #9?

EYezerets commented 3 years ago

Oh @PSSF23, so sorry, I didn't see that earlier issue! I guess this is just a more detailed description/extension of it?

EYezerets commented 3 years ago

New text of issue in sklearn:

Is your feature request related to a problem? Please describe.

Honest sampling in forests was first outlined by Breiman et al. 1984 in Classification and Regression Trees, in which it was suggested that performance improves when the feature space is partitioned on one subset of samples and the posteriors are estimated on a disjoint subset. This idea was revived by Denil et al. as structure vs. estimation points, and clarified and implemented by Wager and Athey, who identified several benefits of honest sampling: reduced bias, centered confidence intervals, reduced mean squared error, and the possibility of building causal forests.
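For intuition, here is a toy illustration (ours, not from these papers) of the overconfidence that honesty removes: on pure-noise data the true posterior is 0.5 everywhere, but leaf posteriors read off the same points that chose the splits collapse to 0 or 1, while posteriors re-estimated on a disjoint estimation set stay much closer to 0.5.

```python
# Toy illustration of the bias honesty removes (not from the cited papers).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
y = rng.integers(0, 2, size=2000)            # labels independent of X: true posterior is 0.5

X_struct, y_struct = X[:1000], y[:1000]      # structure (split-selection) set
X_est, y_est = X[1000:], y[1000:]            # estimation set

tree = DecisionTreeClassifier(random_state=0).fit(X_struct, y_struct)

# Dishonest posteriors: evaluated on the same points that chose the splits.
p_dishonest = tree.predict_proba(X_struct)[:, 1]

# Honest posteriors: each leaf's class frequency recomputed on the estimation set
# (leaves with no estimation points fall back to 0.5).
leaf_est = tree.apply(X_est)
p_honest = np.array([
    y_est[leaf_est == leaf].mean() if np.any(leaf_est == leaf) else 0.5
    for leaf in tree.apply(X_struct)
])

print("mean |p - 0.5|, dishonest:", np.abs(p_dishonest - 0.5).mean())  # ~0.5 (fully overconfident)
print("mean |p - 0.5|, honest:   ", np.abs(p_honest - 0.5).mean())     # markedly smaller
```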

From Wager and Athey 2018 (section 2.4: Honest Trees and Forests):

In our discussion so far, we have emphasized the flexible nature of our results: for a wide variety of causal forests that can be tailored to the application area, we achieve both consistency and centered asymptotic normality, provided the subsample size s scales at an appropriate rate. Our results do, however, require the individual trees to satisfy a fairly strong condition, which we call honesty: a tree is honest if, for each training example i, it only uses the response Yi to estimate the within-leaf treatment effect τ using (5) or to decide where to place the splits, but not both.

In addition, Wager and Athey note that subsampling does not "waste" training data:

However, in our case, the forest subsampling mechanism enables us to achieve honesty without wasting any data in this sense, because we rerandomize the I/J-data splits over each subsample. Thus, although no data point can be used for split selection and leaf estimation in a single tree, each data point will participate in both I and J samples of some trees, and so will be used for both specifying the structure and treatment effect estimates of the forest. Although our original motivation for considering double-sample trees was to eliminate bias and thus enable centered confidence intervals, we find that in practice, double-sample trees can improve upon standard random forests in terms of mean-squared error as well.

Describe the solution you’d like

EconML has forked scikit-learn to create honest trees and generalized random forests for causal questions. We intend, instead, to merge honest trees back into scikit-learn, building on EconML's insights for regression trees while also adding honest classification trees that accept both dense and sparse data.

Key references:

Breiman L, Friedman JH, Olshen RA, Stone CJ. Classification and Regression Trees. Wadsworth; 1984.

Denil M, Matheson D, De Freitas N. Narrowing the Gap: Random Forests In Theory and In Practice. Proc 31st Int Conf Mach Learn. 2014; 665–673. Available: http://jmlr.org/proceedings/papers/v32/denil14.html

Wager S, Athey S. Estimation and Inference of Heterogeneous Treatment Effects using Random Forests. J Am Stat Assoc. 2018;113: 1228–1242. doi:10.1080/01621459.2017.1319839

Guo R, Mehta R, Arroyo J, Helm H, Shen C, Vogelstein JT. Estimating Information-Theoretic Quantities with Uncertainty Forests. 2019; 1–19. Available: http://arxiv.org/abs/1907.00325