neurodata / treeple

Scikit-learn compatible decision trees beyond those offered in scikit-learn
https://treeple.ai

ENH, FIX i) `build_oob_forest` backwards compatibility with sklearn and ii) HonestForest stratification during bootstrap #283

Closed adam2392 closed 4 months ago

adam2392 commented 4 months ago

Changes proposed in this pull request:

Stratification should occur every time we sample the dataset, whether by subsampling or by bootstrapping:

  1. when we bootstrap-sample the dataset to get the in-bag and out-of-bag (oob) samples, we stratify using sklearn.utils.resample (this PR)
  2. when we split the unique in-bag samples in halves to get the structure and honest datasets, we stratify using StratifiedShuffleSplit.
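
The two stratified sampling steps above can be sketched with plain scikit-learn calls. This is a minimal illustration under stated assumptions, not the code in this PR: the toy labels and all variable names (`inbag_idx`, `oob_idx`, `structure_idx`, `honest_idx`) are hypothetical.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.utils import resample

# Toy labels (assumption): 60 samples of class 0 and 40 of class 1.
y = np.array([0] * 60 + [1] * 40)
indices = np.arange(len(y))

# Step 1: stratified bootstrap. Sampling with replacement while preserving
# the class proportions of y in the in-bag set.
inbag_idx = resample(
    indices, replace=True, n_samples=len(y), stratify=y, random_state=0
)
# Out-of-bag samples are those never drawn into the bootstrap.
oob_idx = np.setdiff1d(indices, inbag_idx)

# Step 2: split the unique in-bag samples in halves, again stratified by class.
unique_inbag = np.unique(inbag_idx)
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=0)
structure_pos, honest_pos = next(splitter.split(unique_inbag, y[unique_inbag]))
structure_idx = unique_inbag[structure_pos]  # would be used to grow the tree
honest_idx = unique_inbag[honest_pos]        # would be used to estimate posteriors

print("in-bag class counts:   ", np.bincount(y[inbag_idx]))
print("structure class counts:", np.bincount(y[structure_idx]))
print("honest class counts:   ", np.bincount(y[honest_idx]))
```

In this sketch the in-bag class counts match the class counts of `y`, and the structure and honest halves are disjoint with near-identical class balance.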

Summary

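On the main branch, using the following test:

import numpy as np
from sklearn.metrics import roc_auc_score

from sktree import HonestForestClassifier


def test_honest_forest_posteriors_on_independent():
    from sktree.datasets import make_trunk_classification

    scores = []
    for idx in range(5):
        # Both class means are 0, so X is independent of y and the
        # expected AUC is 0.5.
        X, y = make_trunk_classification(
            n_samples=128, n_dim=4096, n_informative=1, mu_0=0.0, mu_1=0.0, seed=idx
        )
        clf = HonestForestClassifier(
            n_estimators=100,
            random_state=idx,
            bootstrap=True,
            max_samples=1.6,
            n_jobs=-1,
            honest_prior="ignore",
            stratify=True,
        )
        clf.fit(X, y)

        # Per-tree posteriors on the out-of-bag samples; nanmean averages
        # over trees while skipping NaN entries.
        oob_posteriors = clf.predict_proba_per_tree(X, clf.oob_samples_)
        auc_score = roc_auc_score(y, np.nanmean(oob_posteriors, axis=0)[:, 1])
        scores.append(auc_score)

    print(np.mean(scores), scores)
    assert np.mean(scores) > 0.49, f"{np.mean(scores)} {scores}"
    # Debugging aid: force a failure so pytest shows the printed scores
    # even when the check above passes.
    assert False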

we get the error:

(sktree) (base) adam2392@arm64-apple-darwin20 scikit-tree % pytest ./sktree/tests/test_honest_forest.py::test_honest_forest_posteriors_on_independent
==================================================================== test session starts ====================================================================
platform darwin -- Python 3.9.18, pytest-8.2.2, pluggy-1.5.0 -- /Users/adam2392/miniforge3/envs/sktree/bin/python3.9
cachedir: .pytest_cache
rootdir: /Users/adam2392/Documents/scikit-tree
configfile: pyproject.toml
plugins: cov-5.0.0, flaky-3.8.1
collected 1 item                                                                                                                                            

sktree/tests/test_honest_forest.py::test_honest_forest_posteriors_on_independent FAILED                                                               [100%]

========================================================================= FAILURES ==========================================================================
_______________________________________________________ test_honest_forest_posteriors_on_independent ________________________________________________________

    def test_honest_forest_posteriors_on_independent():
        from sktree.datasets import make_trunk_classification

        seed = 12345
        scores = []
        for idx in range(5):
            X, y = make_trunk_classification(
                n_samples=128, n_dim=4096, n_informative=1, mu_0=0.0, mu_1=0.0, seed=idx
            )
            clf = HonestForestClassifier(
                n_estimators=100,
                random_state=idx,
                bootstrap=True,
                max_samples=1.6,
                n_jobs=-1,
                honest_prior="ignore",
                stratify=True,
            )
            clf.fit(X, y)

            oob_posteriors = clf.predict_proba_per_tree(X, clf.oob_samples_)
            auc_score = roc_auc_score(y, np.nanmean(oob_posteriors, axis=0)[:, 1])
            scores.append(auc_score)

        print(np.mean(scores), scores)
>       assert np.mean(scores) > 0.49, f"{np.mean(scores)} {scores}"
E       AssertionError: 0.47548828125 [0.49951171875, 0.479736328125, 0.408203125, 0.464111328125, 0.52587890625]

sktree/tests/test_honest_forest.py:519: AssertionError

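However, on this branch the same test prints 0.50498046875 [0.484375, 0.53076171875, 0.513671875, 0.46533203125, 0.53076171875]. The mean out-of-bag AUC is back at roughly 0.5, the value expected for independent data, which indicates that stratifying the bootstrap fixes the bias.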
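
To illustrate what stratifying the bootstrap changes, the sketch below compares the per-draw class balance of a plain bootstrap with a stratified one. It assumes balanced toy labels and uses only `sklearn.utils.resample`; it does not reproduce the forest or the AUC numbers above, and it makes no claim about the exact mechanism behind the bias.

```python
import numpy as np
from sklearn.utils import resample

# Balanced toy labels (assumption): 64 samples per class.
y = np.array([0] * 64 + [1] * 64)
indices = np.arange(len(y))

plain_counts, stratified_counts = [], []
for seed in range(200):
    # Plain bootstrap: class balance of the in-bag set varies from draw to draw.
    plain = resample(indices, replace=True, random_state=seed)
    # Stratified bootstrap: class balance of the in-bag set matches y.
    strat = resample(indices, replace=True, stratify=y, random_state=seed)
    plain_counts.append(int(y[plain].sum()))
    stratified_counts.append(int(y[strat].sum()))

print("plain: class-1 count min/max     ", min(plain_counts), max(plain_counts))
print("stratified: class-1 count min/max", min(stratified_counts), max(stratified_counts))
```

With the plain bootstrap, the number of class-1 samples in the bag fluctuates across draws; with `stratify=y` it stays fixed at the class count of `y`.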

adam2392 commented 4 months ago

Interestingly, this is not an issue with RandomForestClassifier, so I suspect it is related to the empty leaves, or to the fact that we use a separate dataset to estimate the posteriors.

codecov[bot] commented 4 months ago

Codecov Report

Attention: Patch coverage is 82.14286% with 10 lines in your changes missing coverage. Please review.

Project coverage is 78.55%. Comparing base (b8da7b0) to head (290c5f6). Report is 1 commits behind head on main.

| Files | Patch % | Lines |
|---|---|---|
| sktree/ensemble/_honest_forest.py | 77.77% | 4 Missing and 2 partials |
| sktree/tree/_honest_tree.py | 80.00% | 2 Missing and 1 partial |
| sktree/stats/forestht.py | 88.88% | 0 Missing and 1 partial |

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main     #283      +/-   ##
==========================================
+ Coverage   76.79%   78.55%   +1.75%
==========================================
  Files          25       24       -1
  Lines        2267     2252      -15
  Branches      409      414       +5
==========================================
+ Hits         1741     1769      +28
+ Misses        402      352      -50
- Partials      124      131       +7
```
