scikit-learn / scikit-learn

scikit-learn: machine learning in Python
https://scikit-learn.org
BSD 3-Clause "New" or "Revised" License

RFE results are inconsistent between machines with ties in feature importances at threshold #29531

Open blitchj opened 1 month ago

blitchj commented 1 month ago

Describe the bug

RFE uses np.argsort on the feature importances returned by the estimator, and np.argsort's default (unstable) sort is not guaranteed to be repeatable across machines. This only matters when ties in the feature importances overlap the elimination threshold. For example, if the importances are [0, 2, 0, 1] and one feature must be eliminated, the two zero-importance features are tied, and different machines may eliminate different ones.
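As a small sketch of the tie-breaking problem (the array below is the hypothetical example from the description above, not output from RFE itself):

```python
import numpy as np

# Hypothetical importances with a tie (two zero-importance features)
# straddling the elimination threshold.
importances = np.array([0, 2, 0, 1])

# The default argsort kind is an unstable introsort; the relative order
# of the two tied zeros is an implementation detail that can vary
# between NumPy builds and architectures.
default_order = np.argsort(importances)

# kind="stable" guarantees tied values keep their original relative
# order, so the same feature is eliminated on every machine.
stable_order = np.argsort(importances, kind="stable")
# stable_order is always [0, 2, 3, 1]: feature 0 is eliminated first.
```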

Steps/Code to Reproduce

It is difficult to write code that demonstrates this, since it requires running on multiple machines. However, the following test could be added to the _rfe unit tests to check whether the sort is stable. Note that a sort could be consistent across machines without being stable, so this test alone is not enough to prove the issue.

def test_rfe_ties_around_threshold():
    X, y = make_classification(n_features=47, random_state=0)
    clf = MockClassifier() # mock classifier returns constant feature_importances
    rfe = RFE(estimator=clf, n_features_to_select=4, step=2)
    rfe.fit(X, y)

    assert_array_equal(rfe.support_, np.array([*[False] * 43, *[True] * 4]))

Expected Results

Expected to get the same results on all machines. Stability is not strictly necessary, but it is an easy way to enforce testable consistency.

Actual Results

The current sort is not stable and fails the unit test above.

Versions

main branch

This issue is most likely to occur with zero-importance features, especially in the initial steps of RFE.

blitchj commented 1 month ago

See also: #29495 and #29496 for discussion on the np.argsort inconsistency between machines.

adrinjalali commented 1 month ago

Also, adding this means we take a performance hit, because we're moving from quicksort to mergesort. So I wonder if it's worth it.

blitchj commented 1 month ago

Also, adding this means we take a performance hit, because we're moving from quicksort to mergesort. So I wonder if it's worth it.

At the scale of the number of input features I'm not sure how big the difference is, so my suggestion may just take even more time.

But could the unstable sort run first, and if there are ties at the threshold, switch to a stable sort, possibly even stable-sorting only the tied region?

Worst case, with lots of big ties, this is obviously much worse. But in the common case with no ties in the relevant region, it may be better?

blitchj commented 1 month ago

Another thought on this. I think getting consistent results on different machines with the same versions is pretty important. But a stable sort is not the only, or best, way to achieve it.

While in the GroupKFold case it may not make sense to add an option, for RFE what if there were an on_ties option? I could imagine at least 3 choices:

  1. Current behavior: not guaranteed to be consistent between machines; probably good enough for many users.
  2. Stable sort: consistent but slower, and biased toward keeping either the first or last features, depending on the direction implemented.
  3. Random: take a random seed and shuffle the ties. Even slower, but repeatable and unbiased. Could be implemented by passing a randomly shuffled np.arange as the secondary key to np.lexsort instead of using np.argsort.
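Option 3 could be sketched roughly like this (argsort_random_ties is a hypothetical helper name, not scikit-learn API):

```python
import numpy as np

def argsort_random_ties(importances, random_state=0):
    """Order features by importance, breaking ties via a seeded shuffle.

    Hypothetical sketch of option 3: a randomly permuted index array is
    used as the secondary sort key in np.lexsort, so ties are broken
    reproducibly for a given seed but without positional bias.
    """
    rng = np.random.RandomState(random_state)
    tiebreak = rng.permutation(len(importances))
    # np.lexsort sorts by the LAST key first, so importances is the
    # primary key and the shuffled indices only decide ties.
    return np.lexsort((tiebreak, importances))

imp = np.array([0, 2, 0, 1])
order = argsort_random_ties(imp, random_state=42)
# order[0] is one of the two tied zero-importance features (0 or 2),
# chosen deterministically by the seed.
```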

adrinjalali commented 1 month ago

So in general, we do not guarantee the same results between machines. Different machines for a broad range of reasons will give different results.

So my question is, why is this particular one so important, given that we'd get a performance hit as a result? The same goes for the other issues you've opened.

Unless we have a very clear motivation, I think we can close them as "wontfix", since we are already giving different results across machines anyway.

blitchj commented 1 month ago

So in general, we do not guarantee the same results between machines. Different machines for a broad range of reasons will give different results.

So my question is, why is this particular one so important, given that we'd get a performance hit as a result? The same goes for the other issues you've opened.

Unless we have a very clear motivation, I think we can close them as "wontfix", since we are already giving different results across machines anyway.

I see, I wasn't familiar with this lack of consistency between machines in general for scikit-learn.

For my model, we were required to show reproducibility, and we run on Kubeflow (so different machines each time); both the RFE and GroupKFold sorts caused changes in the final selected features and hyperparameters.

I thought these may be issues others would run into, so I proposed the fix. If not relevant, then I will just extend the classes with the fix on my end.

adrinjalali commented 1 month ago

Thanks for the context.

RFE and GroupKFold sorts caused changes in the final selected features and hyperparameters.

Wouldn't that mean you shouldn't rely too much on the fact that those particular features are selected? In practice, you could run the same pipeline a few times, changing your random seed or shuffling the data, and look at the variance in your hyperparameters, how often certain features are selected, and whether the features that get swapped are correlated with one another and with other features in some way.
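A rough sketch of that kind of variance analysis, using RandomForestClassifier as a stand-in estimator (the dataset, seed count, and stability threshold here are arbitrary choices for illustration):

```python
from collections import Counter

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

X, y = make_classification(n_features=20, n_informative=5, random_state=0)

counts = Counter()
for seed in range(10):
    # Vary only the random seed; features whose selection flips across
    # seeds should not be treated as robustly "selected".
    rfe = RFE(
        RandomForestClassifier(n_estimators=25, random_state=seed),
        n_features_to_select=5,
        step=5,
    )
    rfe.fit(X, y)
    counts.update(np.flatnonzero(rfe.support_).tolist())

# Features selected in (almost) every run are the stable ones; the rest
# are sensitive to tie-breaking and sampling noise.
stable = sorted(f for f, c in counts.items() if c >= 9)
```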

I personally would probably go down that route instead of trying to make things 100% replicable.

cc @glemaitre @lesteve @adam2392 , I don't think we'd want to add this due to the performance cost, since I'm not sure of the gain here, and the same for the GroupKFold issue.

blitchj commented 1 month ago

Also, adding this means we take a performance hit, because we're moving from quicksort to mergesort. So I wonder if it's worth it.

I was curious what kind of performance hit this would cause, so I ran a couple of experiments (not at all exhaustive, just to get an idea).

The first sorts randomized floating-point arrays of sizes 1e1 through 1e8 (powers-of-10 increments). The second sorts randomized integer arrays with only 2 distinct values, where one value appears with probability 0.99.

First:

import numpy as np
import timeit

for n in range(1, 9):
    print(f"1e{n}")
    # Default argsort: an unstable introsort (quicksort hybrid).
    print("Default")
    np.random.seed(42)
    print(timeit.timeit("np.argsort(x1)", "x1 = np.random.random(10**n)", number=10, globals=globals()))
    # kind='stable': a guaranteed-stable mergesort/radix variant.
    print("Stable")
    np.random.seed(42)
    print(timeit.timeit("np.argsort(x2, kind='stable')", "x2 = np.random.random(10**n)", number=10, globals=globals()))
    print()

results:

1e1
Default
0.000354141928255558
Stable
8.265301585197449e-05

1e2
Default
0.0010631910990923643
Stable
7.989001460373402e-05

1e3
Default
0.0005585530307143927
Stable
0.0004761170130223036

1e4
Default
0.006417972967028618
Stable
0.007736145984381437

1e5
Default
0.08507612603716552
Stable
0.10505266906693578

1e6
Default
1.148862882051617
Stable
1.3665528818964958

1e7
Default
14.92248012800701
Stable
19.03327673696913

1e8
Default
217.04379430809058
Stable
253.50364848808385

Second:

import numpy as np
import timeit

for n in range(1, 9):
    print(f"1e{n}")
    # Same comparison, but on a heavily tied two-valued integer array.
    print("Default")
    np.random.seed(42)
    print(timeit.timeit("np.argsort(x1)", "x1 = np.random.choice(np.arange(2), 10**n, p=[0.99, 0.01])", number=10, globals=globals()))
    print("Stable")
    np.random.seed(42)
    print(timeit.timeit("np.argsort(x2, kind='stable')", "x2 = np.random.choice(np.arange(2), 10**n, p=[0.99, 0.01])", number=10, globals=globals()))
    print()

results:

1e1
Default
0.00016195978969335556
Stable
6.85439445078373e-05

1e2
Default
7.036002352833748e-05
Stable
4.3672043830156326e-05

1e3
Default
7.371511310338974e-05
Stable
4.794588312506676e-05

1e4
Default
0.0008338191546499729
Stable
0.0005149259231984615

1e5
Default
0.009775097016245127
Stable
0.006571172038093209

1e6
Default
0.11906793317757547
Stable
0.08338876394554973

1e7
Default
1.9360376859549433
Stable
1.2638011060189456

1e8
Default
24.253193804994226
Stable
15.49432900804095

RFE sorts the feature importances, so the sort dimension equals the number of features, and ties are most likely at 0 importance (for some algorithms).

GroupKFold sorts the group lengths, which can be as large as the number of rows in the data. I would suspect these are more likely to contain repeated values than feature importances, partly because they are integer-valued, but also because in some use cases all groups are identical in size.
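As a small illustration of why the GroupKFold case is tie-prone (a synthetic grouping, assuming equally sized groups):

```python
import numpy as np

# A common GroupKFold scenario: every group has identical size, so all
# group lengths are tied, and an unstable argsort over them may order
# the groups differently across machines, shifting the fold assignment.
groups = np.repeat(np.arange(6), 10)  # 6 groups of 10 samples each
_, counts = np.unique(groups, return_counts=True)
# counts == [10 10 10 10 10 10]: a fully tied integer array.
order = np.argsort(counts, kind="stable")  # stable: always [0 1 2 3 4 5]
```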

There is definitely a sizable performance hit for large random arrays. But for arrays with few distinct values, the stable sort is faster even at large sizes, and for small lengths (order of magnitude ~100, probably common in RFE) the stable sort is comparable or faster even for randomized floats.

Not sure if this context is useful for making the decision on next steps, but wanted to share.

ogrisel commented 1 month ago

Having reproducible pipelines (at least via an option) can be pragmatically helpful: for caching intermediate results, for instance, but also for making sure you can easily isolate sources of variability that are expected (tied feature importance values used for feature selection) from those that are not (e.g. bugs in custom feature engineering code).

From a statistical point of view however, I agree with Adrin that you should not draw conclusions that are significantly impacted by the way you break ties (e.g. based on the original feature order when using stable-sorting).

One way to proceed could be to add a random_state argument to RFE and (always?) break ties randomly by adding some machine precision level random jitter on the feature importance values such that ties are deterministically broken by a user-settable hyper-parameter. This would then make it easy to run a statistical analysis of the impact of tie-breaking.
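A rough sketch of the jitter idea (jittered_importances is a hypothetical helper, not proposed API; it assumes the gaps between distinct importance values are much larger than machine epsilon):

```python
import numpy as np

def jittered_importances(importances, random_state=None):
    # Hypothetical sketch: add seed-controlled jitter far smaller than
    # any real importance gap, so ties break deterministically for a
    # given random_state while the ranking of non-tied features is
    # unchanged. `importances` must be a float array.
    rng = np.random.RandomState(random_state)
    scale = np.finfo(importances.dtype).eps * max(np.abs(importances).max(), 1.0)
    return importances + rng.uniform(0.0, scale, size=importances.shape)

imp = np.array([0.0, 2.0, 0.0, 1.0])
order = np.argsort(jittered_importances(imp, random_state=0))
# The two zero-importance features now have distinct sort keys, so
# `order` is fully determined by random_state.
```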

We have documented other problems related to our lack of random tie breaking in: