Open blitchj opened 1 month ago
See also: #29495 and #29496 for discussion on the np.argsort
inconsistency between machines.
Also, adding this means we take a performance hit because we're moving from quicksort to mergesort. So I wonder if it's worth it.
> Also, adding this means we take a performance hit because we're moving from quicksort to mergesort. So I wonder if it's worth it.
At the scale of the number of input features I'm not sure how big the difference is, so my suggestion may just take even more time.
But could an unstable sort run first, and if there are ties at the threshold, switch to a stable sort, possibly stable-sorting only the tied region?
Worst case, with lots of big ties, this is obviously much worse. But in the common case, with no ties in the relevant region, it may be better?
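A minimal sketch of that hybrid idea (hypothetical helper, not sklearn code), assuming only the bottom `k` entries need to be deterministic:

```python
import numpy as np

def argsort_stable_at_threshold(x, k):
    """Unstable argsort, then a stable fix-up of the region tied with the
    value sitting at the elimination threshold (position k-1)."""
    x = np.asarray(x)
    order = np.argsort(x)  # fast, potentially unstable
    thresh = x[order[k - 1]]  # value at the cut
    tied = np.flatnonzero(x[order] == thresh)  # tied block is contiguous in sorted order
    lo, hi = tied[0], tied[-1] + 1
    # within the tied block all values are equal, so sorting the indices
    # ascending reproduces the stable (original-order) tie-break
    order[lo:hi] = np.sort(order[lo:hi])
    return order
```

With `x = [0, 2, 0, 1]` and `k = 1`, the two zero-importance features (indices 0 and 2) are tied at the cut, and the fix-up guarantees index 0 comes out first regardless of which one the unstable sort happened to place earlier.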
Another thought on this: I think getting consistent results on different machines with the same versions is pretty important. But a stable sort is not the only or best way to achieve it. While in the GroupKFold case it may not make sense to add an option, for RFE what if there were an `on_ties` option? I could imagine at least 3 options. For example, `np.arange` could be used as the second key to an `np.lexsort` instead of `argsort`.

So in general, we do not guarantee the same results between machines. Different machines, for a broad range of reasons, will give different results.
So my question is, why is this particular one so important, given that we'd get a performance hit as a result? The same goes for the other issues you've opened.
Unless we have a very clear motivation, I think we can close them as "won't fix", since we are already giving different results across machines anyway.
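For concreteness, the `np.lexsort` suggestion above (index array as a secondary key) reproduces a stable order deterministically; this is just a sketch of that idea:

```python
import numpy as np

importances = np.array([0.0, 2.0, 0.0, 1.0])
# lexsort treats the last key as the primary one, so importances drives the
# sort and the index array breaks ties by original position
order = np.lexsort((np.arange(len(importances)), importances))
print(order)  # [0 2 3 1]
```

The result is identical to `np.argsort(importances, kind="stable")`, so it trades the same performance cost for the same determinism.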
> So in general, we do not guarantee the same results between machines. Different machines, for a broad range of reasons, will give different results. So my question is: why is this particular one so important, given that we'd get a performance hit as a result? The same goes for the other issues you've opened. Unless we have a very clear motivation, I think we can close them as "won't fix", since we are already giving different results across machines anyway.
I see, I wasn't familiar with this lack of consistency between machines in general for scikit-learn.
For my model we were required to show reproducibility, and we run on Kubeflow (so different machines each time); both the RFE and GroupKFold sorts caused changes in the final selected features and hyperparameters.
I thought these may be issues others would run into, so I proposed the fix. If not relevant, then I will just extend the classes with the fix on my end.
Thanks for the context.
> RFE and GroupKFold sorts caused changes in the final selected features and hyperparameters.
Wouldn't that mean you shouldn't rely too much on the fact that those particular features are selected? In practice, you could run the same pipeline a few times, changing your random seed or shuffling the data, and look at the variance in your hyperparameters, how often certain features are selected, and whether the features being swapped are correlated with one another and with other features in some way.
I personally would probably go down that route instead of trying to make things 100% replicable.
cc @glemaitre @lesteve @adam2392 , I don't think we'd want to add this due to the performance cost, since I'm not sure of the gain here, and the same for the GroupKFold issue.
> Also, adding this means we take a performance hit because we're moving from quicksort to mergesort. So I wonder if it's worth it.
I was curious what kind of performance hit this would cause, so I ran a couple of experiments (not at all exhaustive, just to get an idea).

The first sorts randomized floating-point arrays of sizes 10 through 100 million, in power-of-10 increments. The second sorts randomized int arrays with only 2 distinct values, one of which appears with probability 0.99.

First:
```python
import numpy as np
import timeit

for n in range(1, 9):
    print(f"1e{n}")
    print("Default")
    np.random.seed(42)
    print(timeit.timeit("np.argsort(x1)", "x1 = np.random.random(10**n)",
                        number=10, globals=globals()))
    print("Stable")
    np.random.seed(42)
    print(timeit.timeit("np.argsort(x2, kind='stable')", "x2 = np.random.random(10**n)",
                        number=10, globals=globals()))
    print()
```
results:

| size | default (s) | stable (s) |
| --- | --- | --- |
| 1e1 | 0.000354141928255558 | 8.265301585197449e-05 |
| 1e2 | 0.0010631910990923643 | 7.989001460373402e-05 |
| 1e3 | 0.0005585530307143927 | 0.0004761170130223036 |
| 1e4 | 0.006417972967028618 | 0.007736145984381437 |
| 1e5 | 0.08507612603716552 | 0.10505266906693578 |
| 1e6 | 1.148862882051617 | 1.3665528818964958 |
| 1e7 | 14.92248012800701 | 19.03327673696913 |
| 1e8 | 217.04379430809058 | 253.50364848808385 |
Second:
```python
import numpy as np
import timeit

for n in range(1, 9):
    print(f"1e{n}")
    print("Default")
    np.random.seed(42)
    print(timeit.timeit("np.argsort(x1)", "x1 = np.random.choice(np.arange(2), 10**n, p=[0.99, 0.01])",
                        number=10, globals=globals()))
    print("Stable")
    np.random.seed(42)
    print(timeit.timeit("np.argsort(x2, kind='stable')", "x2 = np.random.choice(np.arange(2), 10**n, p=[0.99, 0.01])",
                        number=10, globals=globals()))
    print()
```
results:

| size | default (s) | stable (s) |
| --- | --- | --- |
| 1e1 | 0.00016195978969335556 | 6.85439445078373e-05 |
| 1e2 | 7.036002352833748e-05 | 4.3672043830156326e-05 |
| 1e3 | 7.371511310338974e-05 | 4.794588312506676e-05 |
| 1e4 | 0.0008338191546499729 | 0.0005149259231984615 |
| 1e5 | 0.009775097016245127 | 0.006571172038093209 |
| 1e6 | 0.11906793317757547 | 0.08338876394554973 |
| 1e7 | 1.9360376859549433 | 1.2638011060189456 |
| 1e8 | 24.253193804994226 | 15.49432900804095 |
RFE sorts the feature importances, so the sort length equals the number of features, and ties are most likely at 0 importance (for some algorithms).
GroupKFold sorts the group lengths, and there can be up to as many groups as rows in the data. I would suspect group lengths are more likely to contain repeated values than feature importances, if only because they are integer-valued, but also because in some use cases all groups are the same size.
There is definitely a sizable performance hit for large random arrays, but for arrays with few distinct values the stable sort is faster even at large sizes. For small lengths (likely common in RFE, with feature counts on the order of ~100), the stable sort is comparable or faster even for randomized floats.
Not sure if this context is useful for making the decision on next steps, but wanted to share.
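To illustrate why the group-length sort matters, here is a rough sketch of a greedy balancing strategy in the spirit of what GroupKFold does (a hypothetical simplification, not sklearn's actual code): groups are sorted by size, then each group is assigned to the currently smallest fold, so the order in which equal-sized groups come out of the sort changes the fold assignment.

```python
import numpy as np

def assign_groups(group_sizes, n_folds):
    """Greedy balancing: biggest groups first, each into the lightest fold."""
    group_sizes = np.asarray(group_sizes)
    order = np.argsort(group_sizes, kind="stable")[::-1]  # tie order matters here
    fold_sizes = np.zeros(n_folds, dtype=int)
    fold_of = np.empty(len(group_sizes), dtype=int)
    for g in order:
        f = int(np.argmin(fold_sizes))  # currently lightest fold
        fold_sizes[f] += group_sizes[g]
        fold_of[g] = f
    return fold_of
```

With four groups of sizes `[3, 3, 2, 2]` and two folds, the folds end up perfectly balanced either way, but swapping the order of the two size-3 groups swaps which fold each one lands in; that is exactly the kind of cross-machine difference an unstable sort can produce.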
Having reproducible pipelines (at least via an option) can be pragmatically helpful, for instance for caching, but also to make sure that you can easily isolate sources of variability that are expected (tied feature importance values used for feature selection) from those that are not (e.g. bugs in custom feature engineering code).
From a statistical point of view however, I agree with Adrin that you should not draw conclusions that are significantly impacted by the way you break ties (e.g. based on the original feature order when using stable-sorting).
One way to proceed could be to add a `random_state` argument to `RFE` and (always?) break ties randomly by adding machine-precision-level random jitter to the feature importance values, so that ties are broken deterministically by a user-settable hyperparameter. This would then make it easy to run a statistical analysis of the impact of tie-breaking.
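A rough sketch of that proposal (a hypothetical helper, not an existing `RFE` parameter): rather than literal floating-point jitter, this variant uses a seeded random permutation as the secondary key of an `np.lexsort`, which breaks exact ties randomly but reproducibly, with no risk of the jitter being rounded away:

```python
import numpy as np

def argsort_random_ties(importances, random_state=None):
    """Argsort by importance, breaking exact ties in a seeded random order."""
    importances = np.asarray(importances)
    rng = np.random.default_rng(random_state)
    tiebreak = rng.permutation(len(importances))
    # lexsort uses the last key as the primary one, so importances drives
    # the sort and the random permutation only decides among exact ties
    return np.lexsort((tiebreak, importances))
```

Running it twice with the same `random_state` gives identical rankings; varying the seed across runs lets you measure how much the selected features depend on tie-breaking.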
We have documented other problems related to our lack of random tie breaking in:
### Describe the bug
RFE uses `np.argsort` on the feature importances from the estimator, and this is not repeatable across machines. It only matters when there are ties in the feature importances that overlap with the elimination threshold. For example, if the importances are `[0, 2, 0, 1]` and one feature needs to be eliminated, the same feature may not be chosen on all machines.
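To make the example concrete, here is what a stable sort guarantees for those importances (a small illustration, assuming one feature is eliminated per step):

```python
import numpy as np

importances = np.array([0, 2, 0, 1])
# a stable sort keeps tied entries (indices 0 and 2, both importance 0)
# in their original order, so the elimination is deterministic
order = np.argsort(importances, kind="stable")
print(order)     # [0 2 3 1]
print(order[0])  # feature 0 is eliminated first on every machine
# with the default kind, whether index 0 or 2 comes first is unspecified
```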
### Steps/Code to Reproduce
It is difficult to write code that demonstrates this, since it requires multiple machines to test; however, a test could be added to the `_rfe` unit tests to check whether the sort is stable. Note that the sort could be consistent across machines without being stable, so such a test is not enough to prove an issue.
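One possible shape for such a check (a hypothetical helper, not an existing sklearn test): verify that an ordering both sorts the values and breaks ties by original index. As noted, passing it does not prove cross-machine consistency; it only detects instability on the machine running the test.

```python
import numpy as np

def is_stable_argsort(x, order):
    """True iff `order` sorts `x` and equal values keep original index order."""
    x = np.asarray(x)
    order = np.asarray(order)
    values = x[order]
    if np.any(values[:-1] > values[1:]):
        return False  # not even a valid sort
    # for each adjacent pair of equal values, indices must be increasing
    ties = values[:-1] == values[1:]
    return bool(np.all(order[:-1][ties] < order[1:][ties]))
```

`is_stable_argsort(x, np.argsort(x, kind="stable"))` always holds, while the default kind may fail the check on arrays with many ties.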
### Expected Results

Expected to get the same results on all machines; stability is not strictly necessary, though it is an easy way to enforce testable consistency.
### Actual Results

The current sort is not stable and fails such a stability test.
### Versions
This issue is most likely to occur with zero-importance features, especially in the initial steps of RFE.