rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[BUG] Sporadic KBinsDiscretizer pytests fail with quantile strategy #2933

Open divyegala opened 3 years ago

divyegala commented 3 years ago

When PR https://github.com/rapidsai/cuml/pull/2932 is merged, it will start marking some KBinsDiscretizer tests as xfail due to sporadic failures observed in CI.

viclafargue commented 3 years ago

Thanks for opening the issue. Since the errors pop up sporadically, I'll record them here for reference.

Fails on: test_kbinsdiscretizer[cudf-quantile-ordinal-5] – cuml.test.test_preprocessing

cuml/test/test_preprocessing.py:579:
E           AssertionError: 
E           Not equal to tolerance rtol=1e-05, atol=1e-05
E           
E           Mismatched elements: 100 / 10000 (1%)
E           Max absolute difference: 1.
E           Max relative difference: 0.25

test_kbinsdiscretizer[cudf-quantile-ordinal-20] – cuml.test.test_preprocessing

cuml/test/test_preprocessing.py:579:
E           AssertionError: 
E           Not equal to tolerance rtol=1e-05, atol=1e-05
E           
E           Mismatched elements: 25 / 10000 (0.25%)
E           Max absolute difference: 1.
E           Max relative difference: 0.05263158

test_kbinsdiscretizer[cudf-quantile-onehot-dense-5] – cuml.test.test_preprocessing

cuml/test/test_preprocessing.py:563:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
cuml/_thirdparty/sklearn/utils/skl_dependencies.py:359: in fit_transform
    return self.fit(X, **fit_params).transform(X)
cuml/_thirdparty/sklearn/preprocessing/_discretization.py:228: in fit
    categories=np.array([np.arange(i) for i in self.n_bins_]),
/opt/conda/envs/rapids/lib/python3.7/site-packages/cupy/_creation/from_data.py:41: in array
    return core.array(obj, dtype, copy, order, subok, ndmin)
cupy/core/core.pyx:2059: in cupy.core.core.array
    ???
cupy/core/core.pyx:2138: in cupy.core.core.array
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
>   ???
E   ValueError: Unsupported dtype object
cupy/core/core.pyx:2210: ValueError

JohnZed commented 3 years ago

@wphicks can look at this after current work on Silhouette score.

wphicks commented 3 years ago

The reported value error comes from an intermittent failure in the percentile calculation here. Occasionally the final element in the percentile array is NaN, and the error propagates from there. Breaking in at that point with a debugger, we see that numpy's percentile call returns the correct value, and moreover that converting the column array to a numpy array and back allows cupy to compute the correct value as well. I am working to determine the root cause, but my leading hunch is uninitialized memory.

wphicks commented 3 years ago

Something has changed in the past week in terms of reproducing this. I am still able to reproduce the issue, but it takes much longer to do so, and I have only seen a failure on the cudf-quantile-onehot-dense versions of this test, whereas before I saw it on at least some other cudf-quantile tests. When I began tracking it, the mean time to failure was around 180 iterations; now it is slightly under 500. I don't have any insight into what has changed or whether it's specific to my system.

I can say with some confidence now that I cannot reproduce this with any other data input format except cudf. I've extracted the sequence of conversions that we perform into a standalone example now and am attempting to reproduce this independently of cuML.

viclafargue commented 3 years ago

Same here, it's really hard to reproduce. I wonder if the disappearance of the ordinal error is linked to the recent change in #3194, which converts cuDF dataframes to CuPy arrays before the call to input_to_cuml_array.

wphicks commented 3 years ago

I have fairly compelling evidence now that the ValueError is a result of some interaction with pytest's fixture code. I have not successfully reproduced the ValueError in over 30K runs after refactoring the test to avoid using the fixture (but using exactly the same code to generate the test data). I still saw the AssertionError once.

wphicks commented 3 years ago

It is worth mentioning that I changed the fixture scope to function rather than session for these tests.
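
A minimal sketch of what that scope change looks like (the fixture name and data shape here are hypothetical, not the actual ones in cuml/test/test_preprocessing.py):

```python
import numpy as np
import pytest

# Hypothetical sketch of the change described above: a function-scoped
# fixture regenerates the test data for every test, instead of sharing one
# cached copy across the whole session as scope="session" would.
@pytest.fixture(scope="function")  # previously scope="session"
def kbins_data():
    rng = np.random.RandomState(42)
    return rng.rand(100, 20)
```

With session scope, every parametrized variant of the test sees the exact same array object, which is one reason a fixture interaction could plausibly matter here.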

wphicks commented 3 years ago

Documenting some miscellaneous findings here:

My leading hypothesis based on the symptoms is a race condition, but I have not been able to further isolate it. I'm still looking into the specific race conditions reported in valid_if_n_kernel.

wphicks commented 3 years ago

I have now confirmed that the ValueError is the result of this cuPy issue and was able to reproduce it independently of cuML. I'll create a workaround.

I still am not certain about the causes of the other observed errors or even if they still apply. I was only able to reproduce them about one in every 100K runs before, and I haven't seen them in quite some time. I'll look into them just a little bit more and see if I can get a reproducer, but otherwise I'm tempted to remove the xfail once the workaround for the percentile issue is in place.
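
For context, the traceback above points at `np.array([np.arange(i) for i in self.n_bins_])`, which produces a ragged, object-dtype array whenever the per-column bin counts differ. A hypothetical sketch of the kind of workaround meant here (not necessarily the patch that landed) is to keep the ragged categories as a plain list instead of stacking them:

```python
import numpy as np

# Hypothetical workaround sketch for the "Unsupported dtype object"
# ValueError: a ragged list of arange results cannot be stacked into one
# numeric array, so keep it as a Python list of per-column arrays.
n_bins_ = np.asarray([5, 3])  # stand-in for the fitted per-column bin counts
categories = [np.arange(int(i)) for i in n_bins_]
```

Avoiding the object-dtype array sidesteps the CuPy limitation entirely, since each element stays a well-typed integer array.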

wphicks commented 3 years ago

Partial fix available here: https://github.com/rapidsai/cuml/pull/3315. This eliminates the ValueError, but I cannot guarantee that the other sporadic failure cases have been addressed.

wphicks commented 3 years ago

I think we're past the window we set for watching for additional failures. @JohnZed Should we close this out or wait until we finalize 0.18 to give it a bit longer to appear?

viclafargue commented 3 years ago

For some reason, it seems that there's still a problem with KBinsDiscretizer: CI log. It will be very hard to debug failures that happen this rarely...

wphicks commented 3 years ago

Using Hypothesis, I was able to find consistent reproducers, either for at least one of the problems tracked by this issue or for a new issue entirely. The simplest one I've found so far is below:

import cupy as cp
import numpy as np
from cuml.common.input_utils import input_to_cupy_array
from cuml.experimental.preprocessing import KBinsDiscretizer as cuKBD
from sklearn.preprocessing import KBinsDiscretizer as skKBD
magic_number = -3 * 2**25
X_np = np.random.rand(5, 2)
X_np[3, 1] = magic_number
X_np[4, 1] = magic_number

print(X_np)
# [[ 5.21711282e-01  3.01647383e-01]
#  [ 9.73775159e-01  3.78120961e-01]
#  [ 5.92897501e-01  6.84050667e-01]
#  [ 2.97061781e-01 -1.00663296e+08]
#  [ 2.09278387e-01 -1.00663296e+08]]

X = input_to_cupy_array(X_np).array
n_bins = 20
encode = 'ordinal'
strategy = 'quantile'
cu_trans = cuKBD(n_bins=n_bins, encode=encode, strategy=strategy)
sk_trans = skKBD(n_bins=n_bins, encode=encode, strategy=strategy)
t_X = cu_trans.fit_transform(X)
print(t_X)
# [[10  5]
#  [19 10]
#  [15 14]
#  [ 5  0]
#  [ 0  0]]
skt_X = sk_trans.fit_transform(X_np)
print(skt_X)
# [[10.  6.]
#  [19. 11.]
#  [15. 15.]
#  [ 5.  1.]
#  [ 0.  1.]]

As demonstrated by the random values in the input, the rest of the matrix does not seem to matter so long as this precise number appears at least twice (as far as I can tell). Tweaking even a single digit of this magic value makes the error disappear. This is not the only magic value that Hypothesis has found so far, but it is the first I've found that reproduces on such a small input matrix.

This example obviously does not need the full 20 bins that it is asked to use. Hypothesis also found some much larger (500x20) inputs that reproduced this, though I am not sure if they still had too many bins.
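
One quick sanity check worth recording: the magic value is exactly representable in float64, so the input itself is not losing precision on the way in. This is a small verification sketch, not part of the reproducer above:

```python
import numpy as np

# Sanity check on the magic value: -3 * 2**25 is an integer well within
# float64's exact range, so the input is stored exactly; whatever goes
# wrong must happen later, inside the quantile computation itself.
magic = -3 * 2**25
assert magic == -100663296
assert float(np.float64(magic)) == magic
```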

The investigation continues...

wphicks commented 3 years ago

I have some evidence now that this is the result of a numpy bug. The following prints False, meaning the first two bin edges come out in the wrong order:

import numpy as np
arr = np.array([-3*2**25, 1.2, -3*2**25, 2.1, 3.5])
quantiles = np.linspace(0, 100, 21)
percentile = np.percentile(arr, quantiles, interpolation='linear')
print(percentile[0] <= percentile[1])

If we follow the sklearn KBinsDiscretizer code, we can see that the intermediate results diverge from ours at precisely this calculation and ultimately produce output like that shown above. I'm digging into numpy now to see if I can understand precisely where the error emerges in the percentile calculation.
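
A defensive workaround sketch (not necessarily what numpy or cuML ended up doing) is to clamp the quantile-derived bin edges to be non-decreasing before using them:

```python
import numpy as np

# Sketch of a defensive workaround: force the quantile-derived bin edges
# to be non-decreasing with a running maximum. On a correct numpy this is
# a no-op; on an affected version it masks the non-monotonic output.
arr = np.array([-3 * 2**25, 1.2, -3 * 2**25, 2.1, 3.5])
edges = np.percentile(arr, np.linspace(0, 100, 21))
edges = np.maximum.accumulate(edges)
assert np.all(np.diff(edges) >= 0)
```

This only masks the symptom, of course; the actual fix belongs in the percentile interpolation itself.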

wphicks commented 3 years ago

Probably related: https://github.com/scikit-learn/scikit-learn/issues/13194 and https://github.com/numpy/numpy/issues/10373

wphicks commented 3 years ago

Fixed by https://github.com/numpy/numpy/pull/16273, which is part of numpy 1.20.0. As of 2 days ago, 1.20.0 became available on conda-forge, and it is now the default version installed in our environments. I'm going to do some further testing to see if there is some other issue which is also cropping up here, but I'm beginning to have some hope that this has been resolved.

wphicks commented 3 years ago

There is still an issue because of https://github.com/cupy/cupy/issues/4607, which presents the exact same problem as in numpy but for different "magic" input values. I'm going to put in a quick workaround and submit a fix to cupy.

wphicks commented 3 years ago

The "quick workaround" was not actually correct, but I think there's a messier workaround available. I'm putting in a PR to xfail the test, since we're about to hit burndown, and will then try to figure out a better approach while the fix works its way into a version of cupy that we can use.

wphicks commented 3 years ago

The plan now is to wait for https://github.com/cupy/cupy/pull/4617 to get in before removing the xfail again. I'll do a couple tests in the meantime to make sure there's not another issue lurking somewhere in here, but cupy.percentile definitely needs to get updated for us to match sklearn output consistently.

JohnZed commented 3 years ago

@wphicks similar to the other issue (#3481), maybe this one is actually closed by cupy fix?

wphicks commented 3 years ago

Yep, with https://github.com/rapidsai/integration/pull/230 in place, I think it's worth removing the xfail and seeing if we see it again. I was not 100% certain that the cupy issue was the only problem here, but I'm confident enough that I think we should give it a go.

viclafargue commented 3 years ago

Reopening the issue, as new test failures were observed with the quantile strategy.