divyegala opened this issue 3 years ago
Thanks for opening the issue. Since the errors pop up only sporadically, I'll record them here for reference.
Fails on: test_kbinsdiscretizer[cudf-quantile-ordinal-5] – cuml.test.test_preprocessing
cuml/test/test_preprocessing.py:579:
E AssertionError:
E Not equal to tolerance rtol=1e-05, atol=1e-05
E
E Mismatched elements: 100 / 10000 (1%)
E Max absolute difference: 1.
E Max relative difference: 0.25
test_kbinsdiscretizer[cudf-quantile-ordinal-20] – cuml.test.test_preprocessing
cuml/test/test_preprocessing.py:579:
E AssertionError:
E Not equal to tolerance rtol=1e-05, atol=1e-05
E
E Mismatched elements: 25 / 10000 (0.25%)
E Max absolute difference: 1.
E Max relative difference: 0.05263158
test_kbinsdiscretizer[cudf-quantile-onehot-dense-5] – cuml.test.test_preprocessing
cuml/test/test_preprocessing.py:563:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
cuml/_thirdparty/sklearn/utils/skl_dependencies.py:359: in fit_transform
return self.fit(X, **fit_params).transform(X)
cuml/_thirdparty/sklearn/preprocessing/_discretization.py:228: in fit
categories=np.array([np.arange(i) for i in self.n_bins_]),
/opt/conda/envs/rapids/lib/python3.7/site-packages/cupy/_creation/from_data.py:41: in array
return core.array(obj, dtype, copy, order, subok, ndmin)
cupy/core/core.pyx:2059: in cupy.core.core.array
???
cupy/core/core.pyx:2138: in cupy.core.core.array
???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> ???
E ValueError: Unsupported dtype object
cupy/core/core.pyx:2210: ValueError
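For context on that traceback: in _discretization.py, the np alias resolves to CuPy, and the failing line builds an array from a list of per-column np.arange results. My working assumption about the mechanism (hedged, and consistent with the NaN explanation below) is that when a NaN corrupts the bin edges for one column, the per-column bin counts in n_bins_ drift apart, the list of arange arrays becomes ragged, and CuPy rejects the resulting object-dtype conversion. A minimal illustrative sketch:
import cupy as cp

# Hypothetical reproducer of the error above: a ragged list of per-column
# bin-index arrays cannot be stacked into one CuPy array, so the conversion
# falls back to dtype=object and CuPy raises
# "ValueError: Unsupported dtype object".
n_bins_ = [5, 4]  # assumed per-column bin counts that have drifted apart
cp.array([cp.arange(i) for i in n_bins_])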
@wphicks can look at this after his current work on the Silhouette score.
The reported ValueError comes from an intermittent failure in the percentile calculation here. Occasionally the final element in the percentile array is NaN, and the error propagates from there. Breaking in at that point with a debugger, we see that numpy's percentile call returns the correct value, and moreover that converting the column array to a numpy array and back allows cupy to compute the correct value as well. I am working to determine the root cause, but my leading hunch is uninitialized memory.
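A rough sketch of that debugger check (names like column and quantiles are illustrative stand-ins for the values at the failing call site):
import cupy as cp
import numpy as np

def compare_percentiles(column, quantiles):
    # Host-side computation: consistently correct.
    host = np.percentile(cp.asnumpy(column), quantiles)
    # Round trip through host memory and back to the device: also correct.
    round_trip = cp.percentile(cp.asarray(cp.asnumpy(column)), quantiles)
    # Direct device computation: sporadically ends in NaN.
    device = cp.percentile(column, quantiles)
    return host, round_trip, device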
Something has changed in the past week in terms of reproducing this. I am still able to reproduce the issue, but it takes much longer to do so, and I have only seen a failure on the cudf-quantile-onehot-dense versions of this test, whereas before I saw it on at least some other cudf-quantile tests. When I began tracking it, the mean time to failure was around 180 iterations; now it is slightly under 500. I don't have any insight as to what has changed or whether it's specific to my system.
I can say with some confidence now that I cannot reproduce this with any input data format other than cuDF. I've extracted the sequence of conversions that we perform into a standalone example and am attempting to reproduce this independently of cuML.
Same, it's really hard to reproduce. I wonder whether the disappearance of the ordinal error is linked to the recent change in #3194 that converts cuDF DataFrames to CuPy arrays before the call to input_to_cuml_array.
I have fairly compelling evidence now that the ValueError is a result of some interaction with pytest's fixture code. I have not successfully reproduced the ValueError in over 30K runs after refactoring the test to avoid using the fixture (but using exactly the same code to generate the test data). I still saw the AssertionError once.
It is worth mentioning that I changed the fixture scope to function rather than session for these tests.
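For reference, the scope change amounts to something like the following (a hypothetical sketch; the real fixture builds the actual test data):
import numpy as np
import pytest

# A function-scoped fixture regenerates the data for every test invocation;
# the previous session scope shared one instance across the whole session.
@pytest.fixture(scope="function")  # was scope="session"
def random_dataset():
    rng = np.random.RandomState(42)
    return rng.rand(100, 100)  # stand-in for the real data generation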
Documenting some miscellaneous findings here:
- Race conditions have been reported in valid_if_n_kernel.
- An rmm::bad_alloc is sometimes observed as well.
- Breaking in at the percentile call when it returns a NaN, we generally (but not always) can repeatedly call cp.percentile on the input column and get the NaN every time. Sometimes, however, we can repeatedly call cp.percentile on the column that just produced the NaN and never get a NaN in the output (a probe sketch follows below).
- Examining the column at that point, we see an array of all zeroes. This may be a debugger artifact rather than a real issue.
- cp.sum on the column input to cp.percentile never yields a NaN.

My leading hypothesis based on the symptoms is a race condition, but I have not been able to further isolate it. I'm still looking into the specific race conditions reported in valid_if_n_kernel.
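A sketch of the kind of probe used for the cp.percentile observations above (illustrative only):
import cupy as cp

def probe_percentile(column, quantiles, n_trials=1000):
    # Re-run the same percentile call and count how often a NaN appears;
    # per the findings above, the result is sometimes sticky (always NaN)
    # and sometimes never reproduces again.
    nan_count = 0
    for _ in range(n_trials):
        result = cp.percentile(column, quantiles)
        nan_count += int(bool(cp.isnan(result).any()))
    return nan_count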
I have now confirmed that the ValueError is the result of this cuPy issue and was able to reproduce it independently of cuML. I'll create a workaround.
I still am not certain about the causes of the other observed errors or even whether they still apply. I was only able to reproduce them about one in every 100K runs before, and I haven't seen them in quite some time. I'll look into them just a little bit more and see if I can get a reproducer, but otherwise I'm tempted to remove the xfail once the workaround for the percentile issue is in place.
Partial fix available here: https://github.com/rapidsai/cuml/pull/3315. This eliminates the ValueError, but I cannot guarantee that the other sporadic failure cases have been addressed.
I think we're past the window we set for watching for additional failures. @JohnZed Should we close this out or wait until we finalize 0.18 to give it a bit longer to appear?
For some reason, it seems that there's still a problem with KBinsDiscretizer: CI log. It will be very hard to debug failures that happen this rarely...
Using Hypothesis, I was able to find either some consistent reproducers of at least one of the problems represented by this issue or else consistent reproducers for a new issue. I'll give the simplest one that I've found so far below:
import cupy as cp
import numpy as np
from cuml.common.input_utils import input_to_cupy_array
from cuml.experimental.preprocessing import KBinsDiscretizer as cuKBD
from sklearn.preprocessing import KBinsDiscretizer as skKBD
magic_number = -3 * 2**25
X_np = np.random.rand(5, 2)
X_np[3, 1] = magic_number
X_np[4, 1] = magic_number
print(X_np)
# [[ 5.21711282e-01 3.01647383e-01]
# [ 9.73775159e-01 3.78120961e-01]
# [ 5.92897501e-01 6.84050667e-01]
# [ 2.97061781e-01 -1.00663296e+08]
# [ 2.09278387e-01 -1.00663296e+08]]
X = input_to_cupy_array(X_np).array
n_bins = 20
encode = 'ordinal'
strategy = 'quantile'
cu_trans = cuKBD(n_bins=n_bins, encode=encode, strategy=strategy)
sk_trans = skKBD(n_bins=n_bins, encode=encode, strategy=strategy)
t_X = cu_trans.fit_transform(X)
print(t_X)
# [[10 5]
# [19 10]
# [15 14]
# [ 5 0]
# [ 0 0]]
skt_X = sk_trans.fit_transform(X_np)
print(skt_X)
# [[10. 6.]
# [19. 11.]
# [15. 15.]
# [ 5. 1.]
# [ 0. 1.]]
As demonstrated by the random values in the input, the rest of the matrix does not seem to matter so long as this precise number appears (as far as I can tell) at least twice. Tweaking even a single digit of this magic value causes the error to disappear. This is not the only magic value that Hypothesis has found thus far, but it is the first one I've found that reproduces on such a small input matrix.
This example obviously does not need the full 20 bins that it is asked to use. Hypothesis also found some much larger (500x20) inputs that reproduced this, though I am not sure if they still had too many bins.
The investigation continues...
I have some evidence now that this is the result of a numpy bug. The following returns False:
import numpy as np
arr = np.array([-3*2**25, 1.2, -3*2**25, 2.1, 3.5])
quantiles = np.linspace(0, 100, 21)
percentile = np.percentile(arr, quantiles, interpolation='linear')
print(percentile[0] <= percentile[1])
If we follow the sklearn KBinsDiscretizer code, we can see that the intermediate results diverge from ours at precisely this calculation and ultimately produce output like that shown above. I'm digging into numpy now to see if I can understand precisely where the error emerges in the percentile calculation.
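To make the connection concrete, here is a rough approximation of the quantile-strategy binning path (not sklearn's exact code): bin edges come from np.percentile, and each value is then placed relative to those edges, so non-monotonic edges shift the resulting codes in exactly the off-by-one way shown in the comparison above.
import numpy as np

arr = np.array([-3 * 2**25, 1.2, -3 * 2**25, 2.1, 3.5])
# Quantile strategy: bin edges are percentiles of the column.
edges = np.percentile(arr, np.linspace(0, 100, 21))
print(np.all(np.diff(edges) >= 0))  # False on affected numpy versions
# Binning against non-monotonic edges misplaces values near the bad edge.
print(np.searchsorted(edges[1:-1], arr, side='right'))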
Fixed by https://github.com/numpy/numpy/pull/16273, which is part of numpy 1.20.0. As of 2 days ago, 1.20.0 became available on conda-forge, and it is now the default version installed in our environments. I'm going to do some further testing to see if there is some other issue which is also cropping up here, but I'm beginning to have some hope that this has been resolved.
There is still an issue because of https://github.com/cupy/cupy/issues/4607, which presents the exact same problem as in numpy but for different "magic" input values. I'm going to put in a quick workaround and submit a fix to cupy.
The "quick workaround" was not actually correct, but I think there's a messier workaround available. I'm putting in a PR to xfail the test since we're about to hit burndown and then try to figure out a better approach until the fix works its way into a version of cupy that we can use.
The plan now is to wait for https://github.com/cupy/cupy/pull/4617 to get in before removing the xfail again. I'll do a couple of tests in the meantime to make sure there's not another issue lurking somewhere in here, but cupy.percentile definitely needs to get updated for us to match sklearn output consistently.
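Until then, one conceivable stopgap (an assumption on my part, not necessarily the workaround referenced earlier) is to restore non-decreasing bin edges after the percentile call:
import cupy as cp

def monotonic_bin_edges(column, quantiles):
    edges = cp.percentile(column, quantiles)
    # Sorting restores a non-decreasing sequence when the percentile output
    # contains a local inversion like the one demonstrated above.
    return cp.sort(edges)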
@wphicks similar to the other issue (#3481), maybe this one is actually closed by cupy fix?
Yep, with https://github.com/rapidsai/integration/pull/230 in place, I think it's worth removing the xfail and seeing if we see it again. I was not 100% certain that the cupy issue was the only problem here, but I'm confident enough that I think we should give it a go.
Reopening the issue, as new test failures were observed with the quantile strategy.
When this PR is merged (https://github.com/rapidsai/cuml/pull/2932), it will start marking the KBinsDiscretizer tests as xfail due to sporadic failures observed in CI.