rapidsai / raft

RAFT contains fundamental widely-used algorithms and primitives for machine learning and information retrieval. The algorithms are CUDA-accelerated and form building blocks for more easily writing high performance applications.
https://docs.rapids.ai/api/raft/stable/
Apache License 2.0

[BUG] cuSparse CUDA 12.2 library breaks cuGraph spectral clustering test #2186

Open bdice opened 5 months ago

bdice commented 5 months ago

This is a tracking issue for a bug observed in cuGraph PR 4088, which adds CUDA 12.2 support.

The error log looks like this:

22/28 Test #20: CAPI_LEGACY_SPECTRAL_TEST .................***Failed    1.54 sec
'./../../..//bin/gtests/libcugraph_c/CAPI_LEGACY_SPECTRAL_TEST'
RUNNING: test_spectral...done (1.000000 seconds). - passed
RUNNING: test_balanced_cut_equal_weight...ASSERTION FAILED: cluster results don't match
done (0.000000 seconds). - FAILED
RUNNING: test_balanced_cut_unequal_weight...ASSERTION FAILED: cluster results don't match
done (0.000000 seconds). - FAILED
RUNNING: test_balanced_cut_no_weight...ASSERTION FAILED: cluster results don't match
done (0.000000 seconds). - FAILED
CMake Error at run_gpu_test.cmake:34 (execute_process):

This occurs due to a known bug in cuSparse which will be fixed in a future CUDA Toolkit version.

The following RAFT PRs are related in some way to this bug (attempted fixes, accidental reversions, adding tests, etc.):

#2185
#2184
#2179
#2173
#2124
#2117

Until the bug is fixed, I will attempt to disable the failing tests in cuGraph.

cc: @cjnolet @ChuckHastings @mfoerste4 @trxcllnt @jakirkham @jameslamb

cjnolet commented 5 months ago

I think “bug is fixed” is a pretty loaded term at this point. The bug is fixed on the cuSparse side, but the fix won’t be in CUDA 12.2. On the RAFT side we aren’t really fixing a bug, we are working around it, and I think both Paul and I are still scratching our heads at why the code we added even fixed the issue to begin with (which is making it even harder to find the fix this time around).

On Feb 14, 2024, at 4:22 PM, Bradley Dice wrote: This is a tracking issue for a bug observed in cuGraph PR 4088, which adds CUDA 12.2 support. [...]