[FEA] Decide what versions of cupy to support

vyasr commented 1 year ago

This issue is essentially a cupy version of #12398 that I am raising as a result of needing #12883 to update cudf for supporting recent changes to cupy. We currently claim to support cupy 9.5-11.*, but our test suite will have some failures when run using cupy < 11.6. We should determine what an appropriate support matrix here is.

vyasr commented 1 year ago

@bdice @wence- thoughts on what versions you would like to see supported here?

CC @shwina

bdice commented 1 year ago

If our test suite requires cupy >=11.6, I think that's what we should pin to. If we want to extend support to 11.x then we should make our tests pass with some set of earlier 11.x versions. We should not allow dependency versions that do not pass our test suite.

bdice commented 1 year ago

I would recommend only supporting very recent versions of cupy, which would mean pinning to >=11.6. We rely heavily on the library and its cutting edge features, for both platform support (e.g. CUDA 12, sm90) and API support (e.g. dlpack).

vyasr commented 1 year ago

Definitely agree that we shouldn't allow versions that don't pass our test suite. Requiring a cupy version that was released within the month feels rather onerous though. We really don't use much in the way of "cutting edge features" from cupy. I'm fairly certain the list of examples you gave is exhaustive, with the small exception of some small changes to RNGs that we use in sampling. RAPIDS is going to be supporting both CUDA 11 and CUDA 12 for the foreseeable future (once we roll out CUDA 12 support) so it would be odd to use CUDA 12 or recent hardware as a reason to drop support for older cupy versions. IMO if tests pass on 11.x with an older cupy version on older hardware/CUDA we should continue to allow its usage on CUDA 11 systems. I'm certainly fine dropping any version that requires us to make modifications to our tests to get them to pass now, though.

wence- commented 1 year ago

So CUDA 12 support arrived (with a jit compilation slowness that was fixed in 11.6) in 11.5; Support for CUDA 11.8 (and H100s) arrived in 11.3. If we intend to continue testing on CUDA 11 and CUDA 12 for the foreseeable, which if we're supporting both we should be doing, then I'd be inclined to pick something around the 11.3 mark for oldest supported version.

FWIW, locally with 11.5 I see these failures:

FAILED python/cudf/cudf/tests/test_orc.py::test_orc_write_statistics[1-STRIPE] - AssertionError: assert False
FAILED python/cudf/cudf/tests/test_orc.py::test_orc_write_statistics[1-ROWGROUP] - AssertionError: assert False
FAILED python/cudf/cudf/tests/test_orc.py::test_orc_write_statistics[100-STRIPE] - AssertionError: assert False
FAILED python/cudf/cudf/tests/test_orc.py::test_orc_write_statistics[100-ROWGROUP] - AssertionError: assert False
FAILED python/cudf/cudf/tests/test_orc.py::test_orc_write_statistics[6000000-STRIPE] - AssertionError: assert False
FAILED python/cudf/cudf/tests/test_orc.py::test_orc_write_statistics[6000000-ROWGROUP] - AssertionError: assert False
FAILED python/cudf/cudf/tests/test_orc.py::test_orc_chunked_write_statistics[2-STRIPE] - AssertionError: assert False
FAILED python/cudf/cudf/tests/test_orc.py::test_orc_chunked_write_statistics[2-ROWGROUP] - AssertionError: assert False
FAILED python/cudf/cudf/tests/test_orc.py::test_orc_chunked_write_statistics[100-STRIPE] - AssertionError: assert False
FAILED python/cudf/cudf/tests/test_orc.py::test_orc_chunked_write_statistics[100-ROWGROUP] - AssertionError: assert False
FAILED python/cudf/cudf/tests/test_orc.py::test_orc_chunked_write_statistics[6000000-STRIPE] - AssertionError: assert False
FAILED python/cudf/cudf/tests/test_orc.py::test_orc_chunked_write_statistics[6000000-ROWGROUP] - AssertionError: assert False
FAILED python/cudf/cudf/tests/test_orc.py::test_nanoseconds_overflow - AssertionError: numpy array are different
FAILED python/cudf/cudf/tests/test_orc.py::test_writer_timestamp_stream_size - AssertionError: numpy array are different
FAILED python/cudf/cudf/tests/test_orc.py::test_orc_writer_negative_timestamp - AssertionError: numpy array are different

But I also see these failures with 11.6 (so I don't think this is a cupy thing?)

There is a further difficulty here which is that our test suite always runs with whatever the most recently available cupy is (subject to upper-bound constraints). So someone could introduce functionality that breaks in older versions, but we would never notice until a bug comes in. I think this speaks to the broader question of how to do testing and version requirements (similar to the pynvml discussions we're having elsewhere).

wence- commented 1 year ago

FWIW, locally with 11.5 I see these failures:

These same failures appear with 11.3 (and are the only ones). So if these are red herrings then I think a lower bound of 11.3 would be fine.

shwina commented 1 year ago

I believe these are red herrings, and have to do with the locale.

bdice commented 1 year ago

See #7314. Those test failures are consistently seen in rapids-compose Docker environments and can be safely ignored for this purpose (but the underlying bug remains somewhat of a mystery).

vyasr commented 1 year ago

The ORC failures are indeed red herrings due to locale and are most likely occurring because @wence- is developing using rapids-compose, see #7314. The failures to be concerned about from #12883 are the ones around ufuncs or dlpack.

@wence- I'm not sure how you're not seeing any other failures with 11.3 though. Wasn't #12883 correcting for a change only present in 11.6? Wouldn't we still need to condition that test?

wence- commented 1 year ago

@wence- I'm not sure how you're not seeing any other failures with 11.3 though. Wasn't #12883 correcting for a change only present in 11.6? Wouldn't we still need to condition that test?

I hadn't pulled in #12883. Rebuilding the environment with gcc-11, I'll update in an hour or two...

vyasr commented 1 year ago

Without #12883 I would have expected you to see the dlpack failures with cupy 11.6

wence- commented 1 year ago

Without #12883 I would have expected you to see the dlpack failures with cupy 11.6

I have yet to run with 11.6 (my environment had 11.5).

vyasr commented 1 year ago

As of #13284 we require cupy>=12.0.0.

rapidsai / cudf

[FEA] Decide what versions of cupy to support #12884