rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

Remove NumPy <2 pin #6031

Closed: seberg closed this PR 2 months ago

seberg commented 3 months ago

This PR removes the NumPy<2 pin, which is expected to work for RAPIDS projects once CuPy 13.3.0 is released (CuPy 13.2.0 had some issues preventing its use with NumPy 2).
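As a rough sanity check of the unpinned environment, something like the following could be run (a sketch only; the `packaging` dependency and the exact version bounds are assumptions, not part of this PR):

```python
# Hypothetical smoke test of the unpinned environment: confirm NumPy 2 plus a
# CuPy new enough to support it, then round-trip an array between the two.
import numpy as np
import cupy as cp
from packaging.version import Version

assert Version(np.__version__).major >= 2, "expected NumPy 2.x after removing the pin"
assert Version(cp.__version__) >= Version("13.3.0"), "CuPy < 13.3.0 has known NumPy 2 issues"

a = np.arange(10, dtype=np.float64)
b = cp.asarray(a) * 2
np.testing.assert_allclose(cp.asnumpy(b), a * 2)
```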

jakirkham commented 3 months ago

Updating branch to pull in the latest upstream changes and restart CI now that cuDF is done: https://github.com/rapidsai/cudf/pull/16300

jakirkham commented 3 months ago

One GHA job failed with an unrelated CUDA initialization error

Unfortunately this seems to be showing up more in CI:

Will raise this offline for discussion

```python
E   UserWarning: Error getting driver and runtime versions:
E
E   stdout:
E
E
E
E   stderr:
E
E   Traceback (most recent call last):
E     File "/opt/conda/envs/test/lib/python3.11/site-packages/numba/cuda/cudadrv/driver.py", line 254, in ensure_initialized
E       self.cuInit(0)
E     File "/opt/conda/envs/test/lib/python3.11/site-packages/numba/cuda/cudadrv/driver.py", line 327, in safe_cuda_api_call
E       self._check_ctypes_error(fname, retcode)
E     File "/opt/conda/envs/test/lib/python3.11/site-packages/numba/cuda/cudadrv/driver.py", line 395, in _check_ctypes_error
E       raise CudaAPIError(retcode, msg)
E   numba.cuda.cudadrv.driver.CudaAPIError: [999] Call to cuInit results in CUDA_ERROR_UNKNOWN
E
E   During handling of the above exception, another exception occurred:
E
E   Traceback (most recent call last):
E     File "", line 4, in
E     File "/opt/conda/envs/test/lib/python3.11/site-packages/numba/cuda/cudadrv/driver.py", line 292, in __getattr__
E       self.ensure_initialized()
E     File "/opt/conda/envs/test/lib/python3.11/site-packages/numba/cuda/cudadrv/driver.py", line 258, in ensure_initialized
E       raise CudaSupportError(f"Error at driver init: {description}")
E   numba.cuda.cudadrv.error.CudaSupportError: Error at driver init: Call to cuInit results in CUDA_ERROR_UNKNOWN (999)
E
E
E   Not patching Numba
```
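A quick way to probe for this class of failure on a CI node before the full suite runs (a sketch only, not something added by this PR):

```python
# numba.cuda.is_available() swallows driver-init errors and returns False,
# so it can be used to fail fast with a clearer message than the traceback above.
from numba import cuda

if not cuda.is_available():
    raise SystemExit("cuInit failed on this node (CUDA_ERROR_UNKNOWN); skipping GPU tests")

cuda.detect()  # prints a summary of the visible CUDA devices
```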

For now will just restart the failed jobs after the others complete

Edit: Another job had the same sort of error

Edit 2: And again in this job after restarting

jakirkham commented 2 months ago

/merge

jakirkham commented 2 months ago

Am seeing the wheel-tests-cuml CUDA 12.5 job getting stuck in the Dask tests (though don't see this with the CUDA 11.8 job). Not sure why that is given both would be using the same NumPy versions. Going to try merging in the upstream branch in case there is some fix we are missing. If there is an issue with this CI node, maybe that will give us a new one as well

jakirkham commented 2 months ago

> Am seeing the wheel-tests-cuml CUDA 12.5 job getting stuck in the Dask tests (though don't see this with the CUDA 11.8 job).

Still seeing this issue. Going to test CI separately from this change in PR: https://github.com/rapidsai/cuml/pull/6047

jakirkham commented 2 months ago

So that PR's CI builds fail because pynndescent is pinned to the old version (and thus doesn't have this fix: https://github.com/lmcinnes/pynndescent/pull/242)

jakirkham commented 2 months ago

Was searching around in the code for clues. Just came across this, which was unexpected

https://github.com/rapidsai/cuml/blob/973a65ff703a14d9dd8f355af373e2053444b27a/python/cuml/cuml/neighbors/CMakeLists.txt#L38-L40

Does cuML need NumPy at build time?

If so, would have expected to see `cimport numpy` or similar in those Cython files, but am not seeing that
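One way to double-check that (hypothetical helper, not part of the repo) is to scan the Cython sources for build-time NumPy usage:

```python
# Scan .pyx files for "cimport numpy" / "numpy.import_array", the patterns that
# would actually require NumPy headers at build time.
import re
from pathlib import Path

pattern = re.compile(r"cimport\s+numpy|numpy\.import_array")

for pyx in Path("python/cuml").rglob("*.pyx"):
    if pattern.search(pyx.read_text()):
        print(pyx)
```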

jakirkham commented 2 months ago

Divye documented the CI hang occurring with the pytest cuml-dask CUDA 12.5 wheel job in issue: https://github.com/rapidsai/cuml/issues/6050

Also he added a skip for that test in PR: https://github.com/rapidsai/cuml/pull/6051
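For reference, that kind of skip usually looks like the following (the test name and reason here are placeholders, not the actual contents of that PR):

```python
# Placeholder test name; the real skip lives in rapidsai/cuml#6051.
import pytest


@pytest.mark.skip(
    reason="Hangs in the CUDA 12.5 wheel CI job; tracked in rapidsai/cuml#6050"
)
def test_dask_example_placeholder():
    ...
```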

Unfortunately, other CI jobs still fail because NumPy 2 is unconstrained and an incompatible pynndescent gets installed, as observed in a no-change PR: https://github.com/rapidsai/cuml/pull/6047#issuecomment-2316017613

Fortunately, the fix for the latter is already included here

In the hopes of getting CI to pass, have merged Divye's PR into this one. That way all the fixes and skips for CI are in one place

jakirkham commented 2 months ago

Looks like this CI job had one test failure

```python
______________________ test_weighted_kmeans[10-10-25-100] ______________________
[gw0] linux -- Python 3.11.9 /pyenv/versions/3.11.9/bin/python

nrows = 100, ncols = 25, nclusters = 10, max_weight = 10, random_state = 428096

    @pytest.mark.parametrize("nrows", [100, 500])
    @pytest.mark.parametrize("ncols", [25])
    @pytest.mark.parametrize("nclusters", [5, 10])
    @pytest.mark.parametrize("max_weight", [10])
    def test_weighted_kmeans(nrows, ncols, nclusters, max_weight, random_state):

        # Using fairly high variance between points in clusters
        cluster_std = 1.0
        np.random.seed(random_state)

        # set weight per sample to be from 1 to max_weight
        wt = np.random.randint(1, high=max_weight, size=nrows)

        X, y = make_blobs(
            nrows,
            ncols,
            nclusters,
            cluster_std=cluster_std,
            shuffle=False,
            random_state=0,
        )

        cuml_kmeans = cuml.KMeans(
            init="k-means++",
            n_clusters=nclusters,
            n_init=10,
            random_state=random_state,
            output_type="numpy",
        )

        cuml_kmeans.fit(X, sample_weight=wt)
        cu_score = cuml_kmeans.score(X)

        sk_kmeans = cluster.KMeans(random_state=random_state, n_clusters=nclusters)
        sk_kmeans.fit(cp.asnumpy(X), sample_weight=wt)
        sk_score = sk_kmeans.score(cp.asnumpy(X))

>       assert abs(cu_score - sk_score) <= cluster_std * 1.5
E       assert 6151.191162109375 <= (1.0 * 1.5)
E        +  where 6151.191162109375 = abs((-2365.749267578125 - -8516.9404296875))

test_kmeans.py:174: AssertionError
---------------------------- Captured stdout setup -----------------------------
[D] [20:29:31.325625] /__w/cuml/cuml/python/cuml/build/cp311-cp311-linux_aarch64/cuml/internals/logger.cxx:5269 Random seed: 428096
```

Not entirely sure why that happened (or why this only happens now)

Given we don't see this test failure anywhere else, am going to assume this was a flaky test and try restarting

Though documenting here in case it comes up again (in the event it needs follow-up)

jakirkham commented 2 months ago

Thanks all for your help here! 🙏

Looks like it passed and the old merge comment ( https://github.com/rapidsai/cuml/pull/6031#issuecomment-2310921588 ) took effect

Let's follow up on the hanging test in issue: https://github.com/rapidsai/cuml/issues/6050

Happy to discuss anything else in new issues 🙂