Updating branch to pull in the latest upstream changes and restart CI now that cuDF is done: https://github.com/rapidsai/cudf/pull/16300
One GHA job failed with an unrelated CUDA initialization error
Unfortunately this error seems to be showing up more often in CI.
Will raise this offline for discussion
For now will just restart the failed jobs after the others complete
Edit: Another job had the same sort of error
Edit 2: And again in this job after restarting
/merge
Am seeing the wheel-tests-cuml CUDA 12.5 job getting stuck in the Dask tests (though I don't see this with the CUDA 11.8 job). Not sure why that is, given both would be using the same NumPy versions. Going to try merging in the upstream branch in case there is some fix we are missing. If there is an issue with this CI node, maybe that will give us a new one as well.
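If it keeps hanging, one way to see where it is stuck is a faulthandler watchdog. This is only a local debugging sketch (the autouse fixture and the 10-minute timeout are my own assumptions, not anything in the repo):

# Hypothetical conftest.py helper for debugging the hang locally; the 600s
# timeout is an arbitrary assumption. faulthandler prints tracebacks for all
# threads to stderr if the process is still running when the timer fires.
import faulthandler

import pytest


@pytest.fixture(autouse=True)
def dump_tracebacks_on_hang():
    # Arm a watchdog: if the test is still running after 600s, dump tracebacks
    # (without killing the process) so the logs show where it is stuck.
    faulthandler.dump_traceback_later(600, exit=False)
    yield
    # Disarm the watchdog once the test finishes normally.
    faulthandler.cancel_dump_traceback_later()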
Still seeing this issue. Going to test CI separately from this change in PR: https://github.com/rapidsai/cuml/pull/6047
So that PR's CI builds fail because pynndescent is pinned to an old version (and thus doesn't have this fix: https://github.com/lmcinnes/pynndescent/pull/242).
Was searching around in the code for clues and just came across this, which was unexpected.
Does cuML need NumPy at build time?
If so, would have expected to see cimport numpy or similar in those Cython files, but am not seeing that.
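For reference, this is roughly how I'd check that (a sketch only; the python/cuml source root is an assumption about the checkout layout):

# Sketch: list Cython sources that reference NumPy at compile time.
# "python/cuml" is an assumed path to the Cython sources in a cuml checkout.
from pathlib import Path

for pyx in sorted(Path("python/cuml").rglob("*.pyx")):
    if "cimport numpy" in pyx.read_text():
        print(pyx)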
Divye documented the CI hang occurring with the pytest cuml-dask CUDA 12.5 wheel job in this issue: https://github.com/rapidsai/cuml/issues/6050
He also added a skip for that test in PR: https://github.com/rapidsai/cuml/pull/6051
Unfortunately, other CI jobs still fail due to NumPy 2 being unconstrained and an incompatible pynndescent being installed, as observed in a no-change PR: https://github.com/rapidsai/cuml/pull/6047#issuecomment-2316017613
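For anyone debugging this locally, a quick check of what actually got installed (just a sketch; it only prints versions and doesn't know which pynndescent release contains the fix linked above):

# Sketch: print the NumPy and pynndescent versions in the current environment to
# confirm whether an older pynndescent (without the NumPy 2 fix above) was installed.
from importlib.metadata import PackageNotFoundError, version

for pkg in ("numpy", "pynndescent"):
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "not installed")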
Fortunately the latter fix is already here
In the hopes of getting CI to pass, have merged Divye's PR into this one. That way all the fixes and skips for CI are in one place
Looks like this CI job had one test failure:
______________________ test_weighted_kmeans[10-10-25-100] ______________________
[gw0] linux -- Python 3.11.9 /pyenv/versions/3.11.9/bin/python

nrows = 100, ncols = 25, nclusters = 10, max_weight = 10, random_state = 428096

    @pytest.mark.parametrize("nrows", [100, 500])
    @pytest.mark.parametrize("ncols", [25])
    @pytest.mark.parametrize("nclusters", [5, 10])
    @pytest.mark.parametrize("max_weight", [10])
    def test_weighted_kmeans(nrows, ncols, nclusters, max_weight, random_state):
        # Using fairly high variance between points in clusters
        cluster_std = 1.0
        np.random.seed(random_state)
        # set weight per sample to be from 1 to max_weight
        wt = np.random.randint(1, high=max_weight, size=nrows)
        X, y = make_blobs(
            nrows,
            ncols,
            nclusters,
            cluster_std=cluster_std,
            shuffle=False,
            random_state=0,
        )
        cuml_kmeans = cuml.KMeans(
            init="k-means++",
            n_clusters=nclusters,
            n_init=10,
            random_state=random_state,
            output_type="numpy",
        )
        cuml_kmeans.fit(X, sample_weight=wt)
        cu_score = cuml_kmeans.score(X)
        sk_kmeans = cluster.KMeans(random_state=random_state, n_clusters=nclusters)
        sk_kmeans.fit(cp.asnumpy(X), sample_weight=wt)
        sk_score = sk_kmeans.score(cp.asnumpy(X))
>       assert abs(cu_score - sk_score) <= cluster_std * 1.5
E       assert 6151.191162109375 <= (1.0 * 1.5)
E        +  where 6151.191162109375 = abs((-2365.749267578125 - -8516.9404296875))

test_kmeans.py:174: AssertionError
---------------------------- Captured stdout setup -----------------------------
[D] [20:29:31.325625] /__w/cuml/cuml/python/cuml/build/cp311-cp311-linux_aarch64/cuml/internals/logger.cxx:5269 Random seed: 428096
Not entirely sure why that happened (or why it only happens now).
Given we don't see this test failure anywhere else, am going to assume this was a flaky test and try restarting.
Though documenting it here in case it comes up again (in the event it needs follow-up).
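For the record, a condensed local repro sketch using the seed captured in the job's stdout (this assumes the test's make_blobs is cuml.datasets.make_blobs and that a GPU environment with cuml, cupy, and scikit-learn is available; I haven't confirmed it reproduces the failure):

# Sketch of a standalone repro for the flaky assertion above, seeded with the
# value from the CI log. Mirrors the test body shown in the traceback.
import cupy as cp
import numpy as np
from sklearn import cluster

import cuml
from cuml.datasets import make_blobs  # assumed to be the make_blobs used by the test

random_state = 428096  # "Random seed" from the captured stdout
nrows, ncols, nclusters, max_weight = 100, 25, 10, 10

np.random.seed(random_state)
wt = np.random.randint(1, high=max_weight, size=nrows)
X, y = make_blobs(nrows, ncols, nclusters, cluster_std=1.0, shuffle=False, random_state=0)

cu_km = cuml.KMeans(init="k-means++", n_clusters=nclusters, n_init=10,
                    random_state=random_state, output_type="numpy")
cu_km.fit(X, sample_weight=wt)

sk_km = cluster.KMeans(random_state=random_state, n_clusters=nclusters)
sk_km.fit(cp.asnumpy(X), sample_weight=wt)

# Both score() values are the negated K-means objective, so the gap should be small.
print(abs(cu_km.score(X) - sk_km.score(cp.asnumpy(X))))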
Thanks all for your help here! 🙏
Looks like it passed and the old merge comment (https://github.com/rapidsai/cuml/pull/6031#issuecomment-2310921588) took effect.
Let's follow up on the hanging test in issue: https://github.com/rapidsai/cuml/issues/6050
Happy to discuss anything else in new issues 🙂
This PR removes the NumPy<2 pin, which is expected to work for RAPIDS projects once CuPy 13.3.0 is released (CuPy 13.2.0 had some issues that prevented its use with NumPy 2).
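As a quick environment sanity check once that release is out, something like this can flag the bad combination (a sketch; it assumes packaging is installed, and the 13.3.0 cutoff comes from the note above):

# Sketch: warn when NumPy 2 is paired with a CuPy older than 13.3.0, the
# combination described above as problematic.
import cupy
import numpy
from packaging.version import Version

if Version(numpy.__version__).major >= 2 and Version(cupy.__version__) < Version("13.3.0"):
    print(f"CuPy {cupy.__version__} with NumPy {numpy.__version__}: "
          "upgrade to CuPy>=13.3.0 for NumPy 2 support")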