rapidsai / raft

RAFT contains fundamental widely-used algorithms and primitives for machine learning and information retrieval. The algorithms are CUDA-accelerated and form building blocks for more easily writing high performance applications.

[BUG] wheel tests do not fail when `raft-dask` wheel has unsatisfiable dependency requirements #2348

Closed (jameslamb closed this issue 1 month ago)

jameslamb commented 1 month ago

Describe the bug

We recently observed a situation where raft-dask nightly wheels were being published with duplicated dependencies: the wheel metadata listed both CUDA-suffixed requirements (e.g. `pylibraft-cu12`) and unsuffixed ones (e.g. `pylibraft`).

The unsuffixed entries are a mistake, fixed in #2347. However... that mistake was only caught by cugraph's CI (build link).
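
For reference, the relevant excerpt from the `raft_dask` wheel's `requires_dist` (taken from the `pkginfo` output shown later in this report) looks like this:

```text
'pylibraft-cu12==24.8.*,>=0.0.0a0', 'pylibraft==24.8.*,>=0.0.0a0',
'ucx-py-cu12==0.39.*', 'ucx-py==0.39.*'
```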

It should have been caught here in raft's CI, probably at this line of the wheel test script:

https://github.com/rapidsai/raft/blob/8ef71de26b01458f02f36ad96c1b3017cf985cc5/ci/test_wheel_raft_dask.sh#L14
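
For context, that install step (mirrored later in this report) looks roughly like the following; this is a paraphrase, so the exact flags may differ from the linked script:

```shell
# paraphrased from ci/test_wheel_raft_dask.sh; see the link above for the exact command
python -m pip install "raft_dask-${RAPIDS_PY_CUDA_SUFFIX}[test]>=0.0.0a0" --find-links dist/
```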

Steps/Code to reproduce bug

I tried to reproduce a very recent CI build that passed despite using wheels that suffer from the issue fixed in #2347 (build link).

Ran a container mimicking what was used in that CI run:

```shell
docker run \
    --rm \
    --env NVIDIA_VISIBLE_DEVICES \
    --env RAPIDS_BUILD_TYPE="pull-request" \
    --env RAPIDS_REPOSITORY="rapidsai/raft" \
    --env RAPIDS_REF_NAME=pull-request/2343 \
    -it rapidsai/citestwheel:cuda12.2.2-ubuntu22.04-py3.9 \
    bash
```

And then ran the following code mirroring ci/test_wheel_raft_dask.sh (code link), with a bit of extra debugging added.

<details><summary>setup mimicking what happens in CI (click me)</summary>

Checked if there was extra `pip` configuration set up in the image.

```shell
pip config list
```

Just one, an index URL.

```text
# :env:.extra-index-url='https://pypi.anaconda.org/rapidsai-wheels-nightly/simple'
```

Checked the version of `pip`.

```shell
pip --version
# 23.0.1
```

Installed `pkginfo` to inspect the wheels.

```shell
pip install pkginfo
```

Downloaded wheels from the same CI run and put them in separate directories.

```shell
mkdir -p ./dist
RAPIDS_PY_CUDA_SUFFIX="$(rapids-wheel-ctk-name-gen ${RAPIDS_CUDA_VERSION})"

# git ref (entered in interactive prompt): 04186e4
RAPIDS_PY_WHEEL_NAME="raft_dask_${RAPIDS_PY_CUDA_SUFFIX}" rapids-download-wheels-from-s3 ./dist
RAPIDS_PY_WHEEL_NAME="pylibraft_${RAPIDS_PY_CUDA_SUFFIX}" rapids-download-wheels-from-s3 ./local-pylibraft-dep
```

Inspected them to confirm that:

* both wheels' `name` fields have the `-cu12` suffix
* the `raft_dask` wheel depends on both `pylibraft-cu12` and `pylibraft`

They do.

```shell
# raft-dask
pkginfo \
    --field=name \
    --field=version \
    --field=requires_dist \
    ./dist/raft_dask_cu12-*cp39*.whl

# name: raft-dask-cu12
# version: 24.8.0a20
# requires_dist: ['dask-cuda==24.8.*,>=0.0.0a0', 'distributed-ucxx-cu12==0.39.*', 'joblib>=0.11', 'numba>=0.57', 'numpy<2.0a0,>=1.23', 'pylibraft-cu12==24.8.*,>=0.0.0a0', 'pylibraft==24.8.*,>=0.0.0a0', 'rapids-dask-dependency==24.8.*,>=0.0.0a0', 'ucx-py-cu12==0.39.*', 'ucx-py==0.39.*', 'pytest-cov; extra == "test"', 'pytest==7.*; extra == "test"']

# pylibraft
pkginfo \
    --field=name \
    --field=version \
    --field=requires_dist \
    ./local-pylibraft-dep/pylibraft_cu12-*cp39*.whl

# name: pylibraft-cu12
# version: 24.8.0a20
# requires_dist: ['cuda-python<13.0a0,>=12.0', 'numpy<2.0a0,>=1.23', 'rmm-cu12==24.8.*,>=0.0.0a0', 'cupy-cuda12x>=12.0.0; extra == "test"', 'pytest-cov; extra == "test"', 'pytest==7.*; extra == "test"', 'scikit-learn; extra == "test"', 'scipy; extra == "test"']
```

Installed the `pylibraft` wheel, just as the test script does.

```shell
python -m pip -v install --no-deps ./local-pylibraft-dep/pylibraft*.whl
```

That worked as expected.

```text
Processing /local-pylibraft-dep/pylibraft_cu12-24.8.0a20-cp39-cp39-manylinux_2_28_x86_64.whl
Installing collected packages: pylibraft-cu12
Successfully installed pylibraft-cu12-24.8.0a20
```

</details>

With that set up (a `raft_dask-cu12` wheel in `./dist` and `pylibraft-cu12` already installed), I ran the following:

```shell
python -m pip -v install "raft_dask-${RAPIDS_PY_CUDA_SUFFIX}[test]>=0.0.0a0" --find-links dist/
```

Just like we observed in CI, the install succeeded:

```text
Successfully installed MarkupSafe-2.1.5 click-8.1.7 cloudpickle-3.0.0 coverage-7.5.3 cuda-python-12.5.0 dask-2024.5.1 dask-cuda-24.8.0a0 dask-expr-1.1.1 distributed-2024.5.1 distributed-ucxx-cu12-0.39.0a0 exceptiongroup-1.2.1 fsspec-2024.5.0 importlib-metadata-7.1.0 iniconfig-2.0.0 jinja2-3.1.4 joblib-1.4.2 libucx-cu12-1.15.0.post1 llvmlite-0.42.0 locket-1.0.0 msgpack-1.0.8 numba-0.59.1 numpy-1.26.4 packaging-24.0 pandas-2.2.2 partd-1.4.2 pluggy-1.5.0 psutil-5.9.8 pyarrow-16.1.0 pynvml-11.4.1 pytest-7.4.4 pytest-cov-5.0.0 python-dateutil-2.9.0.post0 pytz-2024.1 pyyaml-6.0.1 raft_dask-cu12-24.8.0a18 rapids-dask-dependency-24.8.0a4 rmm-cu12-24.8.0a6 six-1.16.0 sortedcontainers-2.4.0 tblib-3.0.0 tomli-2.0.1 toolz-0.12.1 tornado-6.4 tzdata-2024.1 ucx-py-cu12-0.39.0a0 ucxx-cu12-0.39.0a0 urllib3-2.2.1 zict-3.0.0 zipp-3.19
```
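
Note that the log above shows `raft_dask-cu12-24.8.0a18` being installed, while the wheel downloaded into `./dist` is `24.8.0a20`; that suggests pip resolved the package from the nightly index rather than from `--find-links`. A quick way to double-check which distribution actually ended up installed (an extra debugging step, not part of the test script):

```shell
# report the name/version/location of whatever raft-dask-cu12 got installed
python -m pip show raft-dask-cu12
```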

HOWEVER... this alternative form, which installs the local wheel file directly, fails in the expected way:

```shell
python -m pip -v install ./dist/*.whl
```

```text
ERROR: Could not find a version that satisfies the requirement ucx-py==0.39.* (from raft-dask-cu12) (from versions: 0.0.1.post1)
ERROR: No matching distribution found for ucx-py==0.39.*
```

Expected behavior

I expected CI to fail because the constraints `pylibraft==24.8.*` and `ucx-py==0.39.*` are not satisfiable (no distributions matching those names and versions exist on the configured indexes).
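
As a quick sanity check of that expectation (not something the test script does), pip can be asked to resolve just those constraints without installing anything; assuming the same nightly index configured above, this should fail with a "No matching distribution found" error:

```shell
# try to resolve only the unsuffixed constraints; expected to fail
python -m pip install --dry-run "pylibraft==24.8.*" "ucx-py==0.39.*"
```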

Environment details (please complete the following information):

<details><summary>nvidia-smi (click me)</summary>

```text
Fri May 31 12:06:47 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08             Driver Version: 535.161.08   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100-SXM2-32GB           On  | 00000000:06:00.0 Off |                    0 |
| N/A   33C    P0              55W / 300W |    341MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2-32GB           On  | 00000000:07:00.0 Off |                    0 |
| N/A   33C    P0              42W / 300W |      3MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2-32GB           On  | 00000000:0A:00.0 Off |                    0 |
| N/A   31C    P0              42W / 300W |      3MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2-32GB           On  | 00000000:0B:00.0 Off |                    0 |
| N/A   29C    P0              41W / 300W |      3MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2-32GB           On  | 00000000:85:00.0 Off |                    0 |
| N/A   31C    P0              41W / 300W |      3MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2-32GB           On  | 00000000:86:00.0 Off |                    0 |
| N/A   30C    P0              42W / 300W |      3MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2-32GB           On  | 00000000:89:00.0 Off |                    0 |
| N/A   34C    P0              43W / 300W |      3MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2-32GB           On  | 00000000:8A:00.0 Off |                    0 |
| N/A   30C    P0              43W / 300W |      3MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                             GPU Memory |
|        ID   ID                                                              Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
```

</details>

Additional context

The particular unsatisfiable-dependency issue was likely introduced by recent changes adding rapids-build-backend (#2331, for https://github.com/rapidsai/build-planning/issues/31). But in theory this could just as easily happen with some other unrelated dependency mistake, like a typo of the form `joblibbbbb`.
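
To illustrate the kind of guard that could catch this class of problem, here is a purely hypothetical sketch (not something the current CI scripts do) that fails if a CUDA-suffixed wheel still declares unsuffixed RAPIDS requirements:

```shell
# hypothetical check: inspect the wheel's metadata and fail if it still lists
# unsuffixed 'pylibraft' or 'ucx-py' requirements alongside the -cu12 ones
if pkginfo --field=requires_dist ./dist/raft_dask_cu12-*.whl | grep -E "'(pylibraft|ucx-py)=="; then
    echo "ERROR: unsuffixed RAPIDS dependencies found in wheel metadata"
    exit 1
fi
```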

I am actively investigating this (along with @bdice and @nv-rliu). Just posting for documentation purposes.

nv-rliu commented 1 month ago

Wow, well written! Thanks for documenting this bug