rapidsai / cugraph

cuGraph - RAPIDS Graph Analytics Library
https://docs.rapids.ai/api/cugraph/stable/
Apache License 2.0

Constrain versions of PyTorch and CI artifacts in CI Runs, upgrade to dgl 2.4 #4690

Closed alexbarghi-nv closed 1 month ago

alexbarghi-nv commented 1 month ago

We were pulling the wrong packages because the PyTorch version constraint wasn't tight enough. Hopefully these sorts of issues will be resolved in the cugraph-gnn repository going forward, where we can pin a specific pytorch version for testing.

jameslamb commented 1 month ago

(Summarizing some offline conversations, to get this into the public record here on GitHub)

For the last few days (unsure how long), CI jobs here targeting branch-24.10 have been silently getting 24.12 nightly packages. This PR fixes that, and that's exposing a dependency conflict for cugraph-pyg.

conda cannot install cugraph-pyg and pytorch>=2.3,<2.4 together, because there are not any pyg packages that support pytorch>=2.3.

full conda solve error trace (click me) ```text Looking for: ['cugraph-pyg=24.10', "pytorch[version='>=2.3,<2.4']", 'ogb'] Pinned packages: - python 3.10.* Could not solve for environment specs The following packages are incompatible ├─ cugraph-pyg 24.10** is installable with the potential options │ ├─ cugraph-pyg [24.10.00a84|24.10.00a85|24.10.00a86|24.10.00a94] would require │ │ └─ pyg >=2.5,<2.6 with the potential options │ │ ├─ pyg [2.5.0|2.5.1|2.5.2] would require │ │ │ └─ pytorch 1.12.* with the potential options │ │ │ ├─ pytorch [1.12.0|1.12.1], which can be installed; │ │ │ ├─ pytorch [1.12.0|1.12.1|1.13.0|1.13.1] would require │ │ │ │ └─ python >=3.7,<3.8.0a0 , which can be installed; │ │ │ ├─ pytorch [1.12.0|1.12.1|...|2.3.1] would require │ │ │ │ └─ python >=3.8,<3.9.0a0 , which can be installed; │ │ │ └─ pytorch [1.12.0|1.12.1|...|2.3.1] would require │ │ │ └─ python >=3.9,<3.10.0a0 , which can be installed; │ │ ├─ pyg [2.5.0|2.5.1|2.5.2] would require │ │ │ └─ pytorch 1.13.* with the potential options │ │ │ ├─ pytorch [1.12.0|1.12.1|1.13.0|1.13.1], which can be installed (as previously explained); │ │ │ ├─ pytorch [1.12.0|1.12.1|...|2.3.1], which can be installed (as previously explained); │ │ │ ├─ pytorch [1.12.0|1.12.1|...|2.3.1], which can be installed (as previously explained); │ │ │ ├─ pytorch [1.13.0|1.13.1], which can be installed; │ │ │ └─ pytorch [1.13.0|1.13.1|...|2.3.1] would require │ │ │ └─ python >=3.11,<3.12.0a0 , which can be installed; │ │ ├─ pyg [2.5.0|2.5.1|2.5.2] would require │ │ │ └─ pytorch 2.0.* with the potential options │ │ │ ├─ pytorch [1.12.0|1.12.1|...|2.3.1], which can be installed (as previously explained); │ │ │ ├─ pytorch [1.12.0|1.12.1|...|2.3.1], which can be installed (as previously explained); │ │ │ ├─ pytorch [1.13.0|1.13.1|...|2.3.1], which can be installed (as previously explained); │ │ │ └─ pytorch [2.0.0|2.0.1], which can be installed; │ │ ├─ pyg [2.5.0|2.5.1|2.5.2] would require │ │ │ └─ pytorch 2.1.* with the 
potential options │ │ │ ├─ pytorch [1.12.0|1.12.1|...|2.3.1], which can be installed (as previously explained); │ │ │ ├─ pytorch [1.12.0|1.12.1|...|2.3.1], which can be installed (as previously explained); │ │ │ ├─ pytorch [1.13.0|1.13.1|...|2.3.1], which can be installed (as previously explained); │ │ │ ├─ pytorch [2.1.0|2.1.1|2.1.2], which can be installed; │ │ │ └─ pytorch [2.1.0|2.1.2|...|2.3.1] would require │ │ │ └─ python >=3.12,<3.13.0a0 , which can be installed; │ │ ├─ pyg [2.5.0|2.5.1|2.5.2] would require │ │ │ └─ pytorch 2.2.* with the potential options │ │ │ ├─ pytorch [1.12.0|1.12.1|...|2.3.1], which can be installed (as previously explained); │ │ │ ├─ pytorch [1.12.0|1.12.1|...|2.3.1], which can be installed (as previously explained); │ │ │ ├─ pytorch [1.13.0|1.13.1|...|2.3.1], which can be installed (as previously explained); │ │ │ ├─ pytorch [2.1.0|2.1.2|...|2.3.1], which can be installed (as previously explained); │ │ │ └─ pytorch [2.2.0|2.2.1|2.2.2], which can be installed; │ │ ├─ pyg [2.5.0|2.5.1|2.5.2] would require │ │ │ └─ python >=3.11,<3.12.0a0 , which can be installed; │ │ ├─ pyg [2.5.0|2.5.1|2.5.2] would require │ │ │ └─ python >=3.12,<3.13.0a0 , which can be installed; │ │ ├─ pyg [2.5.0|2.5.1|2.5.2] would require │ │ │ └─ python >=3.8,<3.9.0a0 , which can be installed; │ │ └─ pyg [2.5.0|2.5.1|2.5.2] would require │ │ └─ python >=3.9,<3.10.0a0 , which can be installed; │ ├─ cugraph-pyg 24.10.00a0 would require │ │ └─ cugraph 24.10.00a0.* , which does not exist (perhaps a missing channel); │ ├─ cugraph-pyg 24.10.00a17 would require │ │ └─ cugraph 24.10.00a17.* , which does not exist (perhaps a missing channel); │ ├─ cugraph-pyg 24.10.00a19 would require │ │ └─ cugraph 24.10.00a19.* , which does not exist (perhaps a missing channel); │ ├─ cugraph-pyg 24.10.00a22 would require │ │ └─ cugraph 24.10.00a22.* , which does not exist (perhaps a missing channel); │ ├─ cugraph-pyg 24.10.00a28 would require │ │ └─ cugraph 24.10.00a28.* , which does 
not exist (perhaps a missing channel); │ ├─ cugraph-pyg 24.10.00a38 would require │ │ └─ cugraph 24.10.00a38.* , which does not exist (perhaps a missing channel); │ ├─ cugraph-pyg 24.10.00a39 would require │ │ └─ cugraph 24.10.00a39.* , which does not exist (perhaps a missing channel); │ ├─ cugraph-pyg 24.10.00a40 would require │ │ └─ cugraph 24.10.00a40.* , which does not exist (perhaps a missing channel); │ ├─ cugraph-pyg 24.10.00a44 would require │ │ └─ cugraph 24.10.00a44.* , which does not exist (perhaps a missing channel); │ ├─ cugraph-pyg 24.10.00a45 would require │ │ └─ cugraph 24.10.00a45.* , which does not exist (perhaps a missing channel); │ ├─ cugraph-pyg 24.10.00a48 would require │ │ └─ cugraph 24.10.00a48.* , which does not exist (perhaps a missing channel); │ ├─ cugraph-pyg 24.10.00a49 would require │ │ └─ cugraph 24.10.00a49.* , which does not exist (perhaps a missing channel); │ ├─ cugraph-pyg 24.10.00a50 would require │ │ └─ cugraph 24.10.00a50.* , which does not exist (perhaps a missing channel); │ ├─ cugraph-pyg 24.10.00a52 would require │ │ └─ cugraph 24.10.00a52.* , which does not exist (perhaps a missing channel); │ ├─ cugraph-pyg 24.10.00a53 would require │ │ └─ cugraph 24.10.00a53.* , which does not exist (perhaps a missing channel); │ ├─ cugraph-pyg 24.10.00a54 would require │ │ └─ cugraph 24.10.00a54.* , which does not exist (perhaps a missing channel); │ ├─ cugraph-pyg 24.10.00a55 would require │ │ └─ cugraph 24.10.00a55.* , which does not exist (perhaps a missing channel); │ ├─ cugraph-pyg 24.10.00a56 would require │ │ └─ cugraph 24.10.00a56.* , which does not exist (perhaps a missing channel); │ ├─ cugraph-pyg 24.10.00a57 would require │ │ └─ cugraph 24.10.00a57.* , which does not exist (perhaps a missing channel); │ ├─ cugraph-pyg 24.10.00a58 would require │ │ └─ cugraph 24.10.00a58.* , which does not exist (perhaps a missing channel); │ ├─ cugraph-pyg 24.10.00a59 would require │ │ └─ cugraph 24.10.00a59.* , which does not exist 
(perhaps a missing channel); │ ├─ cugraph-pyg 24.10.00a60 would require │ │ └─ cugraph 24.10.00a60.* , which does not exist (perhaps a missing channel); │ ├─ cugraph-pyg 24.10.00a61 would require │ │ └─ cugraph 24.10.00a61.* , which does not exist (perhaps a missing channel); │ ├─ cugraph-pyg 24.10.00a63 would require │ │ └─ cugraph 24.10.00a63.* , which does not exist (perhaps a missing channel); │ ├─ cugraph-pyg 24.10.00a65 would require │ │ └─ cugraph 24.10.00a65.* , which does not exist (perhaps a missing channel); │ ├─ cugraph-pyg 24.10.00a66 would require │ │ └─ cugraph 24.10.00a66.* , which does not exist (perhaps a missing channel); │ ├─ cugraph-pyg 24.10.00a68 would require │ │ └─ cugraph 24.10.00a68.* , which does not exist (perhaps a missing channel); │ ├─ cugraph-pyg 24.10.00a69 would require │ │ └─ cugraph 24.10.00a69.* , which does not exist (perhaps a missing channel); │ ├─ cugraph-pyg 24.10.00a70 would require │ │ └─ cugraph 24.10.00a70.* , which does not exist (perhaps a missing channel); │ ├─ cugraph-pyg 24.10.00a71 would require │ │ └─ cugraph 24.10.00a71.* , which does not exist (perhaps a missing channel); │ ├─ cugraph-pyg 24.10.00a72 would require │ │ └─ cugraph 24.10.00a72.* , which does not exist (perhaps a missing channel); │ ├─ cugraph-pyg 24.10.00a73 would require │ │ └─ cugraph 24.10.00a73.* , which does not exist (perhaps a missing channel); │ ├─ cugraph-pyg 24.10.00a74 would require │ │ └─ cugraph 24.10.00a74.* , which does not exist (perhaps a missing channel); │ ├─ cugraph-pyg 24.10.00a75 would require │ │ └─ cugraph 24.10.00a75.* , which does not exist (perhaps a missing channel); │ ├─ cugraph-pyg 24.10.00a77 would require │ │ └─ cugraph 24.10.00a77.* , which does not exist (perhaps a missing channel); │ ├─ cugraph-pyg 24.10.00a78 would require │ │ └─ cugraph 24.10.00a78.* , which does not exist (perhaps a missing channel); │ ├─ cugraph-pyg 24.10.00a79 would require │ │ └─ cugraph 24.10.00a79.* , which does not exist (perhaps a 
missing channel); │ ├─ cugraph-pyg 24.10.00a80 would require │ │ └─ cugraph 24.10.00a80.* , which does not exist (perhaps a missing channel); │ ├─ cugraph-pyg 24.10.00a81 would require │ │ └─ cugraph 24.10.00a81.* , which does not exist (perhaps a missing channel); │ ├─ cugraph-pyg 24.10.00a82 would require │ │ └─ cugraph 24.10.00a82.* , which does not exist (perhaps a missing channel); │ ├─ cugraph-pyg 24.10.00a83 would require │ │ └─ cugraph 24.10.00a83.* , which does not exist (perhaps a missing channel); │ ├─ cugraph-pyg [24.10.00a84|24.10.00a85|...|24.10.00a93] would require │ │ └─ python >=3.11,<3.12.0a0 , which can be installed; │ ├─ cugraph-pyg [24.10.00a84|24.10.00a85|...|24.10.00a93] would require │ │ └─ python >=3.12,<3.13.0a0 , which can be installed; │ └─ cugraph-pyg [24.10.00a87|24.10.00a88|24.10.00a89|24.10.00a91|24.10.00a93] would require │ ├─ pyg >=2.5,<2.6 , which can be installed (as previously explained); │ └─ pytorch >=2.3,<2.4.0a0 with the potential options │ ├─ pytorch [1.12.0|1.12.1|...|2.3.1], which can be installed (as previously explained); │ ├─ pytorch [1.12.0|1.12.1|...|2.3.1], which can be installed (as previously explained); │ ├─ pytorch [1.13.0|1.13.1|...|2.3.1], which can be installed (as previously explained); │ ├─ pytorch [2.1.0|2.1.2|...|2.3.1], which can be installed (as previously explained); │ └─ pytorch [2.3.0|2.3.1] conflicts with any installable versions previously reported; ├─ libtorch is installable with the potential options │ ├─ libtorch 2.3.1 would require │ │ └─ pytorch 2.3.1 cuda118_*_300, which can be installed; │ ├─ libtorch 2.1.0 would require │ │ └─ pytorch 2.1.0 cpu_generic_*_2, which can be installed; │ ├─ libtorch 2.1.0 would require │ │ └─ pytorch 2.1.0 cpu_generic_*_3, which can be installed; │ ├─ libtorch 2.1.0 would require │ │ └─ pytorch 2.1.0 cpu_mkl_*_102, which can be installed; │ ├─ libtorch 2.1.0 would require │ │ └─ pytorch 2.1.0 cpu_mkl_*_103, which can be installed; │ ├─ libtorch 2.1.0 would 
require │ │ └─ pytorch 2.1.0 cuda112_*_302, which can be installed; │ ├─ libtorch 2.1.0 would require │ │ └─ pytorch 2.1.0 cuda112_*_303, which can be installed; │ ├─ libtorch 2.1.0 would require │ │ └─ pytorch 2.1.0 cuda118_*_302, which can be installed; │ ├─ libtorch 2.1.0 would require │ │ └─ pytorch 2.1.0 cuda118_*_303, which can be installed; │ ├─ libtorch 2.1.0 would require │ │ └─ pytorch 2.1.0 cuda120_*_302, which can be installed; │ ├─ libtorch 2.1.0 would require │ │ └─ pytorch 2.1.0 cuda120_*_303, which can be installed; │ ├─ libtorch 2.1.2 would require │ │ └─ pytorch 2.1.2 cpu_generic_*_4, which can be installed; │ ├─ libtorch 2.1.2 would require │ │ └─ pytorch 2.1.2 cpu_generic_*_0, which can be installed; │ ├─ libtorch 2.1.2 would require │ │ └─ pytorch 2.1.2 cpu_generic_*_1, which can be installed; │ ├─ libtorch 2.1.2 would require │ │ └─ pytorch 2.1.2 cpu_generic_*_3, which can be installed; │ ├─ libtorch 2.1.2 would require │ │ └─ pytorch 2.1.2 cpu_mkl_*_100, which can be installed; │ ├─ libtorch 2.1.2 would require │ │ └─ pytorch 2.1.2 cpu_mkl_*_101, which can be installed; │ ├─ libtorch 2.1.2 would require │ │ └─ pytorch 2.1.2 cpu_mkl_*_103, which can be installed; │ ├─ libtorch 2.1.2 would require │ │ └─ pytorch 2.1.2 cpu_mkl_*_104, which can be installed; │ ├─ libtorch 2.1.2 would require │ │ └─ pytorch 2.1.2 cuda112_*_300, which can be installed; │ ├─ libtorch 2.1.2 would require │ │ └─ pytorch 2.1.2 cuda112_*_301, which can be installed; │ ├─ libtorch 2.1.2 would require │ │ └─ pytorch 2.1.2 cuda118_*_301, which can be installed; │ ├─ libtorch 2.1.2 would require │ │ └─ pytorch 2.1.2 cuda118_*_303, which can be installed; │ ├─ libtorch 2.1.2 would require │ │ └─ pytorch 2.1.2 cuda118_*_300, which can be installed; │ ├─ libtorch 2.1.2 would require │ │ └─ pytorch 2.1.2 cuda118_*_304, which can be installed; │ ├─ libtorch 2.1.2 would require │ │ └─ pytorch 2.1.2 cuda120_*_301, which can be installed; │ ├─ libtorch 2.1.2 would require │ │ └─ 
pytorch 2.1.2 cuda120_*_303, which can be installed; │ ├─ libtorch 2.1.2 would require │ │ └─ pytorch 2.1.2 cuda120_*_300, which can be installed; │ ├─ libtorch 2.1.2 would require │ │ └─ pytorch 2.1.2 cuda120_*_304, which can be installed; │ ├─ libtorch 2.3.0 would require │ │ └─ pytorch 2.3.0 cpu_generic_*_0, which can be installed; │ ├─ libtorch 2.3.0 would require │ │ └─ pytorch 2.3.0 cpu_generic_*_1, which can be installed; │ ├─ libtorch 2.3.0 would require │ │ └─ pytorch 2.3.0 cpu_mkl_*_101, which can be installed; │ ├─ libtorch 2.3.0 would require │ │ └─ pytorch 2.3.0 cpu_mkl_*_100, which can be installed; │ ├─ libtorch 2.3.0 would require │ │ └─ pytorch 2.3.0 cuda118_*_301, which can be installed; │ ├─ libtorch 2.3.0 would require │ │ └─ pytorch 2.3.0 cuda118_*_300, which can be installed; │ ├─ libtorch 2.3.0 would require │ │ └─ pytorch 2.3.0 cuda120_*_301, which can be installed; │ ├─ libtorch 2.3.0 would require │ │ └─ pytorch 2.3.0 cuda120_*_300, which can be installed; │ ├─ libtorch 2.3.1 would require │ │ └─ pytorch 2.3.1 cpu_generic_*_0, which can be installed; │ ├─ libtorch 2.3.1 would require │ │ └─ pytorch 2.3.1 cpu_mkl_*_100, which can be installed; │ ├─ libtorch 2.3.1 would require │ │ └─ pytorch 2.3.1 cuda120_*_300, which can be installed; │ ├─ libtorch 2.4.0 would require │ │ └─ pytorch 2.4.0 cpu_generic_*_1, which can be installed; │ ├─ libtorch 2.4.0 would require │ │ └─ pytorch 2.4.0 cpu_generic_*_0, which can be installed; │ ├─ libtorch 2.4.0 would require │ │ └─ pytorch 2.4.0 cpu_mkl_*_100, which can be installed; │ ├─ libtorch 2.4.0 would require │ │ └─ pytorch 2.4.0 cpu_mkl_*_101, which can be installed; │ ├─ libtorch 2.4.0 would require │ │ └─ pytorch 2.4.0 cuda118_*_300, which can be installed; │ ├─ libtorch 2.4.0 would require │ │ └─ pytorch 2.4.0 cuda118_*_301, which can be installed; │ ├─ libtorch 2.4.0 would require │ │ └─ pytorch 2.4.0 cuda120_*_300, which can be installed; │ └─ libtorch 2.4.0 would require │ └─ pytorch 2.4.0 
cuda120_*_301, which can be installed; └─ pytorch >=2.3,<2.4 is installable with the potential options ├─ pytorch [1.12.0|1.12.1|...|2.3.1], which can be installed (as previously explained); ├─ pytorch [1.12.0|1.12.1|...|2.3.1], which can be installed (as previously explained); ├─ pytorch [1.13.0|1.13.1|...|2.3.1], which can be installed (as previously explained); ├─ pytorch [2.1.0|2.1.2|...|2.3.1], which can be installed (as previously explained); └─ pytorch [2.3.0|2.3.1] conflicts with any installable versions previously reported. [rapids-conda-retry] conda returned exit code: 1 [rapids-conda-retry] Exiting, no retryable mamba errors detected: 'ChecksumMismatchError:', 'ChunkedEncodingError:', 'CondaHTTPError:', 'CondaMultiError:', 'Connection broken:', 'ConnectionError:', 'DependencyNeedsBuildingError:', 'EOFError:', 'JSONDecodeError:', 'Multi-download failed', 'Timeout was reached', segfault exit code 139 [rapids-conda-retry ```
how to reproduce this (click me)

```shell
docker run \
    --rm \
    --gpus 1 \
    --env CI=false \
    --env RAPIDS_BUILD_TYPE="pull-request" \
    --env RAPIDS_REPOSITORY="rapidsai/cugraph" \
    --env RAPIDS_REF_NAME=pull-request/4690 \
    --env RAPIDS_SHA=922571b6db5f721a287897b3c5acc81b3fe07f69 \
    -v $(pwd):/opt/work \
    -w /opt/work \
    --network host \
    -it rapidsai/ci-conda:cuda11.8.0-rockylinux8-py3.10 \
    bash

RAPIDS_VERSION_MAJOR_MINOR="$(rapids-version-major-minor)"

rapids-logger "Downloading artifacts from previous jobs"
CPP_CHANNEL=$(rapids-download-conda-from-s3 cpp)
PYTHON_CHANNEL=$(rapids-download-conda-from-s3 python)

rapids-logger "Generate Python testing dependencies"
rapids-dependency-file-generator \
    --output conda \
    --file-key test_python \
    --matrix "cuda=${RAPIDS_CUDA_VERSION%.*};arch=$(arch);py=${RAPIDS_PY_VERSION}" | tee env.yaml

rapids-mamba-retry env create --yes -f env.yaml -n test_cugraph_pyg

conda activate test_cugraph_pyg

CONDA_CUDA_VERSION="11.8"
PYG_URL="https://data.pyg.org/whl/torch-2.3.0+cu118.html"

rapids-mamba-retry install \
    --channel "${CPP_CHANNEL}" \
    --channel "${PYTHON_CHANNEL}" \
    --channel pyg \
    "cugraph-pyg=${RAPIDS_VERSION_MAJOR_MINOR}.*" \
    "pytorch>=2.3,<2.4" \
    "ogb"
```
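The conflict in the solver trace can be restated as a small check. This is only a sketch of the reasoning, not how conda's solver works: the list of PyTorch series is read off the trace above (the `pyg` 2.5.x builds require `pytorch 1.12.*` through `2.2.*`), and `satisfies` is a hypothetical helper introduced for illustration.

```python
# PyTorch series that the pyg 2.5.x conda packages pin to, as listed in the
# solver output above (pytorch 1.12.*, 1.13.*, 2.0.*, 2.1.*, 2.2.*).
PYG_PYTORCH_SERIES = [(1, 12), (1, 13), (2, 0), (2, 1), (2, 2)]

# The constraint requested in CI: pytorch>=2.3,<2.4
LOWER, UPPER = (2, 3), (2, 4)


def satisfies(series, lower=LOWER, upper=UPPER):
    """Return True if a (major, minor) series falls inside [lower, upper)."""
    return lower <= series < upper


# Keep only the pyg-supported series that also meet the CI constraint.
compatible = [s for s in PYG_PYTORCH_SERIES if satisfies(s)]
print(compatible)  # empty list: no pyg build accepts a PyTorch in that range
```

Because the intersection is empty, any environment that requires both `pyg>=2.5,<2.6` from the `pyg` channel and `pytorch>=2.3,<2.4` has no solution.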

This only shows up in the conda-python-tests / 12.5.1, 3.12, amd64, ubuntu22.04, v100, latest-driver, latest-deps job, because that's the only one where cugraph-pyg installation is currently tested on PRs.

https://github.com/rapidsai/cugraph/blob/5fad4356729482ae5d4843a6e74330f3aa81a59c/ci/test_python.sh#L187-L189

The PyTorch floor here was raised to pytorch>=2.3,<2.4 in #4615. Logs from that CI job on that PR show the issue:

cugraph                   24.12.00a16     cuda11_py310_240928_g59f70dd1b_16    rapidsai-nightly
cugraph-pyg               24.12.00a16     py310_240928_g59f70dd1b_16    rapidsai-nightly
...
pyg                       2.5.2           py310_torch_2.1.0_cpu    pyg
...
pytorch                   2.1.2           cuda118_py310h6f85f1b_304    conda-forge

(build link)

cugraph-pyg==24.12.* at that point still allowed pytorch==2.1.* to be installed, which let conda find a solution with pyg.
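Spot-checking logs for this kind of silent version drift can be automated. The following is a rough sketch, assuming the job was expected to install 24.10 packages and that the log has `conda list`-style columns (name, version, build, channel); `find_unexpected_versions` is a hypothetical helper, not part of any RAPIDS tooling.

```python
def find_unexpected_versions(listing, expected_prefix, packages):
    """Return {name: version} for tracked packages whose version does not
    start with expected_prefix, given `conda list`-style text."""
    unexpected = {}
    for line in listing.splitlines():
        fields = line.split()
        # Skip separators like "..." and anything without name + version.
        if len(fields) < 2:
            continue
        name, version = fields[0], fields[1]
        if name in packages and not version.startswith(expected_prefix):
            unexpected[name] = version
    return unexpected


# Lines copied from the CI log quoted above.
LOG = """\
cugraph                   24.12.00a16     cuda11_py310_240928_g59f70dd1b_16    rapidsai-nightly
cugraph-pyg               24.12.00a16     py310_240928_g59f70dd1b_16    rapidsai-nightly
...
pyg                       2.5.2           py310_torch_2.1.0_cpu    pyg
...
pytorch                   2.1.2           cuda118_py310h6f85f1b_304    conda-forge
"""

drifted = find_unexpected_versions(LOG, "24.10.", {"cugraph", "cugraph-pyg"})
print(drifted)
```

On the quoted log this reports both `cugraph` and `cugraph-pyg` at `24.12.00a16`.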

So what can we do?

Ideally, there would be pyg packages supporting pytorch>=2.3. It seemed like this PR from some months ago might have added that: https://github.com/pyg-team/pytorch_geometric/pull/9240.

But there are no PyTorch 2.3 conda packages up at https://anaconda.org/pyg/pyg/files?page=3&version=2.5.2&sort=basename&sort_order=desc.


The options I can think of:

jameslamb commented 1 month ago

update on https://github.com/rapidsai/cugraph/pull/4690#issuecomment-2392357607

After offline discussion with @alexbarghi-nv @jakirkham @tingyu66 , we decided to replace uses of pyg::pyg conda packages with conda-forge::pytorch_geometric.

commit: https://github.com/rapidsai/cugraph/pull/4690/commits/f267c771707d4007c6869b4a0a79feb3e0c27700

They're built from the same sources, and conda-forge::pytorch_geometric is a noarch package without an explicit PyTorch constraint.

jameslamb commented 1 month ago

All of the build and test jobs are now passing, and spot-checking the logs it looks to me like they're using the correct, expected versions of dependencies 🎉

The docs-build is broken, like this:

Extension error (sphinx.ext.autosummary): Handler <function process_generate_options at 0x7f0c6433e4d0> for event 'builder-inited' threw an exception (exception: no module named cugraph_dgl.convert)

(build link)

The most recent docs build (yesterday) did "succeed" .... but only by using 24.08 packages 😱

  + pytorch                                 2.1.2  cuda118_py310h6f85f1b_304          conda-forge            27MB
  ...
  + dgl                                     1.1.3  cuda112py310hdbdccad_2             conda-forge            44MB
  ...
  + pyg                                     2.5.2  py310_torch_2.1.0_cpu              pyg                     1MB
  ...
  + cugraph                              24.08.00  cuda11_py310_240808_gfc880db0c_0   rapidsai                2MB
  + cugraph-service-server               24.08.00  py310_240808_gfc880db0c_0          rapidsai               44kB
  + cugraph-pyg                          24.08.00  py310_240808_gfc880db0c_0          rapidsai              142kB
  + cugraph-dgl                          24.08.00  py310_0                            rapidsai              122kB

(build link)

It's showing up as a failure now because this PR prevents conda from using non-24.10 RAPIDS packages.

In my experience with sphinx, this type of "no module" error often means "there was an ImportError when trying to import that module", which can point to these other explanations:

There absolutely is a cugraph_dgl.convert module: https://github.com/rapidsai/cugraph/blob/branch-24.10/python/cugraph-dgl/cugraph_dgl/convert.py

I was able to reproduce this locally on an x86_64 machine with CUDA 12.2, and that revealed the real issue.

code to do that (click me)

```shell
docker run \
    --rm \
    --gpus 1 \
    --env CI=false \
    --env RAPIDS_BUILD_TYPE="pull-request" \
    --env RAPIDS_REPOSITORY="rapidsai/cugraph" \
    --env RAPIDS_REF_NAME=pull-request/4690 \
    --env RAPIDS_SHA=f267c771707d4007c6869b4a0a79feb3e0c27700 \
    -v $(pwd):/opt/work \
    -w /opt/work \
    --network host \
    -it rapidsai/ci-conda:cuda11.8.0-ubuntu22.04-py3.10 \
    bash

RAPIDS_VERSION_MAJOR_MINOR="$(rapids-version-major-minor)"

CPP_CHANNEL=$(rapids-download-conda-from-s3 cpp)
PYTHON_CHANNEL=$(rapids-download-conda-from-s3 python)

rapids-dependency-file-generator \
    --output conda \
    --file-key docs \
    --matrix "cuda=${RAPIDS_CUDA_VERSION%.*};arch=$(arch);py=${RAPIDS_PY_VERSION}" | tee env.yaml

rapids-mamba-retry env create --yes -f env.yaml -n docs
conda activate docs

if [[ "${RAPIDS_CUDA_VERSION}" == "11.8.0" ]]; then
    CONDA_CUDA_VERSION="11.8"
    DGL_CHANNEL="dglteam/label/cu118"
else
    CONDA_CUDA_VERSION="12.1"
    DGL_CHANNEL="dglteam/label/cu121"
fi

rapids-mamba-retry install \
    --channel "${CPP_CHANNEL}" \
    --channel "${PYTHON_CHANNEL}" \
    --channel conda-forge \
    --channel nvidia \
    --channel "${DGL_CHANNEL}" \
    "libcugraph=${RAPIDS_VERSION_MAJOR_MINOR}.*" \
    "pylibcugraph=${RAPIDS_VERSION_MAJOR_MINOR}.*" \
    "cugraph=${RAPIDS_VERSION_MAJOR_MINOR}.*" \
    "cugraph-pyg=${RAPIDS_VERSION_MAJOR_MINOR}.*" \
    "cugraph-dgl=${RAPIDS_VERSION_MAJOR_MINOR}.*" \
    "cugraph-service-server=${RAPIDS_VERSION_MAJOR_MINOR}.*" \
    "cugraph-service-client=${RAPIDS_VERSION_MAJOR_MINOR}.*" \
    "libcugraph_etl=${RAPIDS_VERSION_MAJOR_MINOR}.*" \
    "pylibcugraphops=${RAPIDS_VERSION_MAJOR_MINOR}.*" \
    "pylibwholegraph=${RAPIDS_VERSION_MAJOR_MINOR}.*" \
    pytorch \
    "cuda-version=${CONDA_CUDA_VERSION}"

python -c "import cugraph_dgl.convert"
```

DGL backend not selected or invalid. Assuming PyTorch for now.
Setting the default backend to "pytorch". You can change it in the ~/.dgl/config.json file or export the DGLBACKEND environment variable. Valid options are: pytorch, mxnet, tensorflow (all lowercase)
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/opt/conda/envs/docs/lib/python3.10/site-packages/cugraph_dgl/__init__.py", line 18, in <module>
    from cugraph_dgl.graph import Graph
  ...
  File "/opt/conda/envs/docs/lib/python3.10/site-packages/cugraph/utilities/utils.py", line 410, in __getattr__
    raise RuntimeError(f"This feature requires the {self.name} " "package/module")
RuntimeError: This feature requires the dgl package/module

Following the code shared above, that can be reproduced without actually invoking sphinx-build:

python -c "import cugraph_dgl.convert"

Walking down the trace:

python -c "import dgl"

ModuleNotFoundError: No module named 'torchdata'

conda install -c conda-forge torchdata
python -c "import dgl"

ModuleNotFoundError: No module named 'pydantic'

conda install -c conda-forge pydantic
python -c "import dgl"

FileNotFoundError: Cannot find DGL C++ graphbolt library at /opt/conda/envs/docs/lib/python3.10/site-packages/dgl/graphbolt/libgraphbolt_pytorch_2.3.1.post300.so
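The import-by-import walk above can be done in one pass. This is a minimal sketch, assuming it is run inside the same `docs` environment; `check_imports` is a hypothetical helper, and the module names are simply the ones that came up in the steps above.

```python
import importlib


def check_imports(module_names):
    """Try importing each module; map name -> None on success, or a short
    'ExceptionType: message' string describing why the import failed."""
    results = {}
    for name in module_names:
        try:
            importlib.import_module(name)
            results[name] = None
        except Exception as exc:  # report any failure, not only ImportError
            results[name] = f"{type(exc).__name__}: {exc}"
    return results


if __name__ == "__main__":
    # Module names taken from the steps above.
    modules = ["torchdata", "pydantic", "dgl", "cugraph_dgl.convert"]
    for name, error in check_imports(modules).items():
        print(f"{name}: {error or 'ok'}")
```

Catching `Exception` rather than only `ImportError` matters here, since the failures seen above include a `RuntimeError` and a `FileNotFoundError` raised during import.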

So what do we do?

I'm not sure.

Looks like dgl's dependency on torchdata was removed in August:

Those seem to have not made it in until v2.4.0 (https://github.com/dmlc/dgl/releases/tag/v2.4.0)

Here in cugraph's CI, we're getting v2.1.0

  + dgl                                2.1.0.cu118  py310_0                              dglteam/label/cu118      606MB

(build link)

I'm not sure how to fix this. The cu118 label for this package doesn't have packages newer than v2.1.0:

https://anaconda.org/dglteam/dgl/files?version=&channel=cu118

Maybe we want the th23_cu118 label instead, now that cugraph is using PyTorch 2.3?

https://anaconda.org/dglteam/dgl/files?version=2.4.0.th23.cu118

jameslamb commented 1 month ago

Summarizing recent commits:

dgl appears to have changed its versioning scheme for conda packages. The latest release of dgl (v2.4.0) has not been published to the dglteam channel under the main tag... there are now tags and version numbers that encode the supported PyTorch version and CUDA version.

Here in the 24.10 release of cugraph-dgl we want to support PyTorch 2.3 and CUDA 11.8, so I've switched cugraph-dgl to this runtime requirement:

dgl >= 2.4.0.th23.cu*

and requiring this label on the dglteam channel

--channel dglteam/label/th23_cu118
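For illustration, the label could be derived from the PyTorch and CUDA versions being targeted. This is a sketch under a stated assumption: the `th<torch>_cu<cuda>` pattern is inferred only from the `th23_cu118` label mentioned in this thread, so other combinations may not exist on the channel. `dgl_channel_label` is a hypothetical helper.

```python
def dgl_channel_label(pytorch_version, cuda_version):
    """Build a dglteam channel label such as 'dglteam/label/th23_cu118'
    from 'major.minor[.patch]' PyTorch and CUDA version strings.

    Assumption: the label pattern is inferred from a single observed label.
    """
    th = "".join(pytorch_version.split(".")[:2])  # "2.3"  -> "23"
    cu = "".join(cuda_version.split(".")[:2])     # "11.8" -> "118"
    return f"dglteam/label/th{th}_cu{cu}"


print(dgl_channel_label("2.3", "11.8"))  # dglteam/label/th23_cu118
```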

As @alexbarghi-nv pointed out to me, something similar is being done in cugraph-gnn already: https://github.com/rapidsai/cugraph-gnn/pull/10

For wheels, I've updated the cugraph-dgl wheels' dependency on dgl (only enforced via a pip install in a script, not wheel metadata) from dgl==2.0.0 to dgl==2.2.1... the latest version that wheels have been published for.

jameslamb commented 1 month ago

I'm going to merge this. It has a lot of approvals, CI is all passing, and I spot-checked CI logs for builds and tests and saw all the things we're expecting... latest nightlies of cugraph, nx-cugraph, cudf, etc., PyTorch 2.3, and numpy 2.x.

Thanks for the help everyone!

jameslamb commented 1 month ago

/merge

jakirkham commented 1 month ago

Thanks James! 🙏