(Summarizing some offline conversations, to get this into the public record here on GitHub)
For the last few days (unsure how long), CI jobs here targeting `branch-24.10` have been silently getting 24.12 nightly packages. This PR fixes that, and that fix is exposing a dependency conflict for `cugraph-pyg`.
`conda` cannot install `cugraph-pyg` and `pytorch>=2.3,<2.4` together, because there are not any `pyg` packages that support `pytorch>=2.3`.
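For reference, the conflict can be reproduced outside of CI with a dry-run solve. This is a minimal sketch, assuming roughly the same channels the CI jobs use (the channel list and exact specs here are illustrative, not copied from the CI config):

```shell
# try to solve cugraph-pyg together with the new PyTorch floor (dry run only;
# channels and specs are assumptions based on the logs above)
conda create --name repro --dry-run \
  --channel rapidsai-nightly \
  --channel conda-forge \
  --channel pyg \
  "cugraph-pyg=24.10.*" "pytorch>=2.3,<2.4"
# this fails to solve: every pyg build on the pyg channel is built against an
# older PyTorch (2.1.x in the CI logs below), so no combination satisfies both
```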
This only shows up in the `conda-python-tests / 12.5.1, 3.12, amd64, ubuntu22.04, v100, latest-driver, latest-deps` job, because that's the only one where `cugraph-pyg` installation is currently tested on PRs.
The PyTorch floor here was raised to `pytorch>=2.3,<2.4` in #4615. Logs from that CI job on that PR show the issue:
cugraph 24.12.00a16 cuda11_py310_240928_g59f70dd1b_16 rapidsai-nightly
cugraph-pyg 24.12.00a16 py310_240928_g59f70dd1b_16 rapidsai-nightly
...
pyg 2.5.2 py310_torch_2.1.0_cpu pyg
...
pytorch 2.1.2 cuda118_py310h6f85f1b_304 conda-forge
`cugraph-pyg==24.12.*` at that point still allowed `pytorch==2.1.*` to be installed, which let conda find a solution that includes `pyg`.
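One way to see why the solver could still pick `pyg` there is to inspect the run dependencies of the nightly `cugraph-pyg` builds. A sketch (the exact version and build strings will differ):

```shell
# show package metadata, including the pytorch run dependency,
# for the nightly cugraph-pyg builds
conda search "cugraph-pyg=24.12" --channel rapidsai-nightly --info
```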
Ideally, there would be `pyg` packages supporting `pytorch>=2.3`. It seemed like this PR from a few months ago might have added that: https://github.com/pyg-team/pytorch_geometric/pull/9240. But there are no PyTorch 2.3 conda packages up at https://anaconda.org/pyg/pyg/files?page=3&version=2.5.2&sort=basename&sort_order=desc.
The options I can think of:

- roll `cugraph-pyg` back to `pytorch>=2.2`
- hold the `cugraph-pyg=24.10` release until there are `pyg` packages supporting PyTorch 2.3
- build `pyg` packages supporting PyTorch 2.3 from source and host them on RAPIDS-controlled channels

Update: see https://github.com/rapidsai/cugraph/pull/4690#issuecomment-2392357607.
After offline discussion with @alexbarghi-nv @jakirkham @tingyu66, we decided to replace uses of `pyg::pyg` conda packages with `conda-forge::pytorch_geometric`.
commit: https://github.com/rapidsai/cugraph/pull/4690/commits/f267c771707d4007c6869b4a0a79feb3e0c27700
They're built from the same sources, and `conda-forge::pytorch_geometric` is a noarch package without an explicit PyTorch constraint.
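At install time, the swap looks roughly like this (a sketch for illustration only; the actual change is in the conda recipes and CI environment files in the commit linked above, and the version spec here is illustrative):

```shell
# before: PyG resolved from the pyg channel, with torch-version-specific builds
#   conda install --channel pyg --channel conda-forge "pyg=2.5.*"
# after: the same code, published on conda-forge as a noarch package
conda install --channel conda-forge "pytorch_geometric=2.5.*"
```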
All of the build and test jobs are now passing, and spot-checking the logs it looks to me like they're using the correct, expected versions of dependencies 🎉
The `docs-build` job is broken, like this:
Extension error (sphinx.ext.autosummary): Handler <function process_generate_options at 0x7f0c6433e4d0> for event 'builder-inited' threw an exception (exception: no module named cugraph_dgl.convert)
The most recent docs build (yesterday) did "succeed" .... but only by using 24.08 packages 😱
+ pytorch 2.1.2 cuda118_py310h6f85f1b_304 conda-forge 27MB
...
+ dgl 1.1.3 cuda112py310hdbdccad_2 conda-forge 44MB
...
+ pyg 2.5.2 py310_torch_2.1.0_cpu pyg 1MB
...
+ cugraph 24.08.00 cuda11_py310_240808_gfc880db0c_0 rapidsai 2MB
+ cugraph-service-server 24.08.00 py310_240808_gfc880db0c_0 rapidsai 44kB
+ cugraph-pyg 24.08.00 py310_240808_gfc880db0c_0 rapidsai 142kB
+ cugraph-dgl 24.08.00 py310_0 rapidsai 122kB
It's showing up as a failure now because this PR prevents conda from using non-24.10 RAPIDS packages.
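As a hypothetical illustration of what "prevents conda from using non-24.10 RAPIDS packages" means in practice (the exact files and syntax changed by this PR are in the diff, not reproduced here), the docs environment now has to be solvable with specs along these lines:

```shell
# illustrative only: every RAPIDS package is pinned to the branch's version series,
# so the solver can no longer silently fall back to 24.08 or jump ahead to 24.12
conda install --channel rapidsai-nightly --channel conda-forge \
  "cugraph=24.10.*" "cugraph-dgl=24.10.*" "cugraph-pyg=24.10.*" \
  "cugraph-service-server=24.10.*"
```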
In my experience with `sphinx`, this type of "no module" error often means "there was an `ImportError` when trying to import that module", rather than the module actually being missing. There absolutely is a `cugraph_dgl.convert` module: https://github.com/rapidsai/cugraph/blob/branch-24.10/python/cugraph-dgl/cugraph_dgl/convert.py
I was able to reproduce this locally on an x86_64 machine with CUDA 12.2, and that revealed the real issue.
DGL backend not selected or invalid. Assuming PyTorch for now. Setting the default backend to "pytorch". You can change it in the ~/.dgl/config.json file or export the DGLBACKEND environment variable. Valid options are: pytorch, mxnet, tensorflow (all lowercase)
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/opt/conda/envs/docs/lib/python3.10/site-packages/cugraph_dgl/__init__.py", line 18, in <module>
    from cugraph_dgl.graph import Graph
  ...
  File "/opt/conda/envs/docs/lib/python3.10/site-packages/cugraph/utilities/utils.py", line 410, in __getattr__
    raise RuntimeError(f"This feature requires the {self.name} " "package/module")
RuntimeError: This feature requires the dgl package/module
Following the code shared above, that can be reproduced without actually invoking `sphinx-build`:

python -c "import cugraph_dgl.convert"
Walking down the trace:
python -c "import dgl"
ModuleNotFoundError: No module named 'torchdata'
conda install -c conda-forge torchdata
python -c "import dgl"
ModuleNotFoundError: No module named 'pydantic'
conda install -c conda-forge pydantic
python -c "import dgl"
FileNotFoundError: Cannot find DGL C++ graphbolt library at /opt/conda/envs/docs/lib/python3.10/site-packages/dgl/graphbolt/libgraphbolt_pytorch_2.3.1.post300.so
I'm not sure.
Looks like `dgl`'s dependency on `torchdata` was removed in August. Those changes seem to not have made it into a release until `v2.4.0` (https://github.com/dmlc/dgl/releases/tag/v2.4.0).
Here in `cugraph`'s CI, we're getting `v2.1.0`:
+ dgl 2.1.0.cu118 py310_0 dglteam/label/cu118 606MB
I'm not sure how to fix this. The `cu118` label for this package doesn't have packages newer than v2.1.0: https://anaconda.org/dglteam/dgl/files?version=&channel=cu118
Maybe we want the `th23_cu118` label instead, now that `cugraph` is using PyTorch 2.3? https://anaconda.org/dglteam/dgl/files?version=2.4.0.th23.cu118
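A quick way to confirm what's actually published under that label (sketch; output abbreviated):

```shell
# list the dgl builds available under the PyTorch-2.3 / CUDA-11.8 label
conda search dgl --channel dglteam/label/th23_cu118
```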
Summarizing recent commits:
`dgl` appears to have changed its versioning scheme for conda packages. The latest release of `dgl` (v2.4.0) has not been published to the `dglteam` channel under the `main` label... they're now using labels and version numbers that encode the supported PyTorch version and CUDA version.
Here in the 24.10 release of `cugraph-dgl`, we want to support PyTorch 2.3 and CUDA 11.8, so I've switched `cugraph-dgl` to this runtime requirement:

dgl >= 2.4.0.th23.cu*

and we now require this label on the `dglteam` channel:

--channel dglteam/label/th23_cu118
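Putting those two pieces together, the resulting install looks roughly like this (a sketch; the real change lives in the `cugraph-dgl` conda recipe and CI environment files):

```shell
# use the PyTorch-2.3 / CUDA-11.8 label and the new runtime pin from above
conda install \
  --channel dglteam/label/th23_cu118 \
  --channel conda-forge \
  "dgl>=2.4.0.th23.cu*"
```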
As @alexbarghi-nv pointed out to me, something similar is being done in `cugraph-gnn` already: https://github.com/rapidsai/cugraph-gnn/pull/10
For wheels, I've updated the `cugraph-dgl` wheels' dependency on `dgl` (only enforced via a `pip` install in a script, not wheel metadata) from `dgl==2.0.0` to `dgl==2.2.1`... the latest version that wheels have been published for.
I'm going to merge this. It has a lot of approvals, CI is all passing, and I spot-checked CI logs for builds and tests and saw all the things we're expecting... latest nightlies of `cugraph`, `nx-cugraph`, `cudf`, etc., PyTorch 2.3, and `numpy` 2.x.
Thanks for the help everyone!
/merge
Thanks James! 🙏
We were pulling the wrong packages because the PyTorch version constraint wasn't tight enough. Hopefully these sorts of issues will be resolved in the `cugraph-gnn` repository going forward, where we can pin a specific PyTorch version for testing.