rapidsai / wholegraph

WholeGraph - large scale Graph Neural Networks
https://docs.rapids.ai/api/cugraph/stable/wholegraph/
Apache License 2.0

bump NCCL floor to 2.18.1.1, relax PyTorch pin #218

Closed by jameslamb 1 month ago

jameslamb commented 2 months ago

Contributes to https://github.com/rapidsai/build-planning/issues/102

Fixes #217

Notes for Reviewers

How I tested this

Temporarily added a CUDA 11.4.3 test job to this PR's CI (same specs as the failing nightly) by pointing at the branch from https://github.com/rapidsai/shared-workflows/pull/246.

Observed the exact same CUDA 11.4 failures that were reported in https://github.com/rapidsai/build-planning/issues/102.

...
  + nccl                     2.10.3.1  hcad2f07_0                  rapidsai-nightly     125MB
...
./WHOLEGRAPH_CSR_WEIGHTED_SAMPLE_WITHOUT_REPLACEMENT_TEST: symbol lookup error: /opt/conda/envs/test/bin/gtests/libwholegraph/../../../lib/libwholegraph.so: undefined symbol: ncclCommSplit
sh -c exec "$0" ./WHOLEMEMORY_HANDLE_TEST 
./WHOLEMEMORY_HANDLE_TEST: symbol lookup error: /opt/conda/envs/test/bin/gtests/libwholegraph/../../../lib/libwholegraph.so: undefined symbol: ncclCommSplit
sh -c exec "$0" ./GRAPH_APPEND_UNIQUE_TEST 

(build link)
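
For anyone wanting to reproduce the check locally, the failure above comes down to whether the NCCL library found at runtime exports `ncclCommSplit`. A minimal sketch of that check with `ctypes` follows. The helper name `has_symbol` is hypothetical (not part of wholegraph), and the fallback library name `libnccl.so.2` is an assumption about how NCCL is installed in the environment.

```python
import ctypes
import ctypes.util


def has_symbol(library, symbol):
    """Return True if `library` can be loaded and exports `symbol`.

    `has_symbol` is a hypothetical helper for illustration only.
    """
    try:
        lib = ctypes.CDLL(library)
    except OSError:
        # The library could not be found or loaded at all.
        return False
    # ctypes resolves symbols lazily: a missing symbol raises AttributeError,
    # which hasattr() reports as False.
    return hasattr(lib, symbol)


if __name__ == "__main__":
    # Assumption: NCCL is discoverable by the loader, or is named libnccl.so.2.
    nccl = ctypes.util.find_library("nccl") or "libnccl.so.2"
    print("ncclCommSplit available:", has_symbol(nccl, "ncclCommSplit"))
```

With the `nccl 2.10.3.1` package from the log above, this would be expected to report that the symbol is unavailable, matching the `undefined symbol: ncclCommSplit` errors.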

Pushed a commit adding a floor of nccl>=2.18.1.1. Saw all tests pass with CUDA 11.4 😁

...
  + nccl                     2.22.3.1  hee583db_1                  conda-forge          131MB
...
(various log messages showing all tests passed)

(build link)
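
The before/after comparison can also be done mechanically from the conda solver output. The sketch below pulls the NCCL version out of a log line like the ones quoted above and compares it against the `2.18.1.1` floor from this PR. The function names are hypothetical, and the version parsing assumes purely numeric, dot-separated versions (true for the two versions shown here, not for every conda version string).

```python
def parse_version(version):
    """Turn '2.18.1.1' into (2, 18, 1, 1). Assumes numeric dot-separated parts."""
    return tuple(int(part) for part in version.split("."))


def installed_version(log_line, package="nccl"):
    """Extract the version from a conda line: '+ <name> <version> <build> ...'.

    Returns None if the line does not describe `package`.
    """
    fields = log_line.split()
    if len(fields) >= 3 and fields[0] == "+" and fields[1] == package:
        return fields[2]
    return None


def meets_floor(version, floor="2.18.1.1"):
    """Return True if `version` is at or above the floor added in this PR."""
    return parse_version(version) >= parse_version(floor)


if __name__ == "__main__":
    before = "  + nccl                     2.10.3.1  hcad2f07_0                  rapidsai-nightly     125MB"
    after = "  + nccl                     2.22.3.1  hee583db_1                  conda-forge          131MB"
    for line in (before, after):
        version = installed_version(line)
        print(version, "meets floor:", meets_floor(version))
```

A real pipeline would more likely rely on the solver itself to enforce `nccl>=2.18.1.1`; this is only a way to read the logs.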

jameslamb commented 2 months ago

Thanks!

@linhu-nv could you please review here?

jameslamb commented 1 month ago

/merge