openxla / xla

A machine learning compiler for GPUs, CPUs, and ML accelerators
Apache License 2.0

RCCL not recognized #8681

Open jglaser opened 5 months ago

jglaser commented 5 months ago

I get the following error in jax, even though RCCL is linked.

E0122 11:56:32.195687  107899 pjrt_stream_executor_client.cc:2766] Execution of replica 0 failed: UNIMPLEMENTED: XLA compiled without NCCL support
$ git rev-parse HEAD
f61f4721b948dd335bfa75851952bda977a3aae7
$ ldd /lustre/orion/stf006/world-shared/glaser/frontier-env/lib/python3.11/site-packages/jaxlib/xla_extension.so
    linux-vdso.so.1 (0x00007ffc1bdea000)
    libstdc++.so.6 => /opt/cray/pe/gcc/12.2.0/snos/lib64/libstdc++.so.6 (0x00007f4fda6e5000)
    libm.so.6 => /lib64/libm.so.6 (0x00007f4fda39a000)
    librccl.so.1 => /opt/rocm-5.6.0/lib/librccl.so.1 (0x00007f4fc93e0000)
    librocsolver.so.0 => /opt/rocm-5.6.0/lib/librocsolver.so.0 (0x00007f4f7c944000)
....
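
A quick way to confirm the linkage shown above without reading the `ldd` output by eye is to parse it. The following is a minimal sketch, not part of the original report: the helper names `linked_libraries` and `find_library` are hypothetical, and the sample text is copied from the output above. Note that a library being dynamically linked does not by itself show that XLA was built with collective support enabled; the error message suggests a build-time configuration issue, which this check cannot detect.

```python
import re


def linked_libraries(ldd_output: str) -> dict[str, str]:
    """Map each shared-library name in `ldd` output to its resolved path."""
    libs = {}
    for line in ldd_output.splitlines():
        # Matches lines such as:
        #   librccl.so.1 => /opt/rocm-5.6.0/lib/librccl.so.1 (0x...)
        # Lines without "=> /path" (e.g. linux-vdso, "not found") are skipped.
        match = re.match(r"\s*(\S+)\s+=>\s+(/\S+)", line)
        if match:
            libs[match.group(1)] = match.group(2)
    return libs


def find_library(ldd_output: str, prefix: str) -> list[str]:
    """Return resolved paths of linked libraries whose name starts with `prefix`."""
    return [
        path
        for name, path in linked_libraries(ldd_output).items()
        if name.startswith(prefix)
    ]


# Sample lines copied from the ldd output in this report.
SAMPLE = """\
    linux-vdso.so.1 (0x00007ffc1bdea000)
    libstdc++.so.6 => /opt/cray/pe/gcc/12.2.0/snos/lib64/libstdc++.so.6 (0x00007f4fda6e5000)
    librccl.so.1 => /opt/rocm-5.6.0/lib/librccl.so.1 (0x00007f4fc93e0000)
"""

if __name__ == "__main__":
    print(find_library(SAMPLE, "librccl"))
```

In practice the text would come from running `ldd` on `xla_extension.so`, for example via `subprocess.run(["ldd", path], capture_output=True, text=True).stdout`.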
i-chaochen commented 5 months ago

Thanks for the fix! May I ask which jax version you are using and how you tested RCCL?

RCCL support was there, but recent XLA has been changing rapidly in many of the collective-related parts. For example, in the following commits the RCCL-related parts were removed, restored, and changed again within a very short period:

  1. remove it https://github.com/openxla/xla/commit/d3eda6f1031c32650abea9e2a78b47fb23f65b51#diff-321e6254f7b7c81c8d35e8f03813117cfd3cbb23bc5a04aaf2b7a4b618d90582R18-R32
  2. rollback https://github.com/openxla/xla/commit/ef66a6be2ee14f580006cc29f78fa2b7aaa2c366#diff-321e6254f7b7c81c8d35e8f03813117cfd3cbb23bc5a04aaf2b7a4b618d90582R39-R49
  3. remove it again https://github.com/openxla/xla/commit/d3eda6f1031c32650abea9e2a78b47fb23f65b51#diff-321e6254f7b7c81c8d35e8f03813117cfd3cbb23bc5a04aaf2b7a4b618d90582R29-R30
  4. complete refactoring nccl_utils https://github.com/openxla/xla/commit/d06a70e5fa7965f391c61d97c3bb6886642511ee#diff-34945ddae255735b8c069c23cf053c7fb479c9de9217aebe942316e6b1330ca3

Given the situation, we are monitoring the refactoring, and once it is stable we will add it back in one go.

jglaser commented 5 months ago

Hi @i-chaochen ... this was tested using a local application that worked with jax 0.4.4, the last version I was able to use successfully with ROCm. I remember having to search for a working combination of jax branches and tensorflow forks, since there was no documentation or recommendation on which ones were current. After that, jax underwent many changes in the move from tensorflow to openxla and stopped working for a while. The jax and xla versions were the latest HEADs of main at the time of the PR (see the commit ID above). Luckily, it looks like most ROCm-related changes have been merged into jax/xla upstream, which makes it much easier to work from the official repositories. As mentioned in the PR, the only thing missing is a ROCm CI so that things don't keep breaking....
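
When reporting which versions are in use, the installed package versions can be read programmatically. This is a small sketch using only the Python standard library; the function name `installed_versions` is hypothetical. For builds from the HEAD of main, as described above, the version string alone may not identify the exact commit, so the `git rev-parse HEAD` output is still worth including.

```python
from importlib import metadata


def installed_versions(packages=("jax", "jaxlib")):
    """Return {package name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Package metadata is absent from the current environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```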

i-chaochen commented 5 months ago

We're working on upstreaming, but upstream changes are sometimes too frequent, and the process for getting PRs merged is not always as smooth as expected.

I recommend using our jax release, which will be easier for you: https://github.com/ROCmSoftwarePlatform/jax/releases

jglaser commented 5 months ago

> We're working on upstreaming, but upstream changes are sometimes too frequent, and the process for getting PRs merged is not always as smooth as expected.
>
> I recommend using our jax release, which will be easier for you: https://github.com/ROCmSoftwarePlatform/jax/releases

Good to know. On a side note, I can't use the latest release version here because I am actually interested in a recent feature (jax.experimental.shard_alike).