helmholtz-analytics / heat

Distributed tensors and Machine Learning framework with GPU and MPI acceleration in Python
https://heat.readthedocs.io/
MIT License

[Bug]: `test_random` fails on AMD GPU #1682

Open ClaudiaComito opened 1 month ago

ClaudiaComito commented 1 month ago

What happened?

Our tests on the AMD ROCm runner have been failing at `test_random` in the 2-process GPU tests.

The failure is triggered by one of the many `dndarray.numpy()` calls, each of which in turn calls `Allgather` or `Allgatherv`.
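
For readers less familiar with the collectives named above, here is a minimal single-process sketch of what an Allgatherv-style gather produces when the per-rank chunks have unequal sizes. It uses plain Python lists and no MPI, GPU, or Heat code, so it only illustrates the semantics and is not a reproducer for this failure. The helper name `simulate_allgatherv` is hypothetical and is not part of Heat or mpi4py.

```python
def simulate_allgatherv(local_chunks):
    """Simulate Allgatherv semantics in one process (hypothetical helper).

    local_chunks[r] stands in for the send buffer of rank r; chunks may
    have different lengths. Returns (counts, displs, receive buffers).
    """
    # Number of elements contributed by each rank
    counts = [len(chunk) for chunk in local_chunks]
    # Offset of each rank's chunk inside the gathered buffer
    displs = [sum(counts[:r]) for r in range(len(counts))]
    total = sum(counts)

    recv_buffers = []
    for _ in local_chunks:  # every rank ends up with the same full buffer
        recv = [None] * total
        for r, chunk in enumerate(local_chunks):
            recv[displs[r]:displs[r] + counts[r]] = chunk
        recv_buffers.append(recv)
    return counts, displs, recv_buffers


# Two "ranks" holding an uneven split of five elements
counts, displs, gathered = simulate_allgatherv([[0, 1, 2], [3, 4]])
print(counts, displs, gathered[0])  # [3, 2] [0, 3] [0, 1, 2, 3, 4]
```

With equal chunk sizes the counts and displacements are uniform, which is the case a plain Allgather covers; the uneven case is where the counts and displacements have to be supplied explicitly.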

Code snippet triggering the error

No response

Error message or erroneous outcome

No response

Version

main (development branch)

Python version

None

PyTorch version

None

MPI version

No response