helmholtz-analytics / heat

Distributed tensors and Machine Learning framework with GPU and MPI acceleration in Python
https://heat.readthedocs.io/
MIT License

[Bug]: `test_vmap` fails on multi-node runs on hardware accelerators #1627

Open JuanPedroGHM opened 3 months ago

JuanPedroGHM commented 3 months ago

What happened?

`test_vmap` fails when it runs on more than one node with GPUs enabled. Needs further investigation.

Code snippet triggering the error

On HoreKa, using accelerated nodes, the test fails when run on 2 nodes with 3 or 4 ranks per node (`-N 3` or `-N 4` in the command below).

HEAT_TEST_USE_DEVICE=gpu mpirun --report-bindings -N 3/4 pytest heat/core/tests/test_vmap.py
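The `-N 3/4` in the command above reads as shorthand for two separate runs, one with 3 and one with 4 ranks per node. A minimal sketch that spells out both invocations, assuming that reading is right and that the 2-node allocation comes from the batch scheduler (neither is confirmed here; `build_command` is a hypothetical helper, not part of Heat):

```python
import shlex

# Test file taken from the command in this report.
TEST_PATH = "heat/core/tests/test_vmap.py"


def build_command(ranks_per_node: int) -> str:
    """Return the shell command for one run with the given ranks per node.

    Assumption: "-N 3/4" in the report means two runs, "-N 3" and "-N 4".
    """
    argv = [
        "mpirun",
        "--report-bindings",
        "-N",
        str(ranks_per_node),
        "pytest",
        TEST_PATH,
    ]
    # The environment variable selects the GPU backend for the Heat tests.
    return "HEAT_TEST_USE_DEVICE=gpu " + shlex.join(argv)


# One command per failing configuration (3 and 4 ranks per node).
commands = [build_command(n) for n in (3, 4)]

if __name__ == "__main__":
    for command in commands:
        print(command)
```

Each printed line can be pasted into a job script that already holds the 2-node allocation.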

Error message or erroneous outcome

The result of the test does not match the expected outcome.

FAILED heat/core/tests/test_vmap.py::TestVmap::test_vmap - AssertionError: False is not true

Version

main (development branch)

Python version

3.11.2

PyTorch version

2.2.2

CUDA version

12.2

MPI version

OpenMPI 4.1, 5.0
mpi4py 3.1.6, 4.0.0

mrfh92 commented 3 months ago

@JuanPedroGHM That's interesting. Actually, I have just used vmap on up to 12 GPU-nodes without any problems. Is this problem related to OpenMPI >= 4.1 specifically?

JuanPedroGHM commented 3 months ago

No, this is with OpenMPI 4.1 and mpi4py 3.1.6. I updated the description of the issue with the specific dependencies and configuration where it fails.

github-actions[bot] commented 1 month ago

This issue is stale because it has been open for 60 days with no activity.