Open JuanPedroGHM opened 3 months ago
@JuanPedroGHM That's interesting. Actually, I have just used vmap on up to 12 GPU-nodes without any problems. Is this problem related to OpenMPI >= 4.1 specifically?
No, this is with OpenMPI 4.1 and mpi4py 3.1.6. I updated the description of the issue with the specific dependencies and configuration where it fails.
This issue is stale because it has been open for 60 days with no activity.
What happened?
When running on more than one node and using GPUs at the same time, test_vmap fails. Needs further investigation.
Code snippet triggering the error
On Horeka, using accelerated nodes, the test fails when run on 2 nodes with 3 or 4 ranks each.
HEAT_TEST_USE_DEVICE=gpu mpirun --report-bindings -N 3/4 pytest heat/core/tests/test_vmap.py
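The `-N 3/4` in the command above appears to be shorthand for two separate runs, one with `-N 3` and one with `-N 4` ranks per node. A minimal sketch of both invocations follows, assuming it is started inside a 2-node GPU allocation. The function name `run_vmap_repro` and the `DRY_RUN` switch are hypothetical helpers for illustration and are not part of Heat or Open MPI.

```shell
#!/bin/sh
# Sketch: expand the "-N 3/4" shorthand into one mpirun call per rank count.
# run_vmap_repro and DRY_RUN are hypothetical names, not part of Heat or Open MPI.
run_vmap_repro() {
    for ranks_per_node in 3 4; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            # Default: only print the command that would be executed.
            echo "HEAT_TEST_USE_DEVICE=gpu mpirun --report-bindings -N ${ranks_per_node} pytest heat/core/tests/test_vmap.py"
        else
            # Launch the test; the loop continues even if this run fails.
            HEAT_TEST_USE_DEVICE=gpu mpirun --report-bindings -N "${ranks_per_node}" pytest heat/core/tests/test_vmap.py
        fi
    done
}

run_vmap_repro
```

By default the sketch only prints the two commands. Setting `DRY_RUN=0` before running it would launch both configurations one after the other.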
Error message or erroneous outcome
The result of the test does not match the expected outcome.
Version
main (development branch)
Python version
3.11.2
PyTorch version
2.2.2
Cuda version
12.2
MPI version