Open benmenadue opened 3 years ago
@benmenadue For now, we'll continue working on it through Mellanox support as an HCOLL issue
@benmenadue is it possible to get a reproducer? How many nodes were used in the run?
You should be able to reproduce this by running https://github.com/BenWibking/quokka/blob/development/scripts/shell-64nodes.pbs (without the flag to disable hcoll multicast, of course). This was a 64-node run with 4 GPUs per node.
Is it possible to run the app on CPU? Is the crash only reproduced when running on GPU?
It can be run on CPU as well. I haven't tried running at that scale on CPU. The crashes also appear to be somewhat nondeterministic.
Describe the bug
Some applications are failing with a segfault in the hcoll callback mcast_ucx_recv_completion_cb. Reported to Mellanox Support (since that's part of hcoll, case 00956842), and they suggested opening this here as well. Traceback is:
Steps to Reproduce
For the above traceback:
mpirun -np 256 --map-by numa:SPAN --bind-to numa --mca pml ucx ...
Setup and versions
ibv_devinfo -vv (attached): ibv_devinfo.txt