@fzago-cray Thanks for reporting this. It seems the verbs/RDM provider is fragile here and can't tolerate this event. I guess it should handle it and try to resolve the address once again.
But there is a workaround. I've prepared a patch (#3460) that introduces the FI_VERBS_RDM_RESOLVE_ADDR_TIMEOUT env variable (in milliseconds). Could you play with this variable, please? The default value is 30 000 ms (30 seconds).
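For reference, here is a minimal sketch (not the provider's actual code) of where such a timeout typically ends up: in librdmacm the resolve timeout is the last argument of rdma_resolve_addr(), in milliseconds, and an env variable like this one would presumably be read and passed through there:

```c
/* Minimal sketch only -- not the fi_ibv_rdm code. */
#include <stdlib.h>
#include <rdma/rdma_cma.h>

static int start_resolve(struct rdma_cm_id *id, struct sockaddr *dst)
{
	int timeout_ms = 30000;		/* assumed default: 30 seconds */
	const char *env = getenv("FI_VERBS_RDM_RESOLVE_ADDR_TIMEOUT");

	if (env)
		timeout_ms = atoi(env);

	/* A return of 0 only means the request was posted; the actual result
	 * arrives later as RDMA_CM_EVENT_ADDR_RESOLVED or _ADDR_ERROR. */
	return rdma_resolve_addr(id, NULL, dst, timeout_ms);
}
```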
Sorry, I'm not very familiar with the SLURM job manager. Is my understanding correct that your command runs only 90 SLURM tasks (MPI ranks)? How are the processes pinned to nodes in your run? Will it always be 2 tasks per node, or can SLURM place them randomly without any particular pinning?
The task breaks in about 10 seconds, so this timeout (or the others) is not triggered.
That test runs 90 ranks on 45 nodes, so 2 ranks per node. AFAIK workload managers always try to balance tasks across nodes; beyond that, I think the ranks are distributed randomly.
If I run the test with 15 nodes only, it passes all the time. MPIR_CVAR_OFI_USE_PROVIDER=verbs srun -N 15 osu_alltoall
With one more rank (16 nodes), it fails most of the time.
Only 16 nodes? That's a very bad result. But we need to know the number of tasks per node in this case. alltoall will consume (N-1) connections for each MPI task, where N is the number of tasks (e.g., with 90 tasks, every rank has to establish up to 89 connections).
There is 1 task per node in that case.
Thanks. That's very bad :( I need to reproduce this and play with it.
Have you tried playing with the timeout for address resolution?
No, since the timeout is set to 30 seconds and the task breaks in under 10.
I added more traces. It appears that although rdma_resolve_addr() returns success, the reply is an address error event:
179934:libfabric:verbs:av:fi_ibv_rdm_start_connection():84<info> Attempt to start connection with addr 10.13.1.2:42586
179934:libfabric:verbs:av:fi_ibv_rdm_process_event():864<info> got unexpected rdmacm event, RDMA_CM_EVENT_ADDR_ERROR
179934:libfabric:verbs:ep_data:fi_ibv_rdm_cm_progress_thread():218<info> fi_ibv_rdm_cm_progress error
/usr/bin/sh: line 1: 179934 Aborted (core dumped) ./mpi/collective/osu_alltoall
That seems to point to an IB address resolution error.
Is it possible to retry the address resolution a few times? Same for route resolution.
Yes, it's possible. I'm going to implement it
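Roughly, a retry could look like the following sketch (plain librdmacm, a hypothetical helper, not the actual fi_ibv_rdm code, which drives events from a progress thread): post rdma_resolve_addr(), wait for the CM event, and on RDMA_CM_EVENT_ADDR_ERROR back off and try again a bounded number of times instead of treating the event as fatal:

```c
/* Hedged sketch only: bounded retry of address resolution with plain,
 * blocking librdmacm calls. The real provider is asynchronous, but the
 * idea is the same: RDMA_CM_EVENT_ADDR_ERROR triggers another attempt
 * instead of an abort(). */
#include <unistd.h>
#include <rdma/rdma_cma.h>

static int resolve_addr_with_retry(struct rdma_cm_id *id, struct sockaddr *dst,
				   int timeout_ms, int max_retries)
{
	struct rdma_cm_event *event;
	enum rdma_cm_event_type type;
	int ret, attempt;

	for (attempt = 0; attempt <= max_retries; attempt++) {
		ret = rdma_resolve_addr(id, NULL, dst, timeout_ms);
		if (ret)
			return ret;

		/* The outcome of the resolution is delivered as a CM event. */
		ret = rdma_get_cm_event(id->channel, &event);
		if (ret)
			return ret;

		type = event->event;
		rdma_ack_cm_event(event);

		if (type == RDMA_CM_EVENT_ADDR_RESOLVED)
			return 0;		/* resolved; go on to route/connect */
		if (type != RDMA_CM_EVENT_ADDR_ERROR)
			return -1;		/* some other unexpected event */

		usleep(1000 << attempt);	/* simple backoff before retrying */
	}

	return -1;				/* give up after max_retries attempts */
}
```

The same pattern would apply to rdma_resolve_route() and RDMA_CM_EVENT_ROUTE_ERROR.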
Just out of curiosity, how does the RxM/verbs provider behave with this test?
Thanks.
MPICH asserts when using the RxM/verbs provider, because ofi_rxm doesn't provide what it needs. Last time we checked, ofi_rxm didn't yet support FI_MULTI_RECV and FI_ATOMIC, both of which MPICH ch4:ofi requires. There might be more.
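For context, this is roughly how a libfabric consumer such as MPICH asks for those capabilities; if no provider advertises all of the requested caps, fi_getinfo() simply returns no match (a simplified sketch under that assumption, not MPICH's actual code):

```c
/* Simplified sketch of requesting capabilities from libfabric; not the
 * actual MPICH ch4:ofi code. A provider that cannot satisfy the requested
 * caps is not returned by fi_getinfo(). */
#include <rdma/fabric.h>
#include <rdma/fi_errno.h>

static int find_provider(struct fi_info **info)
{
	struct fi_info *hints = fi_allocinfo();
	int ret;

	if (!hints)
		return -FI_ENOMEM;

	hints->ep_attr->type = FI_EP_RDM;
	hints->caps = FI_MSG | FI_TAGGED | FI_RMA |
		      FI_ATOMIC | FI_MULTI_RECV;	/* caps ch4:ofi relies on */

	ret = fi_getinfo(FI_VERSION(1, 5), NULL, NULL, 0, hints, info);
	fi_freeinfo(hints);
	return ret;
}
```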
ch4:ofi
I didn't know that you're using ch4 now. Thanks!
The underlying cause was an IB misconfiguration, with unreachable nodes. Sorry. While working on that, I made a couple of cleanups for which I'll open a PR.
Thanks for the investigation. Sure, that would be great 👍
When running:
I hit an abort() on a few nodes:
The EP gets a RDMA_CM_EVENT_ADDR_ERROR event, which trickles down to fi_ibv_rdm_cm_progress as an EADDRNOTAVAIL error.
I don't see why I get a RDMA_CM_EVENT_ADDR_ERROR event; it could be because of the hardware setup, a weak SM, or something else.
Anyway, libfabric shouldn't abort. Ideally it would retry the connection it was attempting, since that test sometimes passes, or at least fail gracefully. If I remove the abort(), the task appears to hang, possibly looping somewhere.
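For what it's worth, the "fail gracefully" option could look something like the sketch below (hypothetical function, not the actual provider code): map the CM event to a libfabric error code in the progress path and let the pending connection fail with that error, instead of aborting the whole process:

```c
/* Hypothetical sketch: translate rdmacm events into errno-style libfabric
 * codes instead of calling abort(). The function name is illustrative,
 * not an existing fi_ibv_rdm symbol. */
#include <rdma/rdma_cma.h>
#include <rdma/fi_errno.h>

static int handle_cm_event(struct rdma_cm_event *event)
{
	switch (event->event) {
	case RDMA_CM_EVENT_ADDR_RESOLVED:
	case RDMA_CM_EVENT_ROUTE_RESOLVED:
	case RDMA_CM_EVENT_ESTABLISHED:
		return 0;			/* normal connection progress */
	case RDMA_CM_EVENT_ADDR_ERROR:
		return -FI_EADDRNOTAVAIL;	/* report (or retry) the resolve */
	case RDMA_CM_EVENT_ROUTE_ERROR:
	case RDMA_CM_EVENT_UNREACHABLE:
		return -FI_EHOSTUNREACH;	/* peer not reachable */
	default:
		return -FI_EOTHER;		/* unexpected, but not fatal */
	}
}
```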