ofiwg / libfabric

Open Fabric Interfaces
http://libfabric.org/

prov/verbs: abort in fi_ibv_rdm_cm_progress_thread #3458

Closed fzago-cray closed 7 years ago

fzago-cray commented 7 years ago

When running:

$ MPIR_CVAR_OFI_USE_PROVIDER=verbs srun -n 90 -N 45 osu_alltoall

I hit an abort() on a few nodes:

(gdb) bt
#0  0x00007ffff3ea65f7 in raise () from /lib64/libc.so.6
#1  0x00007ffff3ea7ce8 in abort () from /lib64/libc.so.6
#2  0x00007ffff39de4c7 in fi_ibv_rdm_cm_progress_thread (dom=0x6a7690) at prov/verbs/src/verbs_domain.c:219
#3  0x00007ffff7bc6dc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007ffff3f6728d in lseek64 () from /lib64/libc.so.6
#5  0x0000000000000000 in ?? ()
(gdb) list
214             ep = container_of(item, struct fi_ibv_rdm_ep,
215                       list_entry);
216             if (fi_ibv_rdm_cm_progress(ep)) {
217                 VERBS_INFO (FI_LOG_EP_DATA,
218                             "fi_ibv_rdm_cm_progress error\n");
219                 abort();
220             }
221         }
222         usleep(domain->rdm_cm->cm_progress_timeout);
223     }

The EP gets a RDMA_CM_EVENT_ADDR_ERROR event, which trickles down to fi_ibv_rdm_cm_progress as an EADDRNOTAVAIL error.

I don't see why I get a RDMA_CM_EVENT_ADDR_ERROR event; it could be due to the hardware setup, a weak SM, or something else entirely.

Anyway, libfabric shouldn't abort. Ideally it would retry the connection it was attempting, since that test sometimes passes, or fail gracefully. If I remove the abort(), the task appears to hang, possibly looping somewhere.
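
Just to illustrate the idea (this is not the provider code, and the retry limit and timeout below are made-up numbers): a CM event handler built on the public rdma_cm API could retry the asynchronous address resolution a bounded number of times and fail only that connection instead of aborting the process.

#include <errno.h>
#include <stdint.h>
#include <sys/socket.h>
#include <rdma/rdma_cma.h>

#define RESOLVE_RETRIES     3      /* made-up limit */
#define RESOLVE_TIMEOUT_MS  2000   /* made-up timeout */

/* Handle one CM event for a connection being set up.  Returns 0 to keep
 * going, or a negative errno to fail just this connection (no abort()). */
static int on_cm_event(struct rdma_cm_event *event, struct sockaddr *dst)
{
    struct rdma_cm_id *id = event->id;
    /* For this sketch the retry count is stashed in the id's user
     * context; a real provider would keep proper per-connection state. */
    uintptr_t retries = (uintptr_t)id->context;

    switch (event->event) {
    case RDMA_CM_EVENT_ADDR_ERROR:
        if (retries >= RESOLVE_RETRIES)
            return -EADDRNOTAVAIL;              /* give up gracefully */
        id->context = (void *)(retries + 1);
        /* rdma_resolve_addr() only restarts the resolution; the outcome
         * arrives later as another CM event on the same channel. */
        if (rdma_resolve_addr(id, NULL, dst, RESOLVE_TIMEOUT_MS))
            return -errno;
        return 0;
    case RDMA_CM_EVENT_ADDR_RESOLVED:
        id->context = NULL;                     /* reset the retry budget */
        if (rdma_resolve_route(id, RESOLVE_TIMEOUT_MS))
            return -errno;
        return 0;
    default:
        return 0;
    }
}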

dmitrygx commented 7 years ago

@fzago-cray Thanks for reporting this. It seems the verbs/RDM provider is fragile here and should be tolerant of this event; it should probably handle it and try to resolve the address again. In the meantime there is a workaround: I've prepared a patch (#3460) that introduces the FI_VERBS_RDM_RESOLVE_ADDR_TIMEOUT environment variable (in microseconds). Could you experiment with this variable, please? The default value is 30,000 ms.
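
(For reference, it should be enough to set it on the launch line like the other variables, e.g. FI_VERBS_RDM_RESOLVE_ADDR_TIMEOUT=60000 MPIR_CVAR_OFI_USE_PROVIDER=verbs srun -n 90 -N 45 osu_alltoall, where 60000 is only an illustrative value.)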

Sorry, I'm not very familiar with the SLURM job manager. Is my understanding correct that your command runs only 90 SLURM tasks (MPI ranks)? How are the processes pinned to nodes for your run? Will it always be 2 tasks per node, or can SLURM place them randomly without any particular pinning?

fzago-cray commented 7 years ago

The task breaks in about 10 seconds, so this timeout (or the others) is not triggered.

That test runs 90 ranks on 45 nodes, so 2 ranks per node. AFAIK workload managers always try to balance tasks. I think the ranks are randomly distributed.

fzago-cray commented 7 years ago

If I run the test with 15 nodes only, it passes all the time. MPIR_CVAR_OFI_USE_PROVIDER=verbs srun -N 15 osu_alltoall

With one more rank (16 nodes), it fails most of the time.

dmitrygx commented 7 years ago

If I run the test with 15 nodes only, it passes all the time. MPIR_CVAR_OFI_USE_PROVIDER=verbs srun -N 15 osu_alltoall

With one more rank (16 nodes), it fails most of the time.

Only 16 nodes? That's a very bad result. But we need to know the number of tasks per node in this case: alltoall will consume (N-1) connections for each MPI task, where N is the number of tasks (for the 90-task run above, that's 89 connections per task).

fzago-cray commented 7 years ago

There is 1 task per node in that case.

dmitrygx commented 7 years ago

There is 1 task per node in that case.

Thanks. That's very bad :( I need to reproduce this and experiment with it.

Have you tried playing with the timeout for address resolution?

fzago-cray commented 7 years ago

No, since the timeout is set to 30 seconds, and the task breaks in under 10.

fzago-cray commented 7 years ago

I added more traces. It appears that although rdma_resolve_addr() succeeds, the reply is that address error event:

179934:libfabric:verbs:av:fi_ibv_rdm_start_connection():84<info> Attempt to start connection with addr 10.13.1.2:42586
179934:libfabric:verbs:av:fi_ibv_rdm_process_event():864<info> got unexpected rdmacm event, RDMA_CM_EVENT_ADDR_ERROR
179934:libfabric:verbs:ep_data:fi_ibv_rdm_cm_progress_thread():218<info> fi_ibv_rdm_cm_progress error
/usr/bin/sh: line 1: 179934 Aborted                 (core dumped) ./mpi/collective/osu_alltoall

That seems to point to an IB address resolution error.
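
(As far as I understand, that matches how rdma_resolve_addr() behaves: a return of 0 only means the request was queued, and the real outcome is delivered later on the CM event channel, which would explain the successful call followed by the error event. A minimal sketch of observing that asynchronous result, purely against the public rdma_cm API and not the provider code:)

#include <stdio.h>
#include <rdma/rdma_cma.h>

/* Block for the next CM event and report whether the address resolved.
 * RDMA_CM_EVENT_ADDR_RESOLVED or RDMA_CM_EVENT_ADDR_ERROR shows up here,
 * not as the return value of rdma_resolve_addr(). */
static int wait_for_addr_resolution(struct rdma_event_channel *channel)
{
    struct rdma_cm_event *event;
    int resolved;

    if (rdma_get_cm_event(channel, &event))
        return -1;

    printf("CM event: %s (status %d)\n",
           rdma_event_str(event->event), event->status);

    resolved = (event->event == RDMA_CM_EVENT_ADDR_RESOLVED);
    rdma_ack_cm_event(event);   /* every retrieved event must be acked */
    return resolved ? 0 : -1;
}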

Is it possible to retry the address resolution a few times? Same for route resolution.

dmitrygx commented 7 years ago

Is it possible to retry the address resolution a few times? Same for route resolution.

Yes, it's possible. I'm going to implement it.

Just out of curiosity, how does the RxM/verbs provider behave with this test?

fzago-cray commented 7 years ago

Thanks.

MPICH asserts when using the RxM/verbs provider, because rxm doesn't provide what it needs. Last time we checked, ofi_rxm didn't yet support FI_MULTI_RECV and FI_ATOMIC, which MPICH ch4:ofi requires. There may be more missing.
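
(For context, and only as a rough sketch rather than the MPICH code, this is roughly how one can ask libfabric whether any installed provider advertises those capabilities via fi_getinfo():)

#include <stdio.h>
#include <stdlib.h>
#include <rdma/fabric.h>

/* Probe for providers advertising the capabilities mentioned above
 * (FI_MULTI_RECV and FI_ATOMIC on top of plain FI_MSG). */
int main(void)
{
    struct fi_info *hints, *info, *cur;
    int ret;

    hints = fi_allocinfo();
    if (!hints)
        return EXIT_FAILURE;

    hints->ep_attr->type = FI_EP_RDM;
    hints->caps = FI_MSG | FI_MULTI_RECV | FI_ATOMIC;

    ret = fi_getinfo(FI_VERSION(1, 5), NULL, NULL, 0, hints, &info);
    if (ret) {
        printf("no provider offers these caps: %s\n", fi_strerror(-ret));
    } else {
        for (cur = info; cur; cur = cur->next)
            printf("candidate provider: %s\n", cur->fabric_attr->prov_name);
        fi_freeinfo(info);
    }
    fi_freeinfo(hints);
    return ret ? EXIT_FAILURE : EXIT_SUCCESS;
}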

dmitrygx commented 7 years ago

ch4:ofi

I didn't know you were using ch4 now. Thanks!

fzago-cray commented 7 years ago

The underlying cause was an IB misconfiguration, with unreachable nodes. Sorry. While working on that, I made a couple of cleanups for which I'll open a PR.

dmitrygx commented 7 years ago

The underlying cause was an IB misconfiguration, with unreachable nodes. Sorry. While working on that, I made a couple of cleanups for which I'll open a PR.

Thanks for the investigation. Sure, that would be great 👍