Correctness failure when using BTL RDMA

vspetrov commented 7 years ago

Thank you for taking the time to submit an issue!

Background information

Possibly related to the "patcher" memory framework

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

v2.0.x v2.x master

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Built from sources, cloned from github

Please describe the system on which you are running

Operating system/version: Red Hat Enterprise Linux Server release 7.2 (Maipo)
Computer hardware: Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz
Network type: Mellanox infiniband, CIB

Details of the problem

Multithreaded correctness test (attached mt_stress.zip ) fails with OMPI. Reproduced on 2 nodes.

shell$ mpirun -np 3 --map-by node -mca pml ob1 -mca btl openib,self   `nif 50`    -mca coll ^hcoll ./mt_stress 1497004394

...
Splitting id 124
CORRECTNESS ERROR: id 124, TEST_TYPE 2, pos 3692, value 3, expected 6, dtype MPI_INT, root 0, rank 2, count 15045, comm_size 3, color 1
CORRECTNESS ERROR: id 124, TEST_TYPE 2, pos 3692, value 3, expected 6, dtype MPI_INT, root 0, rank 0, count 15045, comm_size 3, color 1
CORRECTNESS ERROR: id 124, TEST_TYPE 2, pos 3692, value 3, expected 6, dtype MPI_INT, root 0, rank 1, count 15045, comm_size 3, color 1
Splitting id 125
Splitting id 126
...

This is an allreduce failure. After some debug i narrowed it down to the single p2p inside allreduce. One ranks sends the data to the other side, but the data is received corrupted for some reason.

The test would pass if "-mca mpi_leave_pinned 0" OR if the ompi is built without memory manager support (--without-memory-manager). This is why my suspicion goes to "patcher" memory framework.

Additionally, the same issues are observed with pml yalla (mellanox mxm based p2p). Again disabling mem notifications (MXM_MEM_ON_DEMAND_MAP=n) helps.

Since "patcher" was not present in ompi_v1.10 i wanted to try test with that version. btl openib wouldn't work since it didn't support mpi_thread_multiple in 1.10 however, pml yalla works w/o errors with 1.10.

jladd-mlnx commented 7 years ago

@markalle @jjhursey @gpaulsen is this something you've observed in your testing?

gpaulsen commented 7 years ago

Yes, we're struggling with some issues around patcher right now as well on ppc64le. Still trying to root cause.

jladd-mlnx commented 7 years ago

Thanks, @gpaulsen. This was actually reported to us by an IBM test engineer who thought it was an HCOLL issue. @vspetrov 's report is what we've been able to make of it - that it's most likely patcher related and not an HCOLL bug.

alsrgv commented 7 years ago

Thanks @vspetrov for reporting this issue!

I am seeing exactly the same issue with MPI_Allreduce, except in my case all MPI calls are done in single thread (but otherwise application is multi-threaded and uses jemalloc). I also was able to trace it down to single p2p call where correct data is sent and corrupted data is received.

It does repro with OpenIB and does not repro with TCP. It does not repro with -mca mpi_leave_pinned 0 or --without-memory-manager, as above.

Unfortunately, with -mca mpi_leave_pinned 0 performance with OpenIB is worse than with TCP, so it defeats the purpose of RDMA.

I also tried OpenMPI 1.10.7 and it works correctly. Performance is not as good as 2.1.1 with memory manager that gives corruption, but definitely better than 2.1.1 without pinned memory.

I have ring-reduce based implementation on pure MPI_Send()/MPI_Recv() that works well and does not lead to corruption, but I'm worried that it may do so in certain circumstances. Are there any pointers what's special about MPI_Allreduce() and why it corrupts the data?

alsrgv commented 7 years ago

UPDATE: I discovered that I can repro this issue with my ring-reduce implementation too if I modify it to use malloc'd buffer. Currently it's using pinned memory buffer and avoids corruption.

jsquyres commented 7 years ago

Per discussion on 1 Aug Webex:

This issue was originally reported with MXM+patcher
@alsrgv reported with openib (assumedly +patcher)
IBM has seen similar problems in PAMI+patcher.

@vspetrov Does this happen with v3.0.x? (we're assuming you didn't test v3.0.x -- can you clarify?)

@hjelmn Can you chime in here? It's apparently reproducible with openib, and a test case was attached in the original description.