openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

Vector datatype MPI send/recv fails with rc or dc but not tcp or ud for host buffers #5168

Open Akshay-Venkatesh opened 4 years ago

Akshay-Venkatesh commented 4 years ago

Describe the bug

The bug was originally reported not to occur with UCX 1.7, but I've not confirmed this. I'm able to reproduce it on 1.8.x and master.

Steps to Reproduce

Makefile and source here: https://gist.github.com/Akshay-Venkatesh/e33a9088ec9cfeb2cd052c106dabd821

mpirun --host prm-dgx-21,prm-dgx-22 -np 4 --npernode 2 --oversubscribe --mca pml ucx --mca coll ^hcoll --mca btl ^openib,smcuda -x LD_LIBRARY_PATH -x UCX_MEMTYPE_CACHE=n -x UCX_TLS=rc ucx-vector-bug/main

Setup and versions

Output

Passing case:

mpirun --host prm-dgx-21,prm-dgx-22 -np 4 --npernode 2 --oversubscribe --mca pml ucx --mca coll ^hcoll --mca btl ^openib,smcuda -x LD_LIBRARY_PATH -x UCX_MEMTYPE_CACHE=n -x UCX_TLS=ud ucx-vector-bug/main
rank = 3, size = 4
rank = 2, size = 4
rank = 0, size = 4
rank = 1, size = 4
pass

Failing case:

mpirun --host prm-dgx-21,prm-dgx-22 -np 4 --npernode 2 --oversubscribe --mca pml ucx --mca coll ^hcoll --mca btl ^openib,smcuda -x LD_LIBRARY_PATH -x UCX_MEMTYPE_CACHE=n -x UCX_TLS=rc ucx-vector-bug/main

rank = 0, size = 4
rank = 1, size = 4
rank = 2, size = 4
rank = 3, size = 4
main: ../../../opal/datatype/opal_datatype_unpack.h:97: unpack_predefined_data: Assertion `0 == (cando_count % _elem->blocklen)' failed.
[prm-dgx-21:42920] *** Process received signal ***
[prm-dgx-21:42920] Signal: Aborted (6)
[prm-dgx-21:42920] Signal code:  (-6)
[prm-dgx-21:42920] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3ef20)[0x7fe8bf8e7f20]
[prm-dgx-21:42920] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7fe8bf8e7e97]
[prm-dgx-21:42920] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7fe8bf8e9801]
[prm-dgx-21:42920] [ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x3039a)[0x7fe8bf8d939a]
[prm-dgx-21:42920] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x30412)[0x7fe8bf8d9412]
[prm-dgx-21:42920] [ 5] $OMPI_HOME/lib/libopen-pal.so.40(+0x69b91)[0x7fe8bf286b91]
[prm-dgx-21:42920] [ 6] $OMPI_HOME/lib/libopen-pal.so.40(+0x6b129)[0x7fe8bf288129]
[prm-dgx-21:42920] [ 7] $OMPI_HOME/lib/libopen-pal.so.40(opal_generic_simple_unpack+0x9c3)[0x7fe8bf288bad]
[prm-dgx-21:42920] [ 8] $OMPI_HOME/lib/libopen-pal.so.40(opal_convertor_unpack+0x2c6)[0x7fe8bf272c79]
[prm-dgx-21:42920] [ 9] $OMPI_HOME/lib/openmpi/mca_pml_ucx.so(+0xa617)[0x7fe8adfda617]
[prm-dgx-21:42920] [10] $UCX_HOME/lib/libucp.so.0(ucp_rndv_data_handler+0x306)[0x7fe8add7bd86]
[prm-dgx-21:42920] [11] $UCX_HOME/lib/ucx/libuct_ib.so.0(+0x6ba73)[0x7fe8ac76ca73]
[prm-dgx-21:42920] [12] $UCX_HOME/lib/libucp.so.0(ucp_worker_progress+0x7a)[0x7fe8add4be6a]
[prm-dgx-21:42920] [13] $OMPI_HOME/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_recv+0x24a)[0x7fe8adfd6124]
[prm-dgx-21:42920] [14] $OMPI_HOME/lib/libmpi.so.40(MPI_Recv+0x2da)[0x7fe8c00e4a83]
[prm-dgx-21:42920] [15] ucx-vector-bug/main(+0xef4)[0x561b4f12bef4]
[prm-dgx-21:42920] [16] ucx-vector-bug/main(+0x1119)[0x561b4f12c119]
[prm-dgx-21:42920] [17] ucx-vector-bug/main(+0x1315)[0x561b4f12c315]
[prm-dgx-21:42920] [18] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7fe8bf8cab97]
[prm-dgx-21:42920] [19] ucx-vector-bug/main(+0xc7a)[0x561b4f12bc7a]
[prm-dgx-21:42920] *** End of error message ***
yosefe commented 4 years ago

@Akshay-Venkatesh seems like send type and recv type are created with different 'count' (1024 vs 512), is this expected?

yosefe commented 4 years ago

I don't think OpenMPI supports it, and it works with tcp/ud only because their segment size is a power of 2.

Akshay-Venkatesh commented 4 years ago

@Akshay-Venkatesh seems like send type and recv type are created with different 'count' (1024 vs 512), is this expected?

@yosefe sorry missed this earlier.

The send-side and receive-side strides are different. The send side has a stride of 512 elements and the receive side a stride of 1024 elements, but the sizes in bytes match at both ends, and so do the corresponding basic element types. Basically (512 * sizeof(double)) * 512 bytes at both ends, but the receive side has gaps because the stride is 1024 between 512-element blocks. Also, MPI semantics do permit this usage if I'm not mistaken. Page 112, line 28 of MPI 3.1 states the following:

Type matching is defined according to the type signature of the corresponding datatypes, that is, the sequence of basic type components. Type matching does not depend on some aspects of the datatype definition, such as the displacements (layout in memory) or the intermediate types used.

Please correct me if you think I'm mistaken here.
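For reference, a minimal sketch of the pattern described above (the actual reproducer is in the gist linked in the issue description; the counts, ranks and buffer sizes here are illustrative assumptions, not copied from it):

#include <mpi.h>
#include <stdlib.h>

/* Illustrative sketch only: same type signature (512 doubles per block, 4 blocks)
 * on both sides, but the send side is effectively contiguous (stride == blocklen)
 * while the receive side leaves a 512-double gap after each block. */
int main(int argc, char **argv)
{
    int rank;
    MPI_Datatype send_type, recv_type;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Type_vector(4, 512, 512,  MPI_DOUBLE, &send_type);  /* blocklen 512, stride 512  */
    MPI_Type_vector(4, 512, 1024, MPI_DOUBLE, &recv_type);  /* blocklen 512, stride 1024 */
    MPI_Type_commit(&send_type);
    MPI_Type_commit(&recv_type);

    double *sbuf = malloc(4 * 512  * sizeof(double));
    double *rbuf = malloc(4 * 1024 * sizeof(double));

    if (rank == 0)
        MPI_Send(sbuf, 1, send_type, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(rbuf, 1, recv_type, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Type_free(&send_type);
    MPI_Type_free(&recv_type);
    free(sbuf);
    free(rbuf);
    MPI_Finalize();
    return 0;
}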

Whether this stride difference was intended by the user is something I can confirm. The user who reported this issue also reported that the following cases work:

yosefe commented 4 years ago

@Akshay-Venkatesh this is because the segment size has changed with UCX v1.8.0. Anyway, I think this is an OpenMPI issue and not a UCX one.

bosilca commented 4 years ago

@yosefe taking into account that OMPI works with several flavors of PML (including an older UCX version), the issue might be on the UCX side. More specifically, this error indicates that partial data has been received, something the datatype engine tries hard to prevent. So either some bytes got lost, or the sender is doing a fragmentation that does not come from the OMPI datatype engine.

yosefe commented 4 years ago

@bosilca I think what happens here is that the receive stride is larger than the send stride, so the packer side packs with a 512-byte granularity (for example), but the receiver expects to unpack with a 1024-byte granularity, and fails. Does OpenMPI support such a case?
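To make the granularity mismatch concrete, here is a rough stand-alone illustration of the modulus check that fires in the backtrace (the fragment size is made up for this example, not taken from the failing run):

/* Illustration only: hypothetical numbers, not taken from the actual run. */
#include <assert.h>
#include <stdio.h>

int main(void)
{
    size_t frag_bytes  = 8256;              /* hypothetical RC rendezvous fragment size */
    size_t elem_size   = sizeof(double);    /* 8 bytes */
    size_t blocklen    = 512;               /* receive-side vector blocklength, in elements */

    size_t cando_count = frag_bytes / elem_size;   /* 1032 elements arrive in this fragment */

    printf("cando_count = %zu, remainder = %zu\n", cando_count, cando_count % blocklen);

    /* Mirrors the check from opal_datatype_unpack.h in the backtrace: the unpacker
     * expects whole blocks, but 1032 is not a multiple of 512, so this aborts. */
    assert(0 == (cando_count % blocklen));
    return 0;
}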

bosilca commented 4 years ago

Definitely, as @Akshay-Venkatesh mentioned earlier, communication with datatypes that have the same type signature, i.e. the same sequence of predefined types ignoring the memory displacements, is an MPI requirement. In fact, in terms of communication MPI forbids little: truncation (receiving less than sent) and type conversion (sending a float and receiving an int, as an example).

Vectors with different strides are really one of the simplest cases. To help understand what is going wrong, build OMPI in debug mode and set the following MCA params to a non-zero value: mpi_ddt_unpack_debug and mpi_ddt_pack_debug. You should get additional info about how the pack/unpack works on both sides.
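For example, re-running the failing command above with those two params set (assuming a debug OMPI build) would look roughly like:

mpirun --host prm-dgx-21,prm-dgx-22 -np 4 --npernode 2 --oversubscribe --mca pml ucx --mca coll ^hcoll --mca btl ^openib,smcuda --mca mpi_ddt_pack_debug 1 --mca mpi_ddt_unpack_debug 1 -x LD_LIBRARY_PATH -x UCX_MEMTYPE_CACHE=n -x UCX_TLS=rc ucx-vector-bug/main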

Akshay-Venkatesh commented 4 years ago

@bosilca Added debug output of the test from pack/unpack here: https://gist.github.com/Akshay-Venkatesh/e33a9088ec9cfeb2cd052c106dabd821#file-ddt-debug-txt