open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

Problems with cuda aware MPI and Omnipath networks #4899

Closed edgargabriel closed 5 years ago

edgargabriel commented 6 years ago

Thank you for taking the time to submit an issue!

Background information

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

v3.1.0rc2 and master

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

git clone

Please describe the system on which you are running


Details of the problem

We have a user code that makes use of CUDA-aware MPI features for direct data transfer between multiple GPUs. The code has run successfully on fairly large InfiniBand clusters. However, we face a problem when executing it on our Omnipath cluster.

@bosilca pointed me to the following commit

https://github.com/open-mpi/ompi/commit/2e83cf15ce790f89c782b6222253ab18252a7d2f

which is the reason we turned to the 3.1 release candidate, since it contains this commit.

The good news is that, using ompi 3.1.0rc2, the code runs correctly in a single-node/multi-GPU environment. Running the code on multiple nodes with multiple GPUs still fails, however. A simple benchmark was able to identify that direct transfer from GPU memory across multiple nodes works correctly up to a certain message length, but fails once the message length exceeds a threshold. The error message comes directly from the psm2 library and is attached below.

My question is whether there is a minimum psm2 library version required to make this feature work correctly. Our cluster currently uses libpsm2-10.2.235, and there are obviously newer versions out there (the newest being 10.3.37, I think).

As a side note, we did manage to make the code work by using the verbs API and disabling cuda_async_recv, e.g.

mpirun --mca btl_openib_cuda_async_recv 0  --mca pml ob1 -np 2 ...

but this slows down the communication performance quite a bit compared to using the psm2 library directly.

compute-0-37.local.20421test: Reading from remote process' memory failed. Disabling CMA support
compute-0-37.local.20422test: Reading from remote process' memory failed. Disabling CMA support
compute-0-37.local.20421Assertion failure at /nfs/site/home/phcvs2/gitrepo/ifs-all/Ofed_Delta/rpmbuild/BUILD/libpsm2-10.2.235/ptl_am/ptl.c:152: nbytes == req->recv_msglen
compute-0-37.local.20422Assertion failure at /nfs/site/home/phcvs2/gitrepo/ifs-all/Ofed_Delta/rpmbuild/BUILD/libpsm2-10.2.235/ptl_am/ptl.c:152: nbytes == req->recv_msglen
[compute-0-37:20421] *** Process received signal ***
[compute-0-37:20421] Signal: Aborted (6)
[compute-0-37:20421] Signal code:  (-6)
[compute-0-37:20422] *** Process received signal ***
[compute-0-37:20422] Signal: Aborted (6)
[compute-0-37:20422] Signal code:  (-6)
[compute-0-37:20421] [ 0] /lib64/libpthread.so.0(+0xf370)[0x7fcdaaaf3370]
[compute-0-37:20421] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x7fcda9d341d7]
[compute-0-37:20421] [ 2] /lib64/libc.so.6(abort+0x148)[0x7fcda9d358c8]
[compute-0-37:20421] [ 3] /lib64/libpsm2.so.2(+0x14b18)[0x7fcd823e9b18]
[compute-0-37:20421] [ 4] /lib64/libpsm2.so.2(+0xe6f9)[0x7fcd823e36f9]
[compute-0-37:20421] [ 5] /lib64/libpsm2.so.2(psm2_mq_irecv2+0x59a)[0x7fcd823f088a]
rhc54 commented 6 years ago

@matcabral

matcabral commented 6 years ago

Hi @edgargabriel, I'll be looking at this. The libpsm2 version you have is new enough; the minimum is 10.2.175. See https://www.open-mpi.org/faq/?category=runcuda.

Could you please share the message size at which you see this abort?

matcabral commented 6 years ago

@edgargabriel Please confirm that the PSM2 library is built with CUDA support homogeneously on all nodes. The error message you shared suggests the library may not have CUDA support:

...libpsm2-10.2.235/ptl_am/ptl.c:152: nbytes == req->recv_msglen

https://github.com/intel/opa-psm2/blob/PSM2_10.2-235/ptl_am/ptl.c#L152

edgargabriel commented 6 years ago

It is definitely homogeneous; it is the version taken from the OpenHPC roll. At a bare minimum it is identical on all GPU nodes, but I suspect it is in fact the same on all compute nodes.

Would you recommend that I recompile libpsm2? And if so, is it possible to have multiple versions of libpsm2 on the system?

matcabral commented 6 years ago

I would like to start by confirming that PSM2 has CUDA support. If you add -x PSM2_IDENTIFY to your mpirun command, you should see <host-name>.<pid> PSM2_IDENTIFY PSM2 v2.1-cuda. Alternatively, a non-official way to find out without running is grep cudaMemcpy /lib64/libpsm2.so; if it matches, you are fine. Note that there are no guarantees this "alternative" method will keep working in the future. If the support is not there, you will have to build it yourself; see the instructions at https://github.com/intel/opa-psm2. If you will be building, it would indeed be advisable to choose a newer version.

Yes, you can have multiple versions of libpsm2 on the system. Just make sure to set LD_LIBRARY_PATH accordingly: mpirun ... -x LD_LIBRARY_PATH=<my_custom_libpsm2_path>
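
Putting the two checks and the library override together as commands (the paths below are placeholders, not taken from this report):

# informal check whether the installed libpsm2 was built with CUDA support
grep cudaMemcpy /lib64/libpsm2.so

# run against a privately installed libpsm2 by exporting its location to all ranks
mpirun -x LD_LIBRARY_PATH=/opt/my-libpsm2/usr/lib64:$LD_LIBRARY_PATH -np 2 ./my_app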

edgargabriel commented 6 years ago

@matcabral I will try to get that information; my job is currently queued. In parallel, I will also try to compile a new version of the psm2 library. Thanks for your help!

edgargabriel commented 6 years ago

I think you are probably right; our PSM2 library does not have CUDA support built in. Not entirely clear to me how any of the tests worked in that case. Anyway, I will try to compile a new version of psm2 with CUDA support and will let you know.

matcabral commented 6 years ago

Not entirely clear to me how any of the tests worked in that case.

OMPI has native CUDA support, so it should work even with other transports (e.g. sockets, though I have not tested that). However, the PSM2 CUDA support in OMPI expects libpsm2 to be built with CUDA support; you get unexpected results if you mix them. Maybe some non-CUDA buffers were sent? In any case, when you effectively use the PSM2 CUDA path in OMPI (an OMPI CUDA build) together with a libpsm2 CUDA build, you will get a significant performance boost.

edgargabriel commented 6 years ago

This might be off topic for this item (and I would be happy to discuss it offline), but I have problems compiling psm2 with CUDA support. Without CUDA support the library compiles without any issues, but the moment I set PSM_CUDA=1 I get error messages related to undefined symbols and structures, e.g.

In file included from /home/egabriel/opa-psm2/opa/opa_time.c:70:0:
/home/egabriel/opa-psm2/opa/../include/opa_user.h: In function 'hfi_update_tid':
/home/egabriel/opa-psm2/opa/../include/opa_user.h:811:26: error: storage size of 'tidinfo' isn't known
  struct hfi1_tid_info_v2 tidinfo;
/home/egabriel/opa-psm2/opa/opa_service.c: In function '_hfi_cmd_ioctl':
/home/egabriel/opa-psm2/opa/opa_service.c:346:34: error: 'HFI1_IOCTL_TID_UPDATE_V2' undeclared (first use in this function)
  [PSMI_HFI_CMD_TID_UPDATE_V2] = {HFI1_IOCTL_TID_UPDATE_V2 , 0},

I searched Google for solutions but could not find anything. I also could not find those symbols in the Linux kernel sources (e.g. kernel-source/include/uapi/rdma/hfi/ or similar). Any ideas/hints on what I am missing?

matcabral commented 6 years ago

Quick answer: to achieve zero-copy transfers, libpsm2 uses a special version of the hfi1 driver (the OPA HFI driver). The driver headers you have available most likely don't have CUDA support. As you noticed, you will need the hfi1 driver with CUDA support loaded on the system. Please allow me to search where these details are publicly posted.
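
For reference, the libpsm2 build itself is roughly the following once the CUDA-enabled hfi1 driver headers are present (a sketch only; exact targets may differ between opa-psm2 releases):

git clone https://github.com/intel/opa-psm2.git
cd opa-psm2
make PSM_CUDA=1    # CUDA-enabled build; fails if the CUDA-enabled hfi1 headers are missing

The resulting library can then be picked up at run time with the LD_LIBRARY_PATH override shown earlier.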

matcabral commented 6 years ago

@edgargabriel are you using the Intel® Omni-Path Fabric Software package? https://downloadcenter.intel.com/download/27335/Intel-Omni-Path-Fabric-Software-Including-Intel-Omni-Path-Host-Fabric-Interface-Driver-?v=t

This is in fact the simplest way to get this setup. See the install guide: https://www.intel.com/content/dam/support/us/en/documents/network-and-i-o/fabric-products/Intel_OP_Fabric_Software_IG_H76467_v8_1.pdf

I suspect your nodes already satisfy the NVIDIA software requirements in section 4.4. Then proceed to section 5.1.1, "./INSTALL -G" (Install GPUDirect* components). This will install the libpsm2 and hfi1 drivers with CUDA support, and in addition a build of OMPI with CUDA support at /usr/mpi/gcc/openmpi-2.1.2-cuda-hfi/.

However, if you still want to build yourself, the source RPMs for all the components are also included.

matcabral commented 6 years ago

Hi @edgargabriel, any news ?

edgargabriel commented 6 years ago

@matcabral: our system administrators updated the OPA stack on the cluster to include the CUDA-aware packages. It took a while since it is a production system, but it is finally done. I ran a couple of tests on Monday, but I still face some problems, although the error messages are now different. I will try to gather the precise cases and error messages.

edgargabriel commented 6 years ago

@matcabral: before I post the error messages, I would like to clarify one point. The new software stack that is installed on the system does now have CUDA support compiled into it. I can verify that in two ways: a) I can successfully compile my psm2 library using PSM_CUDA=1 (which I could not before), and b) if I run

[egabriel@compute-0-37 lib64]$ grep cudaMemcpy libpsm2.so
Binary file libpsm2.so matches

which it did not report before. However, if I use the first method that you mentioned, I still get an error message:

[egabriel@compute-0-37 ~]$ mpirun -x PSM2_IDENTIFY -np 2 ./main-osc
[compute-0-37.local:19532] Warning: could not find environment variable "PSM2_IDENTIFY"

Is that ok, or might this point to a problem ?

matcabral commented 6 years ago

-x PSM2_IDENTIFY=1 my bad :flushed:

Note that the hfi1 driver binary that is loaded must also be the CUDA-enabled one: modinfo hfi1

edgargabriel commented 6 years ago

ok, this looks better, thanks :-)

[egabriel@compute-0-39 ~]$ mpirun -x PSM2_IDENTIFY=1 -np 2 ./main-osc
compute-0-39.local.3281 PSM2_IDENTIFY PSM2 v2.1-cuda
compute-0-39.local.3281 PSM2_IDENTIFY location /usr/lib64/libpsm2.so.2
compute-0-39.local.3281 PSM2_IDENTIFY build date 2017-10-25 22:45:44+00:00
compute-0-39.local.3281 PSM2_IDENTIFY src checksum 4a3b39b93920ff4b7cb95ec90a1ff6d6df07d111
compute-0-39.local.3281 PSM2_IDENTIFY git checksum 61c8d25f4d7248c12cbdab63671a5bd237e81321
compute-0-39.local.3281 PSM2_IDENTIFY built against driver interface v6.3
compute-0-39.local.3280 PSM2_IDENTIFY PSM2 v2.1-cuda
compute-0-39.local.3280 PSM2_IDENTIFY location /usr/lib64/libpsm2.so.2
compute-0-39.local.3280 PSM2_IDENTIFY build date 2017-10-25 22:45:44+00:00
compute-0-39.local.3280 PSM2_IDENTIFY src checksum 4a3b39b93920ff4b7cb95ec90a1ff6d6df07d111
compute-0-39.local.3280 PSM2_IDENTIFY git checksum 61c8d25f4d7248c12cbdab63671a5bd237e81321
compute-0-39.local.3280 PSM2_IDENTIFY built against driver interface v6.3
edgargabriel commented 6 years ago

First, the scenario that I am working with right now is one node, two GPUs, two MPI processes, each MPI process uses one GPU.

I have three test cases (and once I figure out how to upload the code to GitHub, I am happy to provide them). I am not excluding the possibility that something is wrong in my test cases.

  1. Test case 1: p2p data transfer. Sender buffer is host memory, receive buffer is GPU memory. Test case works. Output looks as follows:
N1 >> on local device 1 on host compute-0-37.local
N0 >> on local device 0 on host compute-0-37.local
1: length=1 >> working
0: length=1 >> working
1: length=2 >> working
0: length=2 >> working
1: length=4 >> working
0: length=4 >> working
1: length=8 >> working
0: length=8 >> working
1: length=16 >> working
0: length=16 >> working
1: length=32 >> working
0: length=32 >> working
1: length=64 >> working
0: length=64 >> working
1: length=128 >> working
0: length=128 >> working
1: length=256 >> working
0: length=256 >> working
1: length=512 >> working
0: length=512 >> working
1: length=1024 >> working
0: length=1024 >> working
1: length=2048 >> working
0: length=2048 >> working
1: length=4096 >> working
0: length=4096 >> working
1: length=8192 >> working
0: length=8192 >> working
1: length=16384 >> working
0: length=16384 >> working
1: length=32768 >> working
0: length=32768 >> working
1: length=65536 >> working
0: length=65536 >> working
  2. Test case 2: p2p data transfer, sender buffer is GPU memory, receiver buffer is GPU memory. Test case does not work, it segfaults right away.
N1 >> with local device 1 on host compute-0-37.local
N0 >> with local device 0 on host compute-0-37.local
[compute-0-37:18612] *** Process received signal ***
[compute-0-37:18612] Signal: Segmentation fault (11)
[compute-0-37:18612] Signal code: Invalid permissions (2)
[compute-0-37:18612] Failing at address: 0x7f2695600000
[compute-0-37:18612] [ 0] /usr/lib64/libpthread.so.0(+0xf370)[0x7f26bcec6370]
[compute-0-37:18612] [ 1] /usr/lib64/libpsm2.so.2(+0x64ed)[0x7f26b0bba4ed]
[compute-0-37:18612] [ 2] /usr/lib64/libpsm2.so.2(+0xb2ac)[0x7f26b0bbf2ac]
[compute-0-37:18612] [ 3] /usr/lib64/libpsm2.so.2(psm2_mq_send2+0x39)[0x7f26b0bd2739]
[compute-0-37:18612] [ 4] /usr/lib64/libfabric/libpsmx2-fi.so(+0x1bed7)[0x7f26a246bed7]
[compute-0-37:18612] [ 5] /brazos/gabriel/OpenMPI-CUDA/lib/openmpi/mca_mtl_ofi.so(+0x260e)[0x7f269998c60e]
[compute-0-37:18612] [ 6] /brazos/gabriel/OpenMPI-CUDA/lib/openmpi/mca_mtl_ofi.so(+0x3c2b)[0x7f269998dc2b]
[compute-0-37:18612] [ 7] /brazos/gabriel/OpenMPI-CUDA/lib/openmpi/mca_pml_cm.so(+0x4c77)[0x7f2699dcfc77]
[compute-0-37:18612] [ 8] /brazos/gabriel/OpenMPI-CUDA/lib/libmpi.so.40(MPI_Isend+0x2f1)[0x7f26bd387b1d]
[compute-0-37:18612] [ 9] ./main-p2p-2[0x403679]
[compute-0-37:18612] [10] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f26bc0f3b35]
[compute-0-37:18612] [11] ./main-p2p-2[0x403379]
[compute-0-37:18612] *** End of error message ***
[compute-0-37:18613] *** Process received signal ***
[compute-0-37:18613] Signal: Segmentation fault (11)
[compute-0-37:18613] Signal code: Invalid permissions (2)
[compute-0-37:18613] Failing at address: 0x7f112e000000
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
[compute-0-37:18613] [ 0] /usr/lib64/libpthread.so.0(+0xf370)[0x7f1162cae370]
[compute-0-37:18613] [ 1] /usr/lib64/libpsm2.so.2(+0x64ed)[0x7f11528d74ed]
[compute-0-37:18613] [ 2] /usr/lib64/libpsm2.so.2(+0xb2ac)[0x7f11528dc2ac]
[compute-0-37:18613] [ 3] /usr/lib64/libpsm2.so.2(psm2_mq_send2+0x39)[0x7f11528ef739]
[compute-0-37:18613] [ 4] /usr/lib64/libfabric/libpsmx2-fi.so(+0x1bed7)[0x7f1148324ed7]
[compute-0-37:18613] [ 5] /brazos/gabriel/OpenMPI-CUDA/lib/openmpi/mca_mtl_ofi.so(+0x260e)[0x7f114784660e]
[compute-0-37:18613] [ 6] /brazos/gabriel/OpenMPI-CUDA/lib/openmpi/mca_mtl_ofi.so(+0x3c2b)[0x7f1147847c2b]
[compute-0-37:18613] [ 7] /brazos/gabriel/OpenMPI-CUDA/lib/openmpi/mca_pml_cm.so(+0x4c77)[0x7f1147c89c77]
[compute-0-37:18613] [ 8] /brazos/gabriel/OpenMPI-CUDA/lib/libmpi.so.40(MPI_Isend+0x2f1)[0x7f116316fb1d]
[compute-0-37:18613] [ 9] ./main-p2p-2[0x403679]
[compute-0-37:18613] [10] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f1161edbb35]
[compute-0-37:18613] [11] ./main-p2p-2[0x403379]
[compute-0-37:18613] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node compute-0-37 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
  3. Test case 3: one-sided communication using MPI_Put, both local and remote memory are in GPU buffer. Result: test case works up to a certain message length.
N0 >> on local device 0 on host compute-0-37.local
N1 >> on local device 1 on host compute-0-37.local
0: length=1 >> working
1: length=1 >> working
0: length=2 >> working
1: length=2 >> working
0: length=4 >> working
1: length=4 >> working
0: length=8 >> working
1: length=8 >> working
0: length=16 >> working
1: length=16 >> working
0: length=32 >> working
1: length=32 >> working
0: length=64 >> working
1: length=64 >> working
0: length=128 >> working
1: length=128 >> working
0: length=256 >> working
1: length=256 >> working
0: length=512 >> working
1: length=512 >> working
[compute-0-37:18800] *** Process received signal ***
[compute-0-37:18801] *** Process received signal ***
[compute-0-37:18801] Signal: Segmentation fault (11)
[compute-0-37:18801] Signal code: Invalid permissions (2)
[compute-0-37:18801] Failing at address: 0x7f6e01c00000
[compute-0-37:18800] Signal: Segmentation fault (11)
[compute-0-37:18800] Signal code: Invalid permissions (2)
[compute-0-37:18800] Failing at address: 0x7fab08800000
[compute-0-37:18801] [ 0] [compute-0-37:18800] [ 0] /usr/lib64/libpthread.so.0(+0xf370)[0x7f6e2167a370]
[compute-0-37:18801] [ 1] /usr/lib64/libpthread.so.0(+0xf370)[0x7fab325ae370]
[compute-0-37:18800] [ 1] /usr/lib64/libpsm2.so.2(+0x4d118)[0x7fab262e9118]
[compute-0-37:18800] *** End of error message ***
/usr/lib64/libpsm2.so.2(+0x4d118)[0x7f6e153b5118]
[compute-0-37:18801] *** End of error message ***
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node compute-0-37 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

Note that the length is the number of elements of type MPI_DOUBLE, not the number of bytes.
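
For clarity, the pattern this third test case exercises is roughly the following (a minimal sketch, not the actual test code; error checking is omitted and two GPUs per node are assumed):

#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    /* one GPU per rank; as discussed further down in this thread, with the
       PSM2 CUDA path the device should really be selected before MPI_Init() */
    cudaSetDevice(rank % 2);

    const int len = 512;                    /* count of MPI_DOUBLE elements */
    double *local, *remote;
    cudaMalloc((void **)&local,  len * sizeof(double));
    cudaMalloc((void **)&remote, len * sizeof(double));

    /* expose the GPU buffer as a one-sided window */
    MPI_Win win;
    MPI_Win_create(remote, len * sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    /* put the local GPU buffer into the neighbor's GPU-backed window */
    MPI_Put(local, len, MPI_DOUBLE, (rank + 1) % nprocs, 0, len, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);

    MPI_Win_free(&win);
    cudaFree(local);
    cudaFree(remote);
    MPI_Finalize();
    return 0;
}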

matcabral commented 6 years ago

Both cases should work. You may confirm with the OSU MPI benchmarks, which have CUDA support: http://mvapich.cse.ohio-state.edu/benchmarks/. NOTE that OMPI does NOT yet support CUDA for non-blocking collectives: https://www.open-mpi.org/faq/?category=runcuda#mpi-apis-no-cuda

edgargabriel commented 6 years ago

Well, the situation is pretty much the same. If I run an OSU benchmark directly using psm2, I get the same error message; if I tell mpirun to switch to ob1, everything works, even from device memory.

[egabriel@compute-0-39 pt2pt]$ mpirun -np 2 ./osu_bw -d cuda D D
# OSU MPI-CUDA Bandwidth Test v5.4.1
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size      Bandwidth (MB/s)
[compute-0-39:08750] *** Process received signal ***
[compute-0-39:08750] Signal: Segmentation fault (11)
[compute-0-39:08750] Signal code: Invalid permissions (2)
[compute-0-39:08750] Failing at address: 0x7efd0fc00000
[compute-0-39:08750] [ 0] /usr/lib64/libpthread.so.0(+0xf370)[0x7efd4b878370]
[compute-0-39:08750] [ 1] /brazos/gabriel/OPA-PSM2/usr/lib64/libpsm2.so.2(+0x8929)[0x7efd29b46929]
[compute-0-39:08750] [ 2] /brazos/gabriel/OPA-PSM2/usr/lib64/libpsm2.so.2(+0xa343)[0x7efd29b48343]
[compute-0-39:08750] [ 3] /brazos/gabriel/OPA-PSM2/usr/lib64/libpsm2.so.2(psm2_mq_send2+0x2d)[0x7efd29b5975d]
[compute-0-39:08750] [ 4] /usr/lib64/libfabric/libpsmx2-fi.so(+0x1bed7)[0x7efd1b786ed7]
[compute-0-39:08750] [ 5] /brazos/gabriel/OpenMPI-CUDA/lib/openmpi/mca_mtl_ofi.so(+0x260e)[0x7efd12a3160e]
[compute-0-39:08750] [ 6] /brazos/gabriel/OpenMPI-CUDA/lib/openmpi/mca_mtl_ofi.so(+0x3c2b)[0x7efd12a32c2b]
[compute-0-39:08750] [ 7] /brazos/gabriel/OpenMPI-CUDA/lib/openmpi/mca_pml_cm.so(+0x4c77)[0x7efd12e74c77]
[compute-0-39:08750] [ 8] /brazos/gabriel/OpenMPI-CUDA/lib/libmpi.so.40(MPI_Isend+0x2f1)[0x7efd4c351b1d]
[compute-0-39:08750] [ 9] ./osu_bw[0x401f1e]
[compute-0-39:08750] [10] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7efd4b4c9b35]
[compute-0-39:08750] [11] ./osu_bw[0x40227b]
[compute-0-39:08750] *** End of error message ***
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node compute-0-39 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
[egabriel@compute-0-39 pt2pt]$ mpirun --mca pml ob1  -np 2 ./osu_bw -d cuda D D
# OSU MPI-CUDA Bandwidth Test v5.4.1
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size      Bandwidth (MB/s)
1                       0.10
2                       0.21
4                       0.41
8                       0.82
16                      1.65
32                      3.31
64                      6.62
128                    13.19
256                    26.37
512                    51.83
1024                  105.54
2048                  211.30
4096                  425.78
8192                  855.59
16384                1705.02
32768                3423.73
65536                6806.44
131072              13724.85
262144              27181.63
524288              53698.34
1048576            107921.84
2097152            213597.32
4194304            195706.50
[egabriel@compute-0-39 pt2pt]$ 
matcabral commented 6 years ago

[compute-0-39:08750] [ 6] /brazos/gabriel/OpenMPI-CUDA/lib/openmpi/mca_mtl_ofi.so(+0x3c2b)[0x7efd12a32c2b]

I see that this is using the OFI MTL, which does not have CUDA support. You should use the PSM2 MTL (I'm surprised it is not selected by default...): mpirun -mca pml cm -mca mtl psm2 ...

I assume your OMPI does have CUDA support, right?
ompi_info |grep -i cuda

edgargabriel commented 6 years ago

@matcabral yes, it is compiled with CUDA support, and forcing the psm2 MTL made the OSU benchmark work! That is good news, thanks!

Some of my own test cases are now also working, but a few still fail with a new error message:

[egabriel@compute-0-39 ~]$ mpirun --mca pml cm --mca mtl psm2 -np 2 ./main-osc
N0 >> on local device 0 on host compute-0-39.local 
N1 >> on local device 1 on host compute-0-39.local 
0: length=1 >> working
1: length=1 >> working
0: length=2 >> working
1: length=2 >> working
0: length=4 >> working
1: length=4 >> working
0: length=8 >> working
1: length=8 >> working
0: length=16 >> working
1: length=16 >> working
0: length=32 >> working
1: length=32 >> working
0: length=64 >> working
1: length=64 >> working
0: length=128 >> working
1: length=128 >> working
0: length=256 >> working
1: length=256 >> working
0: length=512 >> working
1: length=512 >> working
compute-0-39.local.10043main-osc: Check if cuda runtime is initializedbefore psm2_ep_open call 
compute-0-39.local.10043main-osc: CUDA failure: cudaEventRecord() (at /home/egabriel/opa-psm2/ptl_am/ptl.c:98)returned 33
compute-0-39.local.10043Error returned from CUDA function.

I will try to follow up on that tomorrow, Thanks for your help! I will keep you posted.

edgargabriel commented 6 years ago

output of ompi_info

[egabriel@compute-0-39 ~]$ ompi_info | grep -i cuda
                  Prefix: /brazos/gabriel/OpenMPI-CUDA
  Configure command line: '--prefix=/brazos/gabriel/OpenMPI-CUDA' '-with-cuda=/project/cacds/apps/easybuild/software/CUDA/9.1.85/' '--enable-debug'
          MPI extensions: affinity, cuda
                 MCA btl: smcuda (MCA v2.1.0, API v3.0.0, Component v3.1.0)
                MCA coll: cuda (MCA v2.1.0, API v2.0.0, Component v3.1.0)
edgargabriel commented 6 years ago

Good news: I have a slightly modified version of my test cases working as well. I will try to find some time in the next couple of days to distill why precisely my original version didn't work (in my opinion it should), but for now I am glad we have it working. I also still have to test the multi-node cases, but not tonight.

@matcabral thank you for your help!

matcabral commented 6 years ago

compute-0-39.local.10043main-osc: Check if cuda runtime is initializedbefore psm2_ep_open call
compute-0-39.local.10043main-osc: CUDA failure: cudaEventRecord() (at /home/egabriel/opa-psm2/ptl_am/ptl.c:98)returned 33
compute-0-39.local.10043Error returned from CUDA function.

This seems to be a GPU affinity issue. libpsm2 is initialized during MPI_Init() and sets GPU affinity to device 0 by default. If you try to change it after MPI_Init(), you will get the above error. Solution: call cudaSetDevice() before MPI_Init().
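
A minimal sketch of that ordering (assuming Open MPI's mpirun, which exports OMPI_COMM_WORLD_LOCAL_RANK to each process):

#include <stdlib.h>
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    /* select the GPU before MPI_Init(), so psm2_ep_open() (called inside
       MPI_Init) is initialized against the intended device */
    const char *lrank = getenv("OMPI_COMM_WORLD_LOCAL_RANK");
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    if (lrank != NULL && ndev > 0)
        cudaSetDevice(atoi(lrank) % ndev);

    MPI_Init(&argc, &argv);
    /* ... CUDA-aware communication ... */
    MPI_Finalize();
    return 0;
}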

edgargabriel commented 6 years ago

@matcabral thank you. You were right, I did the cudaSetDevice after MPI_Init (although the original user code did that before MPI_Init), and I can confirm that this resolved the last issue.

To resolve the psm2-over-ofi selection problem, I increased the psm2 priority in the mca-params.conf file; this seems to do the trick for now (a minimal example of the override is shown after the snippet below). I think the problem stems from this code snippet in ompi_mtl_psm2_component_register:

    if (num_local_procs == num_total_procs) {
        /* disable hfi if all processes are local */
        setenv("PSM2_DEVICES", "self,shm", 0);
        /* ob1 is much faster than psm2 with shared memory */
        param_priority = 10;
    } else {
        param_priority = 40;
    }
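
The override itself amounts to a one-line entry in the MCA parameters file. The parameter name below is assumed from the component's "priority" registration; adjust it if your build exposes a different name:

# $HOME/.openmpi/mca-params.conf  (or <prefix>/etc/openmpi-mca-params.conf)
# raise the PSM2 MTL priority above the all-local default of 10 set in the snippet above
mtl_psm2_priority = 50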

I am still waiting to hear back from the user whether his application also ran successfully. I will close the ticket, however; we can always reopen it if there are other issues. Thanks!

matcabral commented 6 years ago

Hi @edgargabriel,

Right, you are probably running all ranks locally. This piece of code was intended to favor the vader btl over the libpsm2 shm device: doing the memcpy higher in the stack is more efficient ;) . I will look into adding an #ifndef OPAL_CUDA_SUPPORT in there.
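
Roughly, such a guard could look like the following (a sketch only, not the final patch; in the OMPI tree OPAL_CUDA_SUPPORT is a 0/1 macro set by configure, so the test is on its value rather than on whether it is defined):

#if !OPAL_CUDA_SUPPORT
    if (num_local_procs == num_total_procs) {
        /* disable hfi if all processes are local */
        setenv("PSM2_DEVICES", "self,shm", 0);
        /* ob1 is much faster than psm2 with shared memory */
        param_priority = 10;
    } else {
        param_priority = 40;
    }
#else
    /* with CUDA support built in, keep the PSM2 MTL preferred even for all-local jobs */
    param_priority = 40;
#endif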

thanks!

edgargabriel commented 6 years ago

@matcabral I am afraid I have to reopen this issue. Our user is running into some new error messages. Just to recap, this is using Open MPI 3.1.0 and psm2-10.3-8 with CUDA 9.1.85. Basically, the job aborts after some time. He was able to boil it down to a test case with 2 nodes and 4 GPUs (two per node), and the error message is as follows:

compute-0-41.local.22165bluebottle: CUDA failure: cudaIpcOpenMemHandle() (at /project/cacds/build/opa-psm2-PSM2_10.3-8/ptl_am/am_cuda_memhandle_cache.c:281)returned 30
compute-0-41.local.22165Error returned from CUDA function.

[compute-0-41:22165] *** Process received signal ***
[compute-0-41:22165] Signal: Aborted (6)
[compute-0-41:22165] Signal code:  (-6)
[compute-0-41:22165] [ 0] /lib64/libpthread.so.0(+0xf370)[0x7fbcc70a3370]
[compute-0-41:22165] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x7fbcc6d081d7]
[compute-0-41:22165] [ 2] /lib64/libc.so.6(abort+0x148)[0x7fbcc6d098c8]
[compute-0-41:22165] [ 3] /project/cacds/apps/psm2/10.3-8-cuda/usr/lib64/libpsm2.so.2(+0x14e86)[0x7fbcb738ae86]
[compute-0-41:22165] [ 4] /project/cacds/apps/psm2/10.3-8-cuda/usr/lib64/libpsm2.so.2(+0x152dc)[0x7fbcb738b2dc]
[compute-0-41:22165] [ 5] /project/cacds/apps/psm2/10.3-8-cuda/usr/lib64/libpsm2.so.2(+0xe67a)[0x7fbcb738467a]
[compute-0-41:22165] [ 6] /project/cacds/apps/psm2/10.3-8-cuda/usr/lib64/libpsm2.so.2(+0xcd23)[0x7fbcb7382d23]
[compute-0-41:22165] [ 7] /project/cacds/apps/psm2/10.3-8-cuda/usr/lib64/libpsm2.so.2(psm2_mq_irecv2+0x321)[0x7fbcb7391ae1]
[compute-0-41:22165] [ 8] /project/cacds/apps/openmpi/3.1.0/gcc/lib/openmpi/mca_mtl_psm2.so(ompi_mtl_psm2_irecv+0xa8)[0x7fbcab3c
6c38]
[compute-0-41:22165] [ 9] /project/cacds/apps/openmpi/3.1.0/gcc/lib/openmpi/mca_pml_cm.so(mca_pml_cm_start+0xaf)[0x7fbcac00f09f]
[compute-0-41:22165] [10] /project/cacds/apps/openmpi/3.1.0/gcc/lib/openmpi/mca_osc_pt2pt.so(ompi_osc_pt2pt_irecv_w_cb+0x55)[0x7
fbca993a105]
[compute-0-41:22165] [11] /project/cacds/apps/openmpi/3.1.0/gcc/lib/openmpi/mca_osc_pt2pt.so(ompi_osc_pt2pt_component_irecv+0x39
)[0x7fbca993bb69]
[compute-0-41:22165] [12] /project/cacds/apps/openmpi/3.1.0/gcc/lib/openmpi/mca_osc_pt2pt.so(+0x14181)[0x7fbca993c181]
[compute-0-41:22165] [13] /project/cacds/apps/openmpi/3.1.0/gcc/lib/openmpi/mca_osc_pt2pt.so(ompi_osc_pt2pt_process_receive+0x15
e)[0x7fbca993d7be]
[compute-0-41:22165] [14] /project/cacds/apps/openmpi/3.1.0/gcc/lib/openmpi/mca_osc_pt2pt.so(+0x10c04)[0x7fbca9938c04]
[compute-0-41:22165] [15] /project/cacds/apps/openmpi/3.1.0/gcc/lib/libopen-pal.so.40(opal_progress+0x2c)[0x7fbcc61246bc]
[compute-0-41:22165] [16] /project/cacds/apps/openmpi/3.1.0/gcc/lib/openmpi/mca_pml_cm.so(+0x299d)[0x7fbcac00a99d]
[compute-0-41:22165] [17] /project/cacds/apps/openmpi/3.1.0/gcc/lib/openmpi/mca_coll_basic.so(mca_coll_basic_reduce_scatter_bloc
k_intra+0x179)[0x7fbcab1bf1f9]
[compute-0-41:22165] [18] /project/cacds/apps/openmpi/3.1.0/gcc/lib/openmpi/mca_coll_cuda.so(mca_coll_cuda_reduce_scatter_block+
0xd4)[0x7fbcaa16bba4]
[compute-0-41:22165] [19] /project/cacds/apps/openmpi/3.1.0/gcc/lib/openmpi/mca_osc_pt2pt.so(ompi_osc_pt2pt_fence+0xde)[0x7fbca9
93f38e]
[compute-0-41:22165] [20] /project/cacds/apps/openmpi/3.1.0/gcc/lib/libmpi.so.40(MPI_Win_fence+0x71)[0x7fbcc7550781]
[compute-0-41:22165] [21] ./bluebottle[0x45002e]
[compute-0-41:22165] [22] ./bluebottle[0x432445]
[compute-0-41:22165] [23] ./bluebottle[0x434070]
[compute-0-41:22165] [24] ./bluebottle[0x403b3f]
[compute-0-41:22165] [25] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7fbcc6cf4b35]
[compute-0-41:22165] [26] ./bluebottle[0x403eaf]
[compute-0-41:22165] *** End of error message ***

Any ideas on how to debug this? I tried to install the newest psm2 library version to see whether that solves the problem, but unfortunately that version does not compile on our cluster because of some errors stemming from the new gdrcopy feature.

aravindksg commented 6 years ago

Hi @edgargabriel, from the error log you provided, it looks like the error itself came from a CUDA API call failing. This could be a CUDA runtime issue (googling the error code seems to indicate as much). Could you please confirm that there is no discrepancy between the CUDA runtime version and the CUDA driver API? If there is a mismatch, it is likely you will see CUDA calls fail. (nvidia-smi should give you info about driver versions etc.)
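
A quick way to compare the two on a node (the binary name is the user's application from the backtrace above):

# driver version reported by the kernel-side driver
nvidia-smi --query-gpu=driver_version --format=csv,noheader
# CUDA runtime library the application actually resolves
ldd ./bluebottle | grep libcudart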

Beyond that, PSM2 did have some CUDA-related fixes in newer versions of the library, so using a newer libpsm2 and hfi1 driver might resolve the problems. (The PSM2 version you are using is more than 6 months old and was originally tested against CUDA 8.)

Regarding the compile issues with the new libpsm2 due to the gdrcopy feature: you will need the latest hfi1 driver component as well, so the easiest way to get all the relevant updates would be through IFS-10.7.

Link to latest install guide: IFS install guide

The following command should upgrade the currently installed PSM2 CUDA components: ./INSTALL -U -G

edgargabriel commented 6 years ago

@aravindksg thank you for your feedback. By discrepancy between the CUDA runtime version and the CUDA drivers, are you referring to a version mismatch between the libcudart.so file used and the client-side CUDA libraries? If yes, then this is clearly not the case. I double-checked our LD_LIBRARY_PATH, and there is no other directory that could accidentally be loaded from. In addition, we have a gazillion non-MPI jobs that run correctly in this setting; if it were a version mismatch, I would expect some of them to fail as well.

Regarding the software update, I will trigger that with our administrators. Since this is a production cluster, it can however take a while (the hfi component cannot be updated by a regular user as far as I can see; libpsm2 would be possible, however).

matcabral commented 6 years ago

Hi @edgargabriel,

https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html

cudaErrorUnknown = 30 This indicates that an unknown internal error has occurred.

Although this does not say much, it suggests something is going wrong in the CUDA stack. So the next logical step would be to scope the problem and try to reproduce the issue.

edgargabriel commented 6 years ago

An update on this item: here are the libraries the application links against (ldd output):

        linux-vdso.so.1 =>  (0x00007ffc359ef000)
        libm.so.6 => /lib64/libm.so.6 (0x00007ffb69c40000)
        libcgns.so.3.3 => /project/cacds/1033_wdaniel/CGNS/lib/libcgns.so.3.3 (0x00007ffb6997c000)
        libhdf5.so.101 => /project/cacds/apps/HDF5/1.10.1-gcc-openmpi/lib/libhdf5.so.101 (0x00007ffb693af000)
        libcudart.so.9.1 => /project/cacds/apps/easybuild/software/CUDA/9.1.85/lib64/libcudart.so.9.1 (0x00007ffb69141000)
        libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007ffb68e39000)
        libmpi.so.40 => /project/cacds/apps/openmpi/3.1.0/gcc/lib/libmpi.so.40 (0x00007ffb68b3c000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007ffb68926000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007ffb6870a000)
        libc.so.6 => /lib64/libc.so.6 (0x00007ffb68349000)
        /lib64/ld-linux-x86-64.so.2 (0x00007ffb69f42000)
        libz.so.1 => /lib64/libz.so.1 (0x00007ffb68133000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007ffb67f2f000)
        librt.so.1 => /lib64/librt.so.1 (0x00007ffb67d27000)
        libopen-rte.so.40 => /project/cacds/apps/openmpi/3.1.0/gcc/lib/libopen-rte.so.40 (0x00007ffb67a72000)
        libopen-pal.so.40 => /project/cacds/apps/openmpi/3.1.0/gcc/lib/libopen-pal.so.40 (0x00007ffb67768000)
        libnuma.so.1 => /lib64/libnuma.so.1 (0x00007ffb6755c000)
        libutil.so.1 => /lib64/libutil.so.1 (0x00007ffb67359000)

And I am not sure whether this is helpful, but here are all the RPMs I found on the node that have either cuda or nvidia in the name:

[egabriel@sabine ~]$ grep -i cuda rpmout.1 
kmod-ifs-kernel-updates-3.10.0_514.44.1.el7.x86_64-1514cuda.x86_64
libpsm2-10.3.35-1cuda.x86_64
libpsm2-compat-10.3.35-1cuda.x86_64
ifs-kernel-updates-devel-3.10.0_514.44.1.el7.x86_64-1514cuda.x86_64
mpitests_openmpi_gcc_cuda_hfi-3.2-931.x86_64
cuda-drivers-390.30-1.x86_64
libpsm2-devel-10.3.35-1cuda.x86_64
openmpi_gcc_cuda_hfi-2.1.2-18.el7.x86_64

[egabriel@sabine ~]$ grep -i nvidia rpmout.1 
xorg-x11-drv-nvidia-gl-390.30-1.el7.x86_64
nvidia-kmod-390.30-2.el7.x86_64
xorg-x11-drv-nvidia-devel-390.30-1.el7.x86_64
xorg-x11-drv-nvidia-390.30-1.el7.x86_64
pcp-pmda-nvidia-gpu-3.11.3-4.el7.x86_64
xorg-x11-drv-nvidia-libs-390.30-1.el7.x86_64

Anyway, any suggestions on what precisely to look for would be appreciated; I am out of ideas at this point.

matcabral commented 6 years ago

Hi @edgargabriel, all I can think of at this point is finding different workloads that use cudaIpcOpenMemHandle(), or even one not using MPI, to see how it behaves. In addition, if your workload is publicly available, we could try to reproduce it on our side.

edgargabriel commented 6 years ago

@matcabral I ran the simpleIPC test from the cuda9.1.85/samples directory. This test uses cudaIpcOpenMemHandle(), but as far as I understand it is only designed to run on one node. It used both GPUs and, as far as I can tell, finished correctly. I am still looking for an example that we could run across multiple nodes.

Regarding the code, the application is called bluebottle, and you can download it from github. https://github.com/groundcherry/bluebottle-3.0/tree/devel

There are very good instructions for compiling the application (it requires HDF5 and CGNS in addition to MPI). I can also send you the smallest test case the user could produce that reliably fails on our cluster; it requires 2 nodes with 4 GPUs total. I would however prefer to send you the dataset by email or provide the link separately, if that is ok.

rwmcguir commented 6 years ago

So the application links against the libraries shown above (libcudart.so.9.1), but there is another library from NVIDIA at play here. As an example, from a 9.1 install on a SLES 12.3 machine (also running 9.1.85, FYI):

/usr/lib64/libcuda.so.1 -> libcuda.so.387.26

This is what they call the userspace driver library, and it is what typically returns the error code 30 back into libcudart.so (the runtime). You can normally see these versions as well by running nvidia-smi:

$ nvidia-smi
Wed Jun 20 16:09:57 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 387.26                 Driver Version: 387.26                     |
.....
....
+-----------------------------------------------------------------------------+

Any mismatch between the runtime and the drivers is typically what causes the unknown error. I see that the RPM list you have installed includes a lot of stock in-distro NVIDIA drivers. Could there potentially be a conflict between the real NVIDIA drivers/software and the in-distro versions? For example, nvidia-kmod-390.30-2.el7.x86_64 seems newer, maybe too new; I would expect 387.26, but perhaps that is due to SLES vs. RHEL. Please double-check this.

edgargabriel commented 6 years ago

@rwmcguir: I went to the NVIDIA web page; when you try to download the RHEL RPMs for CUDA 9.1, there is a big message saying:

"Before installing the CUDA Toolkit on Linux, please ensure that you have the latest NVIDIA driver R390 installed. The latest NVIDIA R390 driver is available at: www.nvidia.com/drivers", see

https://developer.nvidia.com/cuda-91-download-archive?target_os=Linux&target_arch=x86_64&target_distro=RHEL

edgargabriel commented 6 years ago

@rwmcguir I also went to the CUDA driver download pages and configured the download for the drivers and CUDA version that we use; the recommended driver from NVIDIA was 390.46:

Version: 390.46
Release Date: 2018.3.28
Operating System: Linux 64-bit
CUDA Toolkit: 9.1
Language: English (US)
File Size: 78.33 MB
matcabral commented 5 years ago

@edgargabriel, I noticed this is still open. I understand libpsm2 updated a substantial number of things in the 11.x versions. Do you still see this issue?

edgargabriel commented 5 years ago

@matcabral I think we can close it for now. I am not sure whether we still see the issue or not; the last CUDA and PSM update that we did was in November. We also found a configuration on some (new) hardware that worked for this user (a single node with 8 GPUs), and because of that we haven't looked into this lately. Thanks!

paboyle commented 5 years ago

Is this a duplicate of this issue?

https://github.com/intel/opa-psm2/issues/41

and

https://github.com/open-mpi/ompi/issues/6799

Looks like it never got to a conclusion, but it might be the same effect.