@matcabral
Hi @edgargabriel, I'll be looking at this. The libpsm2 version you have is new enough; the minimum is 10.2.175. See https://www.open-mpi.org/faq/?category=runcuda.
Would you please share the message size at which you see this abort?
@edgargabriel Please confirm that the PSM2 library is homogeneously built with CUDA support on all nodes. The error message you shared suggests the library may not have CUDA support:
...libpsm2-10.2.235/ptl_am/ptl.c:152: nbytes == req->recv_msglen
https://github.com/intel/opa-psm2/blob/PSM2_10.2-235/ptl_am/ptl.c#L152
It is definitely homogeneous; it is the version taken from the OpenHPC roll. At the very least it is identical on all GPU nodes, and I suspect it is in fact the same on all compute nodes.
Would you recommend that I recompile libpsm2? And if yes, is it possible to have multiple versions of libpsm2 on the system?
I would like to start by confirming that PSM2 has CUDA support. If you add -x PSM2_IDENTIFY to your mpirun command,
you should see <host-name>.<pid> PSM2_IDENTIFY PSM2 v2.1-cuda
Alternatively, an unofficial way to find out without running is grep cudaMemcpy /lib64/libpsm2.so
If it matches, you are fine. Note that there are no guarantees this "alternative" method will work in the future.
If CUDA support is not there, you will have to build the library yourself. See the instructions at:
https://github.com/intel/opa-psm2
If you do build, it would be advisable to choose a newer version.
Yes, you can have multiple versions of libpsm2 in the system. Just make sure to set LD_LIBRARY_PATH accordingly: mpirun ... -x LD_LIBRARY_PATH=<my_custom_libpsm2_path>
@matcabral I will try to get the information; my job is currently queued. In parallel, I will also try to compile a new version of the psm2 library. Thanks for your help!
I think you are probably right: our PSM2 library does not have CUDA support built in. It is not entirely clear to me how any of the tests worked in that case. Anyway, I will try to compile a new version of psm2 with CUDA support and will let you know.
It is not entirely clear to me how any of the tests worked in that case.
OMPI has native CUDA support, so it should work even with other transports (e.g. sockets, but I have not tested it). However, the PSM2 CUDA support in OMPI expects you to have a libpsm2 with CUDA support; you may get unexpected results if you mix them. Maybe some non-CUDA buffers are being sent? When you actually use PSM2 CUDA in OMPI (an OMPI CUDA build) with a libpsm2 CUDA build, you will get a significant performance boost.
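For illustration only (this sketch is not from the thread): "CUDA-aware" means a device pointer returned by cudaMalloc can be handed directly to MPI calls, and the transport (PSM2, ob1/smcuda, etc.) moves the data without an explicit host staging copy. A minimal two-rank sketch:

/* Illustrative CUDA-aware transfer: the cudaMalloc'd pointer goes straight to
 * MPI_Send/MPI_Recv instead of being copied to a host buffer first.
 * Run with exactly 2 ranks and a CUDA-aware MPI build. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;
    double *dbuf = NULL;
    cudaMalloc((void **)&dbuf, n * sizeof(double));   /* device buffer */

    if (rank == 0)
        MPI_Send(dbuf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);          /* device pointer */
    else if (rank == 1)
        MPI_Recv(dbuf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(dbuf);
    MPI_Finalize();
    return 0;
}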
This might be off topic for this item (and I would be happy to discuss it offline), but I have problems compiling psm2 with CUDA support. Without CUDA support the library compiles without any issues; the moment I set PSM_CUDA=1, however, I get error messages related to undefined symbols and structures, e.g.
In file included from /home/egabriel/opa-psm2/opa/opa_time.c:70:0:
/home/egabriel/opa-psm2/opa/../include/opa_user.h: In function 'hfi_update_tid':
/home/egabriel/opa-psm2/opa/../include/opa_user.h:811:26: error: storage size of 'tidinfo' isn't known
struct hfi1_tid_info_v2 tidinfo;
/home/egabriel/opa-psm2/opa/opa_service.c: In function '_hfi_cmd_ioctl':
/home/egabriel/opa-psm2/opa/opa_service.c:346:34: error: 'HFI1_IOCTL_TID_UPDATE_V2' undeclared (first use in this function)
[PSMI_HFI_CMD_TID_UPDATE_V2] = {HFI1_IOCTL_TID_UPDATE_V2 , 0},
I searched Google for solutions but could not find anything. I could also not find those symbols in the Linux kernel (e.g. kernel-source//include/uapi/rdma/hfi/ or similar). Any ideas/hints on what I am missing?
Quick answer: to achieve zero-copy transfers, libpsm2 uses a special version of the hfi1 driver (the OPA HFI driver). The driver headers you have available most likely don't have CUDA support. As you noticed, you will need the hfi1 driver with CUDA support loaded on the system. Please allow me to search where these details are publicly posted.
@edgargabriel are you using the Intel® Omni-Path Fabric Software package? https://downloadcenter.intel.com/download/27335/Intel-Omni-Path-Fabric-Software-Including-Intel-Omni-Path-Host-Fabric-Interface-Driver-?v=t
This is in fact the simplest way to get this setup. See the install guide: https://www.intel.com/content/dam/support/us/en/documents/network-and-i-o/fabric-products/Intel_OP_Fabric_Software_IG_H76467_v8_1.pdf
I suspect your nodes already satisfy the NVIDIA software requirements in section 4.4. Then proceed to 5.1.1 "./INSTALL -G" (Install GPUDirect* components). This will install the libpsm2 and hfi1 drivers with CUDA support, and in addition an OMPI build with CUDA support at /usr/mpi/gcc/openmpi-2.1.2-cuda-hfi/.
However, if you still want to build yourself, the source RPMs for all the components are also included.
Hi @edgargabriel, any news?
@matcabral: our system administrators updated the OPA stack on the cluster to include the CUDA-aware packages. It took a while since it is a production system, but it is finally done. I ran a couple of tests on Monday, but I still face some problems, although the error messages are now different. I will try to gather the precise cases and error messages.
@matcabral: before I post the error messages, I would like to clarify one point. The new software stack installed on the system does have CUDA support compiled into it. I can verify that in two ways: a) I can successfully compile my psm2 library using PSM_CUDA=1 (which I could not before), and b) if I run
[egabriel@compute-0-37 lib64]$ grep cudaMemcpy libpsm2.so
Binary file libpsm2.so matches
which it did not report before. However, if I use the first method you suggested, I still get an error message:
[egabriel@compute-0-37 ~]$ mpirun -x PSM2_IDENTIFY -np 2 ./main-osc
[compute-0-37.local:19532] Warning: could not find environment variable "PSM2_IDENTIFY"
Is that ok, or might this point to a problem?
-x PSM2_IDENTIFY=1
my bad :flushed:
Note that the hfi1 driver binary that is loaded must also be the CUDA one; check with modinfo hfi1.
ok, this looks better, thanks :-)
[egabriel@compute-0-39 ~]$ mpirun -x PSM2_IDENTIFY=1 -np 2 ./main-osc
compute-0-39.local.3281 PSM2_IDENTIFY PSM2 v2.1-cuda
compute-0-39.local.3281 PSM2_IDENTIFY location /usr/lib64/libpsm2.so.2
compute-0-39.local.3281 PSM2_IDENTIFY build date 2017-10-25 22:45:44+00:00
compute-0-39.local.3281 PSM2_IDENTIFY src checksum 4a3b39b93920ff4b7cb95ec90a1ff6d6df07d111
compute-0-39.local.3281 PSM2_IDENTIFY git checksum 61c8d25f4d7248c12cbdab63671a5bd237e81321
compute-0-39.local.3281 PSM2_IDENTIFY built against driver interface v6.3
compute-0-39.local.3280 PSM2_IDENTIFY PSM2 v2.1-cuda
compute-0-39.local.3280 PSM2_IDENTIFY location /usr/lib64/libpsm2.so.2
compute-0-39.local.3280 PSM2_IDENTIFY build date 2017-10-25 22:45:44+00:00
compute-0-39.local.3280 PSM2_IDENTIFY src checksum 4a3b39b93920ff4b7cb95ec90a1ff6d6df07d111
compute-0-39.local.3280 PSM2_IDENTIFY git checksum 61c8d25f4d7248c12cbdab63671a5bd237e81321
compute-0-39.local.3280 PSM2_IDENTIFY built against driver interface v6.3
First, the scenario I am working with right now is one node, two GPUs, and two MPI processes, each MPI process using one GPU.
I have three test cases (and once I figure out how to upload the code to GitHub I am happy to provide them). I am not excluding the possibility that something is wrong in my test cases.
N1 >> on local device 1 on host compute-0-37.local
N0 >> on local device 0 on host compute-0-37.local
1: length=1 >> working
0: length=1 >> working
1: length=2 >> working
0: length=2 >> working
1: length=4 >> working
0: length=4 >> working
1: length=8 >> working
0: length=8 >> working
1: length=16 >> working
0: length=16 >> working
1: length=32 >> working
0: length=32 >> working
1: length=64 >> working
0: length=64 >> working
1: length=128 >> working
0: length=128 >> working
1: length=256 >> working
0: length=256 >> working
1: length=512 >> working
0: length=512 >> working
1: length=1024 >> working
0: length=1024 >> working
1: length=2048 >> working
0: length=2048 >> working
1: length=4096 >> working
0: length=4096 >> working
1: length=8192 >> working
0: length=8192 >> working
1: length=16384 >> working
0: length=16384 >> working
1: length=32768 >> working
0: length=32768 >> working
1: length=65536 >> working
0: length=65536 >> working
N1 >> with local device 1 on host compute-0-37.local
N0 >> with local device 0 on host compute-0-37.local
[compute-0-37:18612] *** Process received signal ***
[compute-0-37:18612] Signal: Segmentation fault (11)
[compute-0-37:18612] Signal code: Invalid permissions (2)
[compute-0-37:18612] Failing at address: 0x7f2695600000
[compute-0-37:18612] [ 0] /usr/lib64/libpthread.so.0(+0xf370)[0x7f26bcec6370]
[compute-0-37:18612] [ 1] /usr/lib64/libpsm2.so.2(+0x64ed)[0x7f26b0bba4ed]
[compute-0-37:18612] [ 2] /usr/lib64/libpsm2.so.2(+0xb2ac)[0x7f26b0bbf2ac]
[compute-0-37:18612] [ 3] /usr/lib64/libpsm2.so.2(psm2_mq_send2+0x39)[0x7f26b0bd2739]
[compute-0-37:18612] [ 4] /usr/lib64/libfabric/libpsmx2-fi.so(+0x1bed7)[0x7f26a246bed7]
[compute-0-37:18612] [ 5] /brazos/gabriel/OpenMPI-CUDA/lib/openmpi/mca_mtl_ofi.so(+0x260e)[0x7f269998c60e]
[compute-0-37:18612] [ 6] /brazos/gabriel/OpenMPI-CUDA/lib/openmpi/mca_mtl_ofi.so(+0x3c2b)[0x7f269998dc2b]
[compute-0-37:18612] [ 7] /brazos/gabriel/OpenMPI-CUDA/lib/openmpi/mca_pml_cm.so(+0x4c77)[0x7f2699dcfc77]
[compute-0-37:18612] [ 8] /brazos/gabriel/OpenMPI-CUDA/lib/libmpi.so.40(MPI_Isend+0x2f1)[0x7f26bd387b1d]
[compute-0-37:18612] [ 9] ./main-p2p-2[0x403679]
[compute-0-37:18612] [10] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f26bc0f3b35]
[compute-0-37:18612] [11] ./main-p2p-2[0x403379]
[compute-0-37:18612] *** End of error message ***
[compute-0-37:18613] *** Process received signal ***
[compute-0-37:18613] Signal: Segmentation fault (11)
[compute-0-37:18613] Signal code: Invalid permissions (2)
[compute-0-37:18613] Failing at address: 0x7f112e000000
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
[compute-0-37:18613] [ 0] /usr/lib64/libpthread.so.0(+0xf370)[0x7f1162cae370]
[compute-0-37:18613] [ 1] /usr/lib64/libpsm2.so.2(+0x64ed)[0x7f11528d74ed]
[compute-0-37:18613] [ 2] /usr/lib64/libpsm2.so.2(+0xb2ac)[0x7f11528dc2ac]
[compute-0-37:18613] [ 3] /usr/lib64/libpsm2.so.2(psm2_mq_send2+0x39)[0x7f11528ef739]
[compute-0-37:18613] [ 4] /usr/lib64/libfabric/libpsmx2-fi.so(+0x1bed7)[0x7f1148324ed7]
[compute-0-37:18613] [ 5] /brazos/gabriel/OpenMPI-CUDA/lib/openmpi/mca_mtl_ofi.so(+0x260e)[0x7f114784660e]
[compute-0-37:18613] [ 6] /brazos/gabriel/OpenMPI-CUDA/lib/openmpi/mca_mtl_ofi.so(+0x3c2b)[0x7f1147847c2b]
[compute-0-37:18613] [ 7] /brazos/gabriel/OpenMPI-CUDA/lib/openmpi/mca_pml_cm.so(+0x4c77)[0x7f1147c89c77]
[compute-0-37:18613] [ 8] /brazos/gabriel/OpenMPI-CUDA/lib/libmpi.so.40(MPI_Isend+0x2f1)[0x7f116316fb1d]
[compute-0-37:18613] [ 9] ./main-p2p-2[0x403679]
[compute-0-37:18613] [10] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f1161edbb35]
[compute-0-37:18613] [11] ./main-p2p-2[0x403379]
[compute-0-37:18613] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node compute-0-37 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
N0 >> on local device 0 on host compute-0-37.local
N1 >> on local device 1 on host compute-0-37.local
0: length=1 >> working
1: length=1 >> working
0: length=2 >> working
1: length=2 >> working
0: length=4 >> working
1: length=4 >> working
0: length=8 >> working
1: length=8 >> working
0: length=16 >> working
1: length=16 >> working
0: length=32 >> working
1: length=32 >> working
0: length=64 >> working
1: length=64 >> working
0: length=128 >> working
1: length=128 >> working
0: length=256 >> working
1: length=256 >> working
0: length=512 >> working
1: length=512 >> working
[compute-0-37:18800] *** Process received signal ***
[compute-0-37:18801] *** Process received signal ***
[compute-0-37:18801] Signal: Segmentation fault (11)
[compute-0-37:18801] Signal code: Invalid permissions (2)
[compute-0-37:18801] Failing at address: 0x7f6e01c00000
[compute-0-37:18800] Signal: Segmentation fault (11)
[compute-0-37:18800] Signal code: Invalid permissions (2)
[compute-0-37:18800] Failing at address: 0x7fab08800000
[compute-0-37:18801] [ 0] [compute-0-37:18800] [ 0] /usr/lib64/libpthread.so.0(+0xf370)[0x7f6e2167a370]
[compute-0-37:18801] [ 1] /usr/lib64/libpthread.so.0(+0xf370)[0x7fab325ae370]
[compute-0-37:18800] [ 1] /usr/lib64/libpsm2.so.2(+0x4d118)[0x7fab262e9118]
[compute-0-37:18800] *** End of error message ***
/usr/lib64/libpsm2.so.2(+0x4d118)[0x7f6e153b5118]
[compute-0-37:18801] *** End of error message ***
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node compute-0-37 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
Note that the length is the number of elements of type MPI_DOUBLE, not the number of bytes.
Both cases should work. You can confirm with the OSU MPI benchmarks, which have CUDA support: http://mvapich.cse.ohio-state.edu/benchmarks/ . NOTE that OMPI does NOT yet support CUDA for non-blocking collectives: https://www.open-mpi.org/faq/?category=runcuda#mpi-apis-no-cuda
Well, the situation is pretty much the same. If I run an OSU benchmark directly using psm2, I get the same error message; if I tell mpirun to switch to ob1, everything works, even from device memory.
[egabriel@compute-0-39 pt2pt]$ mpirun -np 2 ./osu_bw -d cuda D D
# OSU MPI-CUDA Bandwidth Test v5.4.1
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size Bandwidth (MB/s)
[compute-0-39:08750] *** Process received signal ***
[compute-0-39:08750] Signal: Segmentation fault (11)
[compute-0-39:08750] Signal code: Invalid permissions (2)
[compute-0-39:08750] Failing at address: 0x7efd0fc00000
[compute-0-39:08750] [ 0] /usr/lib64/libpthread.so.0(+0xf370)[0x7efd4b878370]
[compute-0-39:08750] [ 1] /brazos/gabriel/OPA-PSM2/usr/lib64/libpsm2.so.2(+0x8929)[0x7efd29b46929]
[compute-0-39:08750] [ 2] /brazos/gabriel/OPA-PSM2/usr/lib64/libpsm2.so.2(+0xa343)[0x7efd29b48343]
[compute-0-39:08750] [ 3] /brazos/gabriel/OPA-PSM2/usr/lib64/libpsm2.so.2(psm2_mq_send2+0x2d)[0x7efd29b5975d]
[compute-0-39:08750] [ 4] /usr/lib64/libfabric/libpsmx2-fi.so(+0x1bed7)[0x7efd1b786ed7]
[compute-0-39:08750] [ 5] /brazos/gabriel/OpenMPI-CUDA/lib/openmpi/mca_mtl_ofi.so(+0x260e)[0x7efd12a3160e]
[compute-0-39:08750] [ 6] /brazos/gabriel/OpenMPI-CUDA/lib/openmpi/mca_mtl_ofi.so(+0x3c2b)[0x7efd12a32c2b]
[compute-0-39:08750] [ 7] /brazos/gabriel/OpenMPI-CUDA/lib/openmpi/mca_pml_cm.so(+0x4c77)[0x7efd12e74c77]
[compute-0-39:08750] [ 8] /brazos/gabriel/OpenMPI-CUDA/lib/libmpi.so.40(MPI_Isend+0x2f1)[0x7efd4c351b1d]
[compute-0-39:08750] [ 9] ./osu_bw[0x401f1e]
[compute-0-39:08750] [10] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7efd4b4c9b35]
[compute-0-39:08750] [11] ./osu_bw[0x40227b]
[compute-0-39:08750] *** End of error message ***
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node compute-0-39 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
[egabriel@compute-0-39 pt2pt]$ mpirun --mca pml ob1 -np 2 ./osu_bw -d cuda D D
# OSU MPI-CUDA Bandwidth Test v5.4.1
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size Bandwidth (MB/s)
1 0.10
2 0.21
4 0.41
8 0.82
16 1.65
32 3.31
64 6.62
128 13.19
256 26.37
512 51.83
1024 105.54
2048 211.30
4096 425.78
8192 855.59
16384 1705.02
32768 3423.73
65536 6806.44
131072 13724.85
262144 27181.63
524288 53698.34
1048576 107921.84
2097152 213597.32
4194304 195706.50
[egabriel@compute-0-39 pt2pt]$
[compute-0-39:08750] [ 6] /brazos/gabriel/OpenMPI-CUDA/lib/openmpi/mca_mtl_ofi.so(+0x3c2b)[0x7efd12a32c2b]
I see that this is using the OFI MTL, which does not have CUDA support. You should use the PSM2 MTL (I'm surprised it is not selected by default...):
mpirun -mca pml cm -mca mtl psm2 ....
I assume your OMPI build does have CUDA support, right?
ompi_info |grep -i cuda
@matcabral yes, it is compiled with CUDA support, and forcing the psm2 MTL made the OSU benchmark work! That is good news, thanks!
Some of my own test cases are now also working, but a few still fail with a new error message:
[egabriel@compute-0-39 ~]$ mpirun --mca pml cm --mca mtl psm2 -np 2 ./main-osc
N0 >> on local device 0 on host compute-0-39.local
N1 >> on local device 1 on host compute-0-39.local
0: length=1 >> working
1: length=1 >> working
0: length=2 >> working
1: length=2 >> working
0: length=4 >> working
1: length=4 >> working
0: length=8 >> working
1: length=8 >> working
0: length=16 >> working
1: length=16 >> working
0: length=32 >> working
1: length=32 >> working
0: length=64 >> working
1: length=64 >> working
0: length=128 >> working
1: length=128 >> working
0: length=256 >> working
1: length=256 >> working
0: length=512 >> working
1: length=512 >> working
compute-0-39.local.10043main-osc: Check if cuda runtime is initializedbefore psm2_ep_open call
compute-0-39.local.10043main-osc: CUDA failure: cudaEventRecord() (at /home/egabriel/opa-psm2/ptl_am/ptl.c:98)returned 33
compute-0-39.local.10043Error returned from CUDA function.
I will try to follow up on that tomorrow. Thanks for your help! I will keep you posted.
output of ompi_info
[egabriel@compute-0-39 ~]$ ompi_info | grep -i cuda
Prefix: /brazos/gabriel/OpenMPI-CUDA
Configure command line: '--prefix=/brazos/gabriel/OpenMPI-CUDA' '-with-cuda=/project/cacds/apps/easybuild/software/CUDA/9.1.85/' '--enable-debug'
MPI extensions: affinity, cuda
MCA btl: smcuda (MCA v2.1.0, API v3.0.0, Component v3.1.0)
MCA coll: cuda (MCA v2.1.0, API v2.0.0, Component v3.1.0)
Good news: I have a slightly modified version of my test cases working as well. I will try to find some time in the next couple of days to distill why precisely my original version didn't work (in my opinion it should), but for now I am glad we got it working. I also still have to test the multi-node cases, but not tonight.
@matcabral thank you for your help!
compute-0-39.local.10043main-osc: Check if cuda runtime is initializedbefore psm2_ep_open call
compute-0-39.local.10043main-osc: CUDA failure: cudaEventRecord() (at /home/egabriel/opa-psm2/ptl_am/ptl.c:98)returned 33
compute-0-39.local.10043Error returned from CUDA function.
This seems to be a GPU affinity issue. libpsm2 is initialized during MPI_Init() and sets GPU affinity to device 0 by default. Trying to change it after MPI_Init() will give the above error. Solution: call cudaSetDevice() before MPI_Init().
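A minimal sketch of that ordering (illustrative only, not the user's actual code; it assumes Open MPI's OMPI_COMM_WORLD_LOCAL_RANK environment variable is available to pick a device before MPI_Init()):

/* Select the GPU before MPI_Init() so libpsm2 initializes against the right device. */
#include <stdlib.h>
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int ndev = 0, local_rank = 0;
    const char *lr = getenv("OMPI_COMM_WORLD_LOCAL_RANK");  /* set by Open MPI's launcher */
    if (lr != NULL)
        local_rank = atoi(lr);

    cudaGetDeviceCount(&ndev);
    if (ndev > 0)
        cudaSetDevice(local_rank % ndev);   /* GPU affinity set BEFORE MPI_Init() */

    MPI_Init(&argc, &argv);
    /* ... CUDA-aware MPI communication on the selected device ... */
    MPI_Finalize();
    return 0;
}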
@matcabral thank you. You were right: I called cudaSetDevice after MPI_Init (although the original user code did it before MPI_Init), and I can confirm that this resolved the last issue.
To resolve the psm2-vs-ofi selection problem, I increased the psm2 priority in the mca-params.conf file; this seems to do the trick for now (an example entry is sketched below the snippet). I think the problem stems from this code snippet in ompi_mtl_psm2_component_register:
if (num_local_procs == num_total_procs) {
/* disable hfi if all processes are local */
setenv("PSM2_DEVICES", "self,shm", 0);
/* ob1 is much faster than psm2 with shared memory */
param_priority = 10;
} else {
param_priority = 40;
}
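For reference, the workaround described above amounts to a single entry in $HOME/.openmpi/mca-params.conf that raises the PSM2 MTL priority. This assumes the parameter is registered as mtl_psm2_priority (the exact name can be verified with ompi_info --param mtl psm2 --level 9), and 60 is just an arbitrary example value higher than the defaults in the snippet:

mtl_psm2_priority = 60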
I am still waiting to hear back from the user whether his application also ran successfully. I will close the ticket, however; we can always reopen it if there are other issues. Thanks!
Hi @edgargabriel,
Right, you are probably running all ranks locally. This piece of code was intended to favor the vader BTL over the libpsm2 shm device: doing the memcpy higher in the stack is more efficient ;). I will look into adding an #ifndef OPAL_CUDA_SUPPORT in there.
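A rough sketch of what such a change could look like (hypothetical, not the actual patch): keep the priority drop for all-local jobs only when OMPI is built without CUDA support, so a CUDA build keeps preferring the PSM2 MTL.

/* Hypothetical sketch, not the actual patch. Depending on how OPAL_CUDA_SUPPORT is
 * defined (0/1 vs. defined/undefined), this may need to be spelled "#if !OPAL_CUDA_SUPPORT". */
#ifndef OPAL_CUDA_SUPPORT
    if (num_local_procs == num_total_procs) {
        /* disable hfi if all processes are local */
        setenv("PSM2_DEVICES", "self,shm", 0);
        /* ob1 is much faster than psm2 with shared memory */
        param_priority = 10;
    } else {
        param_priority = 40;
    }
#else
    /* CUDA build: keep the PSM2 MTL preferred even when all ranks are local */
    param_priority = 40;
#endif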
thanks!
@matcabral I am afraid I have to reopen this issue. Our user is running into some new error messages. Just to recap, this is using Open MPI 3.1.0 and psm2-10.3-8 with CUDA 9.1.85. Basically, the job aborts after some time. He was able to boil it down to a test case with 2 nodes and 4 GPUs (two per node), and the error message is as follows:
compute-0-41.local.22165bluebottle: CUDA failure: cudaIpcOpenMemHandle() (at /project/cacds/build/opa-psm2-PSM2_10.3-8/ptl_am/am_cuda_memhandle_cache.c:281)returned 30
compute-0-41.local.22165Error returned from CUDA function.
[compute-0-41:22165] *** Process received signal ***
[compute-0-41:22165] Signal: Aborted (6)
[compute-0-41:22165] Signal code: (-6)
[compute-0-41:22165] [ 0] /lib64/libpthread.so.0(+0xf370)[0x7fbcc70a3370]
[compute-0-41:22165] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x7fbcc6d081d7]
[compute-0-41:22165] [ 2] /lib64/libc.so.6(abort+0x148)[0x7fbcc6d098c8]
[compute-0-41:22165] [ 3] /project/cacds/apps/psm2/10.3-8-cuda/usr/lib64/libpsm2.so.2(+0x14e86)[0x7fbcb738ae86]
[compute-0-41:22165] [ 4] /project/cacds/apps/psm2/10.3-8-cuda/usr/lib64/libpsm2.so.2(+0x152dc)[0x7fbcb738b2dc]
[compute-0-41:22165] [ 5] /project/cacds/apps/psm2/10.3-8-cuda/usr/lib64/libpsm2.so.2(+0xe67a)[0x7fbcb738467a]
[compute-0-41:22165] [ 6] /project/cacds/apps/psm2/10.3-8-cuda/usr/lib64/libpsm2.so.2(+0xcd23)[0x7fbcb7382d23]
[compute-0-41:22165] [ 7] /project/cacds/apps/psm2/10.3-8-cuda/usr/lib64/libpsm2.so.2(psm2_mq_irecv2+0x321)[0x7fbcb7391ae1]
[compute-0-41:22165] [ 8] /project/cacds/apps/openmpi/3.1.0/gcc/lib/openmpi/mca_mtl_psm2.so(ompi_mtl_psm2_irecv+0xa8)[0x7fbcab3c6c38]
[compute-0-41:22165] [ 9] /project/cacds/apps/openmpi/3.1.0/gcc/lib/openmpi/mca_pml_cm.so(mca_pml_cm_start+0xaf)[0x7fbcac00f09f]
[compute-0-41:22165] [10] /project/cacds/apps/openmpi/3.1.0/gcc/lib/openmpi/mca_osc_pt2pt.so(ompi_osc_pt2pt_irecv_w_cb+0x55)[0x7fbca993a105]
[compute-0-41:22165] [11] /project/cacds/apps/openmpi/3.1.0/gcc/lib/openmpi/mca_osc_pt2pt.so(ompi_osc_pt2pt_component_irecv+0x39)[0x7fbca993bb69]
[compute-0-41:22165] [12] /project/cacds/apps/openmpi/3.1.0/gcc/lib/openmpi/mca_osc_pt2pt.so(+0x14181)[0x7fbca993c181]
[compute-0-41:22165] [13] /project/cacds/apps/openmpi/3.1.0/gcc/lib/openmpi/mca_osc_pt2pt.so(ompi_osc_pt2pt_process_receive+0x15e)[0x7fbca993d7be]
[compute-0-41:22165] [14] /project/cacds/apps/openmpi/3.1.0/gcc/lib/openmpi/mca_osc_pt2pt.so(+0x10c04)[0x7fbca9938c04]
[compute-0-41:22165] [15] /project/cacds/apps/openmpi/3.1.0/gcc/lib/libopen-pal.so.40(opal_progress+0x2c)[0x7fbcc61246bc]
[compute-0-41:22165] [16] /project/cacds/apps/openmpi/3.1.0/gcc/lib/openmpi/mca_pml_cm.so(+0x299d)[0x7fbcac00a99d]
[compute-0-41:22165] [17] /project/cacds/apps/openmpi/3.1.0/gcc/lib/openmpi/mca_coll_basic.so(mca_coll_basic_reduce_scatter_block_intra+0x179)[0x7fbcab1bf1f9]
[compute-0-41:22165] [18] /project/cacds/apps/openmpi/3.1.0/gcc/lib/openmpi/mca_coll_cuda.so(mca_coll_cuda_reduce_scatter_block+0xd4)[0x7fbcaa16bba4]
[compute-0-41:22165] [19] /project/cacds/apps/openmpi/3.1.0/gcc/lib/openmpi/mca_osc_pt2pt.so(ompi_osc_pt2pt_fence+0xde)[0x7fbca993f38e]
[compute-0-41:22165] [20] /project/cacds/apps/openmpi/3.1.0/gcc/lib/libmpi.so.40(MPI_Win_fence+0x71)[0x7fbcc7550781]
[compute-0-41:22165] [21] ./bluebottle[0x45002e]
[compute-0-41:22165] [22] ./bluebottle[0x432445]
[compute-0-41:22165] [23] ./bluebottle[0x434070]
[compute-0-41:22165] [24] ./bluebottle[0x403b3f]
[compute-0-41:22165] [25] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7fbcc6cf4b35]
[compute-0-41:22165] [26] ./bluebottle[0x403eaf]
[compute-0-41:22165] *** End of error message ***
Any ideas on how to debug this? I tried to install the newest psm2 library version to see whether that solves the problem, but unfortunately that version does not compile on our cluster because of errors stemming from the new gdrcopy feature.
Hi @edgargabriel, from the error log you provided, it looks like the error itself comes from a CUDA API call failing. This could be a CUDA runtime issue (Googling around for the error code seems to indicate as much). Could you please confirm that there is no discrepancy between the CUDA runtime version and the CUDA driver APIs? If there is a mismatch, it is likely you will see CUDA calls fail (nvidia-smi should give you info about driver versions, etc.).
Beyond that, PSM2 did have some CUDA-related fixes in newer versions of the library, so it might be that using a newer version of libpsm2 and the hfi1 driver resolves the problem. (The PSM2 version you are using is more than 6 months old and was originally tested with CUDA 8.)
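If it helps, a tiny standalone check (an illustrative sketch, not part of the thread) can print the CUDA version the installed driver supports next to the runtime version the application actually loads:

/* Compare the driver-supported CUDA version against the runtime (libcudart) version. */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int driver = 0, runtime = 0;
    cudaDriverGetVersion(&driver);    /* highest CUDA version the installed driver supports */
    cudaRuntimeGetVersion(&runtime);  /* version of the CUDA runtime in use */
    printf("driver supports CUDA %d, runtime is CUDA %d\n", driver, runtime);
    if (driver < runtime)
        printf("runtime is newer than the driver supports -- likely source of errors\n");
    return 0;
}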
Regarding the compile issues with the new libpsm2 due to the gdrcopy feature: you will need the latest hfi1 driver component as well. So the easiest way to get all the relevant updates would be through IFS 10.7.
Link to latest install guide: IFS install guide
The following command should work to upgrade the currently installed PSM2 CUDA components: ./INSTALL -U -G
@aravindksg thank you for your feedback. By a discrepancy between the CUDA runtime version and the CUDA drivers, are you referring to a version mismatch between the libcudart.so file used and the client-side CUDA libraries? If so, this is clearly not the case. I double-checked our LD_LIBRARY_PATH, and there is no other directory that could accidentally be loaded from. In addition, we have a gazillion non-MPI jobs that run correctly in this setting; if it were a version mismatch, I would expect some of them to fail as well.
Regarding the software update, I will trigger that with our administrators. Since this is a production cluster, it can however take a while (the hfi component cannot be updated by a regular user as far as I can see; libpsm2 would be possible, however).
Hi @edgargabriel,
https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html
cudaErrorUnknown = 30 This indicates that an unknown internal error has occurred.
Although this does not say much, it suggests something is going wrong in the CUDA stack. So it seems the next logical step would be to scope the problem and try to reproduce it.
An update on this item:
Regarding the version mismatch: I did dig around a bit, but I cannot find where a version mismatch between the individual CUDA libraries could come from. As users, we load the CUDA module that we would like to use, e.g. CUDA/9.1.85 in this case, and this sets up the LD_LIBRARY_PATH to load the corresponding libraries from that directory.
There is another directory that might be accessed, which is set up through the /etc/ld.so.conf.d/ mechanism. All libraries in this directory stem from a single RPM package, namely xorg-x11-drv-nvidia-devel-390.30-1.el7.x86_64. I could not find anything on the internet or the NVIDIA web pages about incompatibilities between the libraries in this RPM and any CUDA release. In addition, looking at the output of the ldd command for the application, I do not see any of the libraries from that directory being used.
linux-vdso.so.1 => (0x00007ffc359ef000)
libm.so.6 => /lib64/libm.so.6 (0x00007ffb69c40000)
libcgns.so.3.3 => /project/cacds/1033_wdaniel/CGNS/lib/libcgns.so.3.3 (0x00007ffb6997c000)
libhdf5.so.101 => /project/cacds/apps/HDF5/1.10.1-gcc-openmpi/lib/libhdf5.so.101 (0x00007ffb693af000)
libcudart.so.9.1 => /project/cacds/apps/easybuild/software/CUDA/9.1.85/lib64/libcudart.so.9.1 (0x00007ffb69141000)
libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007ffb68e39000)
libmpi.so.40 => /project/cacds/apps/openmpi/3.1.0/gcc/lib/libmpi.so.40 (0x00007ffb68b3c000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007ffb68926000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007ffb6870a000)
libc.so.6 => /lib64/libc.so.6 (0x00007ffb68349000)
/lib64/ld-linux-x86-64.so.2 (0x00007ffb69f42000)
libz.so.1 => /lib64/libz.so.1 (0x00007ffb68133000)
libdl.so.2 => /lib64/libdl.so.2 (0x00007ffb67f2f000)
librt.so.1 => /lib64/librt.so.1 (0x00007ffb67d27000)
libopen-rte.so.40 => /project/cacds/apps/openmpi/3.1.0/gcc/lib/libopen-rte.so.40 (0x00007ffb67a72000)
libopen-pal.so.40 => /project/cacds/apps/openmpi/3.1.0/gcc/lib/libopen-pal.so.40 (0x00007ffb67768000)
libnuma.so.1 => /lib64/libnuma.so.1 (0x00007ffb6755c000)
libutil.so.1 => /lib64/libutil.so.1 (0x00007ffb67359000)
I am not sure whether this is helpful, but here are all the RPMs I found on the node that have either cuda or nvidia in the name:
[egabriel@sabine ~]$ grep -i cuda rpmout.1
kmod-ifs-kernel-updates-3.10.0_514.44.1.el7.x86_64-1514cuda.x86_64
libpsm2-10.3.35-1cuda.x86_64
libpsm2-compat-10.3.35-1cuda.x86_64
ifs-kernel-updates-devel-3.10.0_514.44.1.el7.x86_64-1514cuda.x86_64
mpitests_openmpi_gcc_cuda_hfi-3.2-931.x86_64
cuda-drivers-390.30-1.x86_64
libpsm2-devel-10.3.35-1cuda.x86_64
openmpi_gcc_cuda_hfi-2.1.2-18.el7.x86_64
[egabriel@sabine ~]$ grep -i nvidia rpmout.1
xorg-x11-drv-nvidia-gl-390.30-1.el7.x86_64
nvidia-kmod-390.30-2.el7.x86_64
xorg-x11-drv-nvidia-devel-390.30-1.el7.x86_64
xorg-x11-drv-nvidia-390.30-1.el7.x86_64
pcp-pmda-nvidia-gpu-3.11.3-4.el7.x86_64
xorg-x11-drv-nvidia-libs-390.30-1.el7.x86_64
Anyway, any suggestions on what precisely to look for would be appreciated; I am out of ideas at this point.
Hi @edgargabriel, all I can think of at this point is trying different workloads that use cudaIpcOpenMemHandle(), or even one not using MPI, to see how it behaves. In addition, if your workload is publicly available, we could try to reproduce it on our side.
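As an illustration of such a non-MPI reproducer (a hypothetical sketch, not something from this thread), two processes on one node can exercise cudaIpcGetMemHandle()/cudaIpcOpenMemHandle(), with the fork placed before any CUDA call so that each process gets its own clean CUDA context:

/* Parent allocates device memory and exports an IPC handle over a pipe;
 * the forked child opens the handle with cudaIpcOpenMemHandle(). */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>
#include <cuda_runtime.h>

#define CHECK(x) do { cudaError_t e = (x); if (e != cudaSuccess) { \
    fprintf(stderr, "%s failed: %s\n", #x, cudaGetErrorString(e)); exit(1); } } while (0)

int main(void)
{
    int fd[2];
    if (pipe(fd) != 0) { perror("pipe"); return 1; }

    pid_t pid = fork();            /* fork BEFORE any CUDA call */
    if (pid == 0) {                /* child: import the handle exported by the parent */
        cudaIpcMemHandle_t handle;
        close(fd[1]);
        if (read(fd[0], &handle, sizeof(handle)) != sizeof(handle)) return 1;
        CHECK(cudaSetDevice(0));
        void *ptr = NULL;
        CHECK(cudaIpcOpenMemHandle(&ptr, handle, cudaIpcMemLazyEnablePeerAccess));
        printf("child: cudaIpcOpenMemHandle succeeded\n");
        CHECK(cudaIpcCloseMemHandle(ptr));
        return 0;
    }
    /* parent: allocate device memory and export the IPC handle */
    close(fd[0]);
    CHECK(cudaSetDevice(0));
    void *dbuf = NULL;
    CHECK(cudaMalloc(&dbuf, 1 << 20));
    cudaIpcMemHandle_t handle;
    CHECK(cudaIpcGetMemHandle(&handle, dbuf));
    if (write(fd[1], &handle, sizeof(handle)) != sizeof(handle)) return 1;
    waitpid(pid, NULL, 0);
    CHECK(cudaFree(dbuf));
    return 0;
}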
@matcabral I ran the simpleIPC test from the cuda9.1.85/samples directory. This test uses cudaIpcOpenMemHandle(), but as far as I understand it is only designed to run on a single node. It used both GPUs and, as far as I can tell, finished correctly. I am still looking for an example that we could run across multiple nodes.
Regarding the code, the application is called bluebottle, and you can download it from GitHub: https://github.com/groundcherry/bluebottle-3.0/tree/devel
There are very good instructions for compiling the application (it requires HDF5 and CGNS in addition to MPI). I can also send you the smallest test case the user could produce that reliably fails on our cluster. It requires 2 nodes with 4 GPUs total. I would however prefer to send you the dataset by email or provide the link separately, if that is ok.
So the application links against the libraries shown above (libcudart.so.9.1), but there is another NVIDIA library at play here; for example, from a 9.1 install on a SLES 12.3 machine (also running 9.1.85, FYI):
/usr/lib64/libcuda.so.1 -> libcuda.so.387.26
This is what they call the userspace driver library, and it is what typically returns the error code 30 back into libcudart.so (the runtime). You can normally also see this version by running:
$ nvidia-smi
nvidia-smi
Wed Jun 20 16:09:57 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 387.26 Driver Version: 387.26
.....
Any mismatch between the runtime and the drivers is typically what causes the unknown error. I see from the RPM list that you have a lot of stock in-distro NVIDIA driver packages installed. Could there potentially be a conflict between the real NVIDIA drivers/software and the in-distro versions?
For instance, nvidia-kmod-390.30-2.el7.x86_64 seems newer, maybe too new; I would be expecting 387.26, but perhaps that is due to SLES vs. RHEL. Please double-check this.
@rwmcguir: I went to the NVIDIA web page; when you try to download the RHEL RPMs for CUDA 9.1, there is a big message saying:
"Before installing the CUDA Toolkit on Linux, please ensure that you have the latest NVIDIA driver R390 installed. The latest NVIDIA R390 driver is available at: www.nvidia.com/drivers", see
@rwmcguir I also went to the CUDA driver download pages and configured a download for the driver and CUDA version that we use, and the recommended driver from NVIDIA was 390.46:
Version: 390.46
Release Date: 2018.3.28
Operating System: Linux 64-bit
CUDA Toolkit: 9.1
Language: English (US)
File Size: 78.33 MB
@edgargabriel, I noticed this is still open. I understand libpsm2 updated a substantial number of things in the 11.x versions. Do you still see this issue?
@matcabral I think we can close it for now. I am not sure whether we still see the issue or not; the last CUDA and PSM update we did was in November. We did, however, find a configuration on some (new) hardware that worked for this user (a single node with 8 GPUs), and because of that we haven't looked into this lately. Thanks!
Is this a duplicate of this issue?
https://github.com/intel/opa-psm2/issues/41
and
https://github.com/open-mpi/ompi/issues/6799
Looks like it never got to a conclusion, but it might be the same effect.
Thank you for taking the time to submit an issue!
Background information
What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)
v3.1.0rc2 and master
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
git clone
Please describe the system on which you are running
Details of the problem
We have a user code that makes use of CUDA-aware MPI features for direct data transfer across multiple GPUs. The code has been used successfully on fairly large InfiniBand clusters. However, we face a problem when executing it on our Omni-Path cluster.
@bosilca pointed out to me the following commit
https://github.com/open-mpi/ompi/commit/2e83cf15ce790f89c782b6222253ab18252a7d2f
which is the reason we turned to the 3.1 release candidate, since this commit is part of that version.
The good news is that, using OMPI 3.1.0rc2, the code runs correctly in a single-node / multi-GPU environment. Running the code on multiple nodes with multiple GPUs still fails, however. A simple benchmark was able to identify that direct transfers from GPU memory across multiple nodes work correctly up to a certain message length, but fail once the message length exceeds a threshold. The error message comes directly from the psm2 library and is attached below.
My question now is whether there is a minimum psm2 library version required to make this feature work correctly. Our cluster currently uses libpsm2-10.2.235, and there are obviously newer versions out there (the newest being 10.3.37, I think).
As a side note, we did manage to make the code work by using the verbs API and disabling cuda_async_recv, but this slows down the communication performance quite a bit compared to using the psm2 library directly.