I suspect that for some reason the alps odls layer isn't being used in ORTE. Could you rerun using mpirun with the following:
add this environment variable setting to your shell before running mpirun: export OMPI_MCA_odls_base_verbose=100
and add --debug-daemons to the mpirun command line?
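For example, the rerun could look something like this (a sketch; the application name and process count are placeholders, not taken from the report):
# enable verbose odls component selection and keep the ORTE daemons verbose
export OMPI_MCA_odls_base_verbose=100
mpirun --debug-daemons -np 2 ./my_app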
If you aren't using the alps odls (yes, the name is misleading; it's the one you want to be using to get the Aries RDMA credentials needed by the uGNI BTL), the uGNI btl can't initialize.
There is one other suggestion: don't explicitly request PMI support. Open MPI's configury should be able to detect the available cray-pmi support without you needing to specify --with-pmi on the configure line.
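For what it's worth, a configure invocation along these lines would follow that suggestion (a rough sketch only; the prefix, the Cray compiler wrappers, and the --with-cuda path are assumptions, the point being simply that --with-pmi is omitted so cray-pmi is auto-detected):
# hypothetical configure line on a Cray XC, without --with-pmi
./configure --prefix=$HOME/opt/openmpi-4.0.2 \
            --with-cuda=$CUDATOOLKIT_HOME \
            CC=cc CXX=CC FC=ftn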
@hppritcha Thank you! That seems to be the problem:
[nid03509:08523] mca:base:select:( odls) Querying component [pspawn]
[nid03509:08523] mca:base:select:( odls) Query of component [pspawn] set priority to 1
[nid03509:08523] mca:base:select:( odls) Querying component [default]
[nid03509:08523] mca:base:select:( odls) Query of component [default] set priority to 10
[nid03509:08523] mca:base:select:( odls) Querying component [alps]
[nid03509:08523] mca:base:select:( odls) Query of component [alps] set priority to 10
[nid03509:08523] mca:base:select:( odls) Selected component [default]
So I guess the priority is too low. I chose alps manually with export OMPI_MCA_odls=alps and things work now!
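For reference, an equivalent per-run override, instead of the exported environment variable, is to pass the MCA parameter on the mpirun command line (the application here is just a placeholder):
# force the alps odls component for a single run
mpirun --mca odls alps -np 2 ./my_app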
@hppritcha Compiling without the --with-pmi option also worked nicely, thanks!
Now I try to run the OSU benchmark with CUDA, and that doesn't go very well. I get the following segfault:
mpirun -bind-to numa -map-by node -np 2 ./osu_bw D D
# OSU MPI-CUDA Bandwidth Test v5.6.2
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size Bandwidth (MB/s)
[nid04276:26561] *** Process received signal ***
[nid04276:26561] Signal: Segmentation fault (11)
[nid04276:26561] Signal code: Invalid permissions (2)
[nid04276:26561] Failing at address: 0x3114e00000
[nid04276:26561] [ 0] /lib64/libpthread.so.0(+0x10c10)[0x2aaaaca97c10]
[nid04276:26561] [ 1] /lib64/libc.so.6(+0x12c98a)[0x2aaaacdd098a]
[nid04276:26561] [ 2] /scratch/snx3000/mkrotkie/software/openmpi/4.0.2-noucx/lib/libopen-pal.so.40(opal_convertor_pack+0x1ab)[0x2aaaad99c72b]
[nid04276:26561] [ 3] /scratch/snx3000/mkrotkie/software/openmpi/4.0.2-noucx/lib/openmpi/mca_btl_ugni.so(mca_btl_ugni_sendi+0x145)[0x2aaad1899535]
[nid04276:26561] [ 4] /scratch/snx3000/mkrotkie/software/openmpi/4.0.2-noucx/lib/openmpi/mca_pml_ob1.so(+0xb56f)[0x2aaad1f4c56f]
[nid04276:26561] [ 5] /scratch/snx3000/mkrotkie/software/openmpi/4.0.2-noucx/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_isend+0x4bf)[0x2aaad1f4d19f]
[nid04276:26561] [ 6] /scratch/snx3000/mkrotkie/software/openmpi/4.0.2-noucx/lib/libmpi.so.40(MPI_Isend+0x105)[0x2aaaabf38895]
[nid04276:26561] [ 7] ./osu_bw[0x40239f]
[nid04276:26561] [ 8] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2aaaaccc4735]
[nid04276:26561] [ 9] ./osu_bw[0x402589]
[nid04276:26561] *** End of error message ***
The test runs fine for host-host, and it runs fine for device-device transfers when I do not use the ugni btl (-mca btl ^ugni). Looking at the backtrace in the core dump:
#0 0x00002aaaacdd098a in __memcpy_avx_unaligned () from /lib64/libc.so.6
#1 0x00002aaaad99c72b in opal_convertor_pack () from /scratch/snx3000/mkrotkie/software/openmpi/4.0.2-noucx/lib/libopen-pal.so.40
#2 0x00002aaad1899535 in mca_btl_ugni_sendi () from /scratch/snx3000/mkrotkie/software/openmpi/4.0.2-noucx/lib/openmpi/mca_btl_ugni.so
#3 0x00002aaad1f4c56f in mca_pml_ob1_send_inline.constprop ()
from /scratch/snx3000/mkrotkie/software/openmpi/4.0.2-noucx/lib/openmpi/mca_pml_ob1.so
#4 0x00002aaad1f4d19f in mca_pml_ob1_isend () from /scratch/snx3000/mkrotkie/software/openmpi/4.0.2-noucx/lib/openmpi/mca_pml_ob1.so
#5 0x00002aaaabf38895 in PMPI_Isend () from /scratch/snx3000/mkrotkie/software/openmpi/4.0.2-noucx/lib/libmpi.so.40
#6 0x000000000040239f in main (argc=<optimized out>, argv=<optimized out>) at osu_bw.c:117
So it looks like I have some sort of memory access permission issue here. Does the above look familiar to you? Could it be a similar issue to the original one?
Thanks!
A short update: I checked that GDR support has been configured. From opal/include/opal_config.h:
/* Whether we have CUDA GDR support available */
#define OPAL_CUDA_GDR_SUPPORT 1
/* Whether we have CUDA cuPointerGetAttributes function available */
#define OPAL_CUDA_GET_ATTRIBUTES 1
/* Whether we want cuda device pointer support */
#define OPAL_CUDA_SUPPORT 1
/* Whether we have CUDA CU_POINTER_ATTRIBUTE_SYNC_MEMOPS support available */
#define OPAL_CUDA_SYNC_MEMOPS 1
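A quick way to confirm the same thing from the installed build, assuming a standard Open MPI 4.x ompi_info, is to query the build-time CUDA flag:
# should report "true" for a CUDA-aware build
ompi_info --parsable --all | grep mpi_built_with_cuda_support:value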
Also, the same benchmark works when compiled with Cray MPICH, so it seems there are no fundamental problems.
@hppritcha eh, that was of course my fault and our local configuration problem. The smcuda btl was not enabled, as I had btl = ugni,vader,self in my /etc/openmpi-mca-params.conf. Still, a segfault is a bit harsh, I guess. Anyway, now that I added smcuda to the list, things work.
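For reference, the resulting line in /etc/openmpi-mca-params.conf would look something like this (a sketch, simply adding smcuda to the list quoted above):
# smcuda added to the btl list so CUDA device buffers are handled
btl = ugni,smcuda,vader,self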
@hjelmn FYI, here are the device-to-device benchmark results (srun -n 2 ./osu_bw D D) when I use the ugni btl (btl = ugni,smcuda,self):
[nid03508:04907] mca: bml: Using self btl for send to [[50414,0],0] on node nid03508
[nid03509:16455] mca: bml: Using self btl for send to [[50414,0],1] on node nid03509
[nid03508:04907] mca: bml: Using ugni btl for send to [[50414,0],1] on node nid03509
[nid03509:16455] mca: bml: Using ugni btl for send to [[50414,0],0] on node nid03508
# OSU MPI-CUDA Bandwidth Test v5.6.2
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size Bandwidth (MB/s)
1 0.08
2 0.16
4 0.32
8 0.64
16 1.27
32 2.55
64 5.11
128 10.15
256 20.28
512 40.16
1024 78.23
2048 151.98
4096 286.25
8192 253.30
16384 353.97
32768 443.71
65536 508.65
131072 548.21
262144 569.37
524288 577.19
1048576 580.77
2097152 580.27
4194304 580.69
The performance is quite low: host-to-host using uGNI delivers ~9 GB/s here. Also, if I use the tcp btl, then I get better results than with uGNI for large messages (btl = tcp,smcuda,self):
[nid03508:02621] mca: bml: Using self btl for send to [[50322,0],0] on node nid03508
[nid03509:14485] mca: bml: Using self btl for send to [[50322,0],1] on node nid03509
[nid03508:02621] mca: bml: Using tcp btl for send to [[50322,0],1] on node nid03509
[nid03509:14485] mca: bml: Using tcp btl for send to [[50322,0],0] on node nid03508
# OSU MPI-CUDA Bandwidth Test v5.6.2
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size Bandwidth (MB/s)
1 0.06
2 0.11
4 0.23
8 0.45
16 0.90
32 1.79
64 3.59
128 7.17
256 14.33
512 28.68
1024 56.72
2048 109.46
4096 212.96
8192 397.62
16384 702.79
32768 1135.74
65536 1000.17
131072 1140.95
262144 1304.91
524288 1619.99
1048576 1837.98
2097152 1990.24
4194304 2065.93
As a side note, for large buffers Cray MPICH delivers 8.5 GB/s here:
# OSU MPI-CUDA Bandwidth Test v5.6.2
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size Bandwidth (MB/s)
1 0.47
2 0.94
4 1.88
8 3.73
16 7.46
32 14.98
64 29.88
128 59.42
256 120.08
512 240.06
1024 481.63
2048 948.49
4096 1735.01
8192 2417.07
16384 2866.11
32768 3595.97
65536 5220.70
131072 6732.95
262144 7663.98
524288 8172.24
1048576 8476.50
2097152 8619.59
4194304 8691.77
This is the first time I have run GPU device-to-device tests on a Cray, so I do not know how this should look. Also, I should note that at this point we do not have the gdrcopy module installed on Piz Daint, in case that would improve things.
@hjelmn, any chance you could comment on the above performance results? Thanks!
@angainor the uGNI BTL itself has no CUDA smarts whatsoever, whereas Cray MPICH has had quite a lot of work done to support GPU buffers. In particular, it has support for registering GPU memory with the Aries (at least for NVIDIA devices). Judging from the performance, there's at least a copy-in/copy-out happening when using the uGNI BTL, and possibly a non-lazy memory registration going on for the host-based bounce buffers.
@hppritcha Thanks for the info. Out of curiosity: you say that the Cray folks register the GPU memory with the interconnect. Do you mean that Aries supports GPUDirect RDMA? Or is there another way of doing this? I couldn't find anything informative about this.
Also, do you know if there will be an effort within Open MPI to optimize this case, or whether that will be targeted by UCX? Or will I have to use Cray MPICH on this system in the future?
Thanks!
@angainor sorry, I was on vacation and lost track of this. Yes, there are options in the uGNI API to register GPU memory with the Aries. There were modifications made to the Aries device driver to support this quite a while ago.
We hope to work with Cray to do better with Open MPI + GPUs on the Shasta-based systems.
@hppritcha Thanks for the update! Looking forward to this!
@angainor can we close this issue?
@hppritcha Yes, thanks for your help!
I'm struggling a bit with making Open MPI work on Piz Daint, which is a Cray XC50 system. For some reason, with the 4.0.1 and 4.0.2 releases I can't use the ugni btl when starting my jobs using mpirun. I configure Open MPI as follows: [...]
Then, I run my program: [...]
The same binary started with srun manages to initialize the ugni btl correctly: [...]
Not sure why that is, but it seems that on this system and with v4.x, when programs are started with mpirun they do not see the Aries interconnect for some reason. I thought this could be some permissions issue, but I've verified that with Open MPI v3.1.4 both startup methods work correctly.
Does anyone have a clue what could be the reason for such a change? Has anyone experienced a similar problem, or could it be the local system configuration and/or compilation options that are causing this? I'd appreciate any help. Thanks!
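For anyone hitting the same symptom, one way to compare what the two launchers select is to raise the BTL verbosity and look for the "mca: bml: Using ... btl" lines shown earlier in this thread (a sketch; the application and process counts are placeholders):
# compare BTL selection under mpirun vs srun
export OMPI_MCA_btl_base_verbose=100
mpirun -np 2 ./my_app 2>&1 | grep "mca: bml: Using"
srun -n 2 ./my_app 2>&1 | grep "mca: bml: Using"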