open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

Problem with mpirun + ugni btl on Cray XC50 #7064

Closed: angainor closed this issue 4 years ago

angainor commented 5 years ago

I'm struggling a bit to make Open MPI work on Piz Daint, a Cray XC50 system. For some reason, with the 4.0.1 and 4.0.2 releases I can't use the ugni btl when starting my jobs with mpirun. I configure Open MPI as follows:

module unload PrgEnv-cray
module load PrgEnv-gnu
module switch gcc/8.3.0
./configure --prefix=<path> --with-pmi=/opt/cray/pe/pmi/5.0.14/ --enable-mca-no-build=btl-uct

Then, I run my program:

mpirun -map-by node -mca btl_base_verbose 100 -mca plm_base_verbose 100 ./mpitest
[...]
[nid04277:17445] [[21474,0],0] plm:slurm: final top-level argv:
        srun --ntasks-per-node=1 --kill-on-bad-exit --cpu_bind=none --ntasks=2 orted -mca ess "slurm" -mca ess_base_jobid "1407320064" -mca ess_base_vpid "1" -mca ess_base_num_procs "3" -mca orte_node_regex "mpirun,nid[5:4277-4278]@0(3)" -mca orte_hnp_uri "1407320064.0;tcp://148.187.48.214:52147"
[...]
[nid04277:17586] mca: base: components_open: found loaded component ugni
[nid04277:17586] mca: base: components_open: component ugni open function successful
[nid04277:17586] select: initializing btl component ugni
[nid04277:17586] select: init of component ugni returned failure
[nid04278:27844] mca: bml: Using self btl for send to [[21474,1],1] on node nid04278
[nid04277:17586] mca: bml: Using self btl for send to [[21474,1],0] on node nid04277
[nid04278:27844] mca: bml: Using tcp btl for send to [[21474,1],0] on node nid04277
[nid04277:17586] mca: bml: Using tcp btl for send to [[21474,1],1] on node nid04278
[...]

The same binary started with srun manages to initialize the ugni btl correctly:

srun ./mpitest
[...]
[nid04278:27911] mca: base: components_open: found loaded component ugni
[nid04278:27911] mca: base: components_open: component ugni open function successful
[nid04278:27911] select: initializing btl component ugni
[nid04277:17732] select: init of component ugni returned success
[nid04278:27911] mca: bml: Using self btl for send to [[287,0],1] on node nid04278
[nid04277:17732] mca: bml: Using self btl for send to [[287,0],0] on node nid04277
[nid04278:27911] mca: bml: Using ugni btl for send to [[287,0],0] on node nid04277
[nid04277:17732] mca: bml: Using ugni btl for send to [[287,0],1] on node nid04278
[...]

I'm not sure why, but it seems that on this system, with v4.x, programs started with mpirun do not see the Aries interconnect. I thought this could be a permissions issue, but I've verified that with ompi v3.1.4 both startup methods work correctly.

Does anyone have a clue what could be the reason for such a change? Has anyone experienced a similar problem, or could the local system configuration and/or compilation options be causing this? I'd appreciate any help. Thanks!

hppritcha commented 5 years ago

I suspect that for some reason the alps odls component isn't being used in ORTE. Could you rerun with mpirun as follows:

Add this environment variable to your shell before running mpirun: export OMPI_MCA_odls_base_verbose=100

And add --debug-daemons to the mpirun command line.
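
Putting the two together, it would look roughly like this (reusing the mpitest command from above):

export OMPI_MCA_odls_base_verbose=100
mpirun -map-by node --debug-daemons -mca btl_base_verbose 100 ./mpitest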

If you aren't using the alps odls (yes, the name is misleading; it's the one you want to be using to get the Aries RDMA credentials needed by the uGNI BTL), the uGNI btl can't initialize.

One other suggestion: don't explicitly request PMI support. The Open MPI configury should be able to detect the available cray-pmi support without you specifying --with-pmi on the configure line.
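
In other words, something like (the same configure line as above, just without --with-pmi):

./configure --prefix=<path> --enable-mca-no-build=btl-uct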

angainor commented 5 years ago

@hppritcha Thank you! That seems to be the problem:

[nid03509:08523] mca:base:select:( odls) Querying component [pspawn]
[nid03509:08523] mca:base:select:( odls) Query of component [pspawn] set priority to 1
[nid03509:08523] mca:base:select:( odls) Querying component [default]
[nid03509:08523] mca:base:select:( odls) Query of component [default] set priority to 10
[nid03509:08523] mca:base:select:( odls) Querying component [alps]
[nid03509:08523] mca:base:select:( odls) Query of component [alps] set priority to 10
[nid03509:08523] mca:base:select:( odls) Selected component [default]

So I guess the alps priority is too low. I selected alps manually with export OMPI_MCA_odls=alps and things work now!
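
For reference, the same selection can also be made on the mpirun command line instead of via the environment, e.g.:

mpirun -mca odls alps -map-by node ./mpitest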

angainor commented 5 years ago

@hppritcha Compiling without the --with-pmi option also worked nicely, thanks!

Now I'm trying to run the OSU benchmark with CUDA, and that doesn't go very well. I get the following segfault:

mpirun -bind-to numa -map-by node -np 2 ./osu_bw D D
# OSU MPI-CUDA Bandwidth Test v5.6.2
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size      Bandwidth (MB/s)
[nid04276:26561] *** Process received signal ***
[nid04276:26561] Signal: Segmentation fault (11)
[nid04276:26561] Signal code: Invalid permissions (2)
[nid04276:26561] Failing at address: 0x3114e00000
[nid04276:26561] [ 0] /lib64/libpthread.so.0(+0x10c10)[0x2aaaaca97c10]
[nid04276:26561] [ 1] /lib64/libc.so.6(+0x12c98a)[0x2aaaacdd098a]
[nid04276:26561] [ 2] /scratch/snx3000/mkrotkie/software/openmpi/4.0.2-noucx/lib/libopen-pal.so.40(opal_convertor_pack+0x1ab)[0x2aaaad99c72b]
[nid04276:26561] [ 3] /scratch/snx3000/mkrotkie/software/openmpi/4.0.2-noucx/lib/openmpi/mca_btl_ugni.so(mca_btl_ugni_sendi+0x145)[0x2aaad1899535]
[nid04276:26561] [ 4] /scratch/snx3000/mkrotkie/software/openmpi/4.0.2-noucx/lib/openmpi/mca_pml_ob1.so(+0xb56f)[0x2aaad1f4c56f]
[nid04276:26561] [ 5] /scratch/snx3000/mkrotkie/software/openmpi/4.0.2-noucx/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_isend+0x4bf)[0x2aaad1f4d19f]
[nid04276:26561] [ 6] /scratch/snx3000/mkrotkie/software/openmpi/4.0.2-noucx/lib/libmpi.so.40(MPI_Isend+0x105)[0x2aaaabf38895]
[nid04276:26561] [ 7] ./osu_bw[0x40239f]
[nid04276:26561] [ 8] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2aaaaccc4735]
[nid04276:26561] [ 9] ./osu_bw[0x402589]
[nid04276:26561] *** End of error message ***

The test runs fine host-to-host, and it runs fine for device-to-device transfers when I do not use the ugni btl (-mca btl ^ugni). Looking at the backtrace in the core dump:

#0  0x00002aaaacdd098a in __memcpy_avx_unaligned () from /lib64/libc.so.6
#1  0x00002aaaad99c72b in opal_convertor_pack () from /scratch/snx3000/mkrotkie/software/openmpi/4.0.2-noucx/lib/libopen-pal.so.40
#2  0x00002aaad1899535 in mca_btl_ugni_sendi () from /scratch/snx3000/mkrotkie/software/openmpi/4.0.2-noucx/lib/openmpi/mca_btl_ugni.so
#3  0x00002aaad1f4c56f in mca_pml_ob1_send_inline.constprop ()
   from /scratch/snx3000/mkrotkie/software/openmpi/4.0.2-noucx/lib/openmpi/mca_pml_ob1.so
#4  0x00002aaad1f4d19f in mca_pml_ob1_isend () from /scratch/snx3000/mkrotkie/software/openmpi/4.0.2-noucx/lib/openmpi/mca_pml_ob1.so
#5  0x00002aaaabf38895 in PMPI_Isend () from /scratch/snx3000/mkrotkie/software/openmpi/4.0.2-noucx/lib/libmpi.so.40
#6  0x000000000040239f in main (argc=<optimized out>, argv=<optimized out>) at osu_bw.c:117

So it looks like I have some sort of memory access permission issue here. Does the above look familiar to you? Could it be a similar issue to the original one?
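
(For reference, the device-to-device run that completes without the crash is the same command with ugni excluded: mpirun -mca btl ^ugni -bind-to numa -map-by node -np 2 ./osu_bw D D)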

Thanks!

angainor commented 5 years ago

A short update: I checked that GDR support has been configured. From opal/include/opal_config.h:

/* Whether we have CUDA GDR support available */
#define OPAL_CUDA_GDR_SUPPORT 1

/* Whether we have CUDA cuPointerGetAttributes function available */
#define OPAL_CUDA_GET_ATTRIBUTES 1

/* Whether we want cuda device pointer support */
#define OPAL_CUDA_SUPPORT 1

/* Whether we have CUDA CU_POINTER_ATTRIBUTE_SYNC_MEMOPS support available */
#define OPAL_CUDA_SYNC_MEMOPS 1

Also, the same benchmark works when compiled with Cray MPICH, so it seems there are no fundamental problems.

angainor commented 5 years ago

@hppritcha Eh, that was of course my fault and a problem with our local configuration. The smcuda btl was not enabled because I had btl = ugni,vader,self in my /etc/openmpi-mca-params.conf. Still, a segfault is a bit harsh, I guess. Anyway, now that I've added smcuda to the list, things work.
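
For reference, the relevant line in /etc/openmpi-mca-params.conf went from

btl = ugni,vader,self

to the list used for the results below:

btl = ugni,smcuda,self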

@hjelmn FYI, here are the device-to-device benchmark results (srun -n 2 ./osu_bw D D) when I use the ugni btl (btl = ugni,smcuda,self):

[nid03508:04907] mca: bml: Using self btl for send to [[50414,0],0] on node nid03508
[nid03509:16455] mca: bml: Using self btl for send to [[50414,0],1] on node nid03509
[nid03508:04907] mca: bml: Using ugni btl for send to [[50414,0],1] on node nid03509
[nid03509:16455] mca: bml: Using ugni btl for send to [[50414,0],0] on node nid03508
# OSU MPI-CUDA Bandwidth Test v5.6.2
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size      Bandwidth (MB/s)
1                       0.08
2                       0.16
4                       0.32
8                       0.64
16                      1.27
32                      2.55
64                      5.11
128                    10.15
256                    20.28
512                    40.16
1024                   78.23
2048                  151.98
4096                  286.25
8192                  253.30
16384                 353.97
32768                 443.71
65536                 508.65
131072                548.21
262144                569.37
524288                577.19
1048576               580.77
2097152               580.27
4194304               580.69

The performance is quite low: host-to-host, ugni delivers ~9 GB/s here. Also, if I use the tcp btl, I get better results than with ugni for large messages (btl = tcp,smcuda,self):

[nid03508:02621] mca: bml: Using self btl for send to [[50322,0],0] on node nid03508
[nid03509:14485] mca: bml: Using self btl for send to [[50322,0],1] on node nid03509
[nid03508:02621] mca: bml: Using tcp btl for send to [[50322,0],1] on node nid03509
[nid03509:14485] mca: bml: Using tcp btl for send to [[50322,0],0] on node nid03508
# OSU MPI-CUDA Bandwidth Test v5.6.2
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size      Bandwidth (MB/s)
1                       0.06
2                       0.11
4                       0.23
8                       0.45
16                      0.90
32                      1.79
64                      3.59
128                     7.17
256                    14.33
512                    28.68
1024                   56.72
2048                  109.46
4096                  212.96
8192                  397.62
16384                 702.79
32768                1135.74
65536                1000.17
131072               1140.95
262144               1304.91
524288               1619.99
1048576              1837.98
2097152              1990.24
4194304              2065.93

As a side note, for large buffers Cray MPICH delivers 8.5 GB/s here:

# OSU MPI-CUDA Bandwidth Test v5.6.2
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size      Bandwidth (MB/s)
1                       0.47
2                       0.94
4                       1.88
8                       3.73
16                      7.46
32                     14.98
64                     29.88
128                    59.42
256                   120.08
512                   240.06
1024                  481.63
2048                  948.49
4096                 1735.01
8192                 2417.07
16384                2866.11
32768                3595.97
65536                5220.70
131072               6732.95
262144               7663.98
524288               8172.24
1048576              8476.50
2097152              8619.59
4194304              8691.77

This is the first time I've run GPU device-to-device tests on a Cray, so I do not know how this should look. Also, I should note that at this point we do not have the gdrcopy module installed on Piz Daint, in case that would improve things.

@hjelmn, any chance you could comment on the above performance results? Thanks!

hppritcha commented 5 years ago

@angainor the uGNI BTL itself has no CUDA smarts whatsoever, whereas Cray MPICH has had quite a lot of work done to support GPU buffers. In particular, it can register GPU memory with the Aries (at least for Nvidia devices). Judging from the performance, there's at least a copy-in/copy-out happening when using the uGNI BTL, and possibly non-lazy memory registration for the host-based bounce buffers.

angainor commented 5 years ago

@hppritcha Thanks for the info. Out of curiosity: you say that the Cray folks register the GPU memory with the interconnect. Do you mean that Aries supports GPUDirect RDMA? Or is there another way of doing this? I couldn't find anything informative about this.

Also, do you know if there will be an effort within Open MPI to optimize this case, or whether it will be targeted by UCX? Or will I have to use Cray MPICH on this system in the future?

Thanks!

hppritcha commented 4 years ago

@angainor sorry, I was on vacation and lost track of this. Yes, there are options in the uGNI API to register GPU memory with the Aries. Modifications were made to the Aries device driver to support this quite a while ago.

We hope to work with Cray to do better with Open MPI + GPUs on the Shasta-based systems.

angainor commented 4 years ago

@hppritcha Thanks for the update! Looking forward to it!

hppritcha commented 4 years ago

@angainor can we close this issue?

angainor commented 4 years ago

@hppritcha Yes, thanks for your help!