openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

Segmentation fault with CUDA-aware MPI in Julia #5061

Open simonbyrne opened 4 years ago

simonbyrne commented 4 years ago

Describe the bug

Attempting to use CUDA-aware MPI through UCX in Julia without setting UCX_MEM_EVENTS=no results in a segfault.

Downstream issue: https://github.com/JuliaParallel/MPI.jl/pull/370

Steps to Reproduce

Install Julia (https://julialang.org/downloads/) and run the following:

git clone -b sb/ucx-segfault https://github.com/JuliaParallel/MPI.jl
cd MPI.jl/test
julia --project -e 'using Pkg; Pkg.instantiate(); Pkg.build()'
mpiexec -n 1 julia --project=. xx.jl

This results in an ERROR: LoadError: ReadOnlyMemoryError() (which is Julia's report of a SIGSEGV). Running it under gdb gives

#0  0x00002aaaabbec436 in __memcpy_ssse3_back () from /lib64/libc.so.6

which suggests that UCX thinks the pointer is on the host.

$ ucx_info -v
# UCT version=1.8.0 revision c30b7da
# configured with: --build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu --disable-dependency-tracking --prefix=/central/software/ucx/1.8.0_cuda-10.0 --localstatedir=/var --sharedstatedir=/var/lib --disable-optimizations --disable-logging --disable-debug --disable-assertions --enable-mt --disable-params-check --with-cuda=/central/software/CUDA/10.0/

Run with

Setup and versions

$ nvidia-smi
Tue Apr 21 16:36:02 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  On   | 00000000:04:00.0 Off |                    0 |
| N/A   23C    P0    29W / 250W |      0MiB / 16280MiB |      0%      Default |
$ lsmod | grep nv_peer_mem
$ lsmod | grep gdrdrv


bureddy commented 4 years ago

@simonbyrne can you check with the option -x UCX_MEMTYPE_CACHE=n?

simonbyrne commented 4 years ago

> @simonbyrne can you check with the option -x UCX_MEMTYPE_CACHE=n?

That runs correctly.
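For a quick check from the Julia side (rather than via mpiexec's -x option), the variable can also be set in the process environment before MPI is initialized. A minimal sketch, assuming UCX reads its UCX_* environment variables when MPI.Init() brings up the UCX transports:

# Set the UCX variable for this process before MPI/UCX starts up; equivalent in
# effect to passing -x UCX_MEMTYPE_CACHE=n to mpiexec (assumption: UCX reads its
# environment during initialization, which happens inside MPI.Init()).
ENV["UCX_MEMTYPE_CACHE"] = "n"

using MPI
MPI.Init()   # UCX is initialized here and picks up the setting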

simonbyrne commented 4 years ago

What is the difference between UCX_MEMTYPE_CACHE=n and UCX_MEM_EVENTS=n?

bureddy commented 4 years ago

UCX intercepts cudaMalloc/cudaFree calls and records the allocations in a cache so it knows where each buffer resides. These hooks are disabled with UCX_MEM_EVENTS=n, so UCX can't find the CUDA allocation in the cache. We will fix this behavior so that UCX_MEM_EVENTS=n also disables the cache by default.
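Conceptually, the memtype cache is a registry of device allocations populated by those intercepted calls. A toy Julia sketch of the idea (not UCX's actual implementation, which lives in C inside the library):

# Toy sketch: intercepted allocator calls record device address ranges, and later
# lookups classify pointers without asking the CUDA driver on every message.
const DEVICE_RANGES = Dict{UInt64,Int}()    # base address => allocation size in bytes

record_device_alloc!(base::UInt64, nbytes::Int) = (DEVICE_RANGES[base] = nbytes)
record_device_free!(base::UInt64) = delete!(DEVICE_RANGES, base)

# With UCX_MEM_EVENTS=n the allocator is never intercepted, nothing is recorded,
# the lookup misses, and a device buffer is wrongly treated as host memory.
is_device_ptr(p::UInt64) =
    any(base <= p < base + UInt64(len) for (base, len) in DEVICE_RANGES)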

simonbyrne commented 4 years ago

> These hooks are disabled with UCX_MEM_EVENTS=n, so UCX can't find the CUDA allocation in the cache.

Is there a way that we could manually notify the cache when an array is on the device? If so, I could add those calls to the Julia MPI bindings, which would avoid the need to intercept cudaMalloc.

bureddy commented 4 years ago

May I know the reason for running with UCX_MEM_EVENTS=n? The cache is just an optimization; the library will call the CUDA API internally to detect the memory type if the cache is not enabled.
As a temporary workaround, is it acceptable to add UCX_MEMTYPE_CACHE=n along with UCX_MEM_EVENTS=n? In the next release we will autodetect this and disable the cache internally.
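A rough Julia illustration of the kind of driver query such a fallback involves (the constants are the values from cuda.h; it assumes the driver library resolves as "libcuda" and that a CUDA context is current; UCX's real detection code is C, not this):

# Ask the CUDA driver whether a pointer refers to device memory.
const CU_POINTER_ATTRIBUTE_MEMORY_TYPE = Cint(2)   # from cuda.h
const CU_MEMORYTYPE_DEVICE = Cuint(2)              # from cuda.h

function pointer_is_device(ptr::UInt64)
    memtype = Ref{Cuint}(0)
    status = ccall((:cuPointerGetAttribute, "libcuda"), Cint,
                   (Ref{Cuint}, Cint, Culonglong),
                   memtype, CU_POINTER_ATTRIBUTE_MEMORY_TYPE, ptr)
    # CUDA_SUCCESS (0) with memory type DEVICE means a device pointer; for an
    # ordinary host pointer the call fails, which we treat as "not device memory".
    status == 0 && memtype[] == CU_MEMORYTYPE_DEVICE
end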

simonbyrne commented 4 years ago

> May I know the reason for running with UCX_MEM_EVENTS=n?

We had just copied what Slurm did to fix another issue (#4001). Now that that issue is fixed, I was looking to remove the setting, but then I ran into this one.

Should we set both UCX_MEMTYPE_CACHE=n and UCX_MEM_EVENTS=n, or is the former sufficient?

bureddy commented 4 years ago

As issue #4001 is resolved, do you still need to set UCX_MEM_EVENTS=n? If not, can you remove both options and try?

simonbyrne commented 4 years ago

> As issue #4001 is resolved, do you still need to set UCX_MEM_EVENTS=n? If not, can you remove both options and try?

I did, and I got the segfault that this issue is about.

simonbyrne commented 4 years ago

To clarify:

bureddy commented 4 years ago

Ah, sorry, I misunderstood the issue. UCX_MEM_EVENTS=n effectively disables the memtype cache because UCX can no longer intercept the CUDA calls. May I know how CUDA memory is allocated in the application, and how the application is linked to the CUDA runtime? Currently we also have an issue with the memtype cache if the application is statically linked to the CUDA runtime.

simonbyrne commented 4 years ago

I'm not sure about that, but @vchuravy might be able to provide more information.

vchuravy commented 4 years ago

We dlopen libcuda directly and use the driver API to allocate memory. We don't go through the runtime library libcudart (see https://docs.nvidia.com/cuda/cuda-driver-api/driver-vs-runtime-api.html#driver-vs-runtime-api for the difference between the two).

For memory allocation we directly call cuMemAlloc (or sometimes cuMemAllocPitch).
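To make that allocation path concrete, here is a minimal sketch of the pattern being described (assumptions: 64-bit Linux, the driver library resolves as "libcuda", a CUDA context is already current, and the exported symbol is cuMemAlloc_v2, which the cuMemAlloc macro maps to in cuda.h; an illustration, not CUDA.jl's actual code):

using Libdl

# Load the driver library directly and resolve the allocator by symbol name.
libcuda    = Libdl.dlopen("libcuda")
cuMemAlloc = Libdl.dlsym(libcuda, :cuMemAlloc_v2)

dptr   = Ref{Culonglong}(0)    # CUdeviceptr is a 64-bit integer
status = ccall(cuMemAlloc, Cint, (Ref{Culonglong}, Csize_t), dptr, 1024)
status == 0 || error("cuMemAlloc failed with CUDA error code $status")

# Because the function pointer comes from dlsym rather than normal dynamic linking,
# interception that relies on how the symbol is resolved never sees this call,
# which is the scenario discussed in the following comments.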

bureddy commented 4 years ago

@vchuravy do you dlsym cuMemAlloc to get the function pointer and then call it? That could be a problem: UCX can intercept the driver API calls cuMemAlloc/cuMemAllocPitch if the library is linked to at runtime, but not when the function pointers are obtained via dlsym.

vchuravy commented 4 years ago

Yes, we use dlsym to obtain the pointer.

bureddy commented 4 years ago

@vchuravy unfortunately, UCX can't handle this scenario. Please continue with the workaround (https://github.com/JuliaParallel/MPI.jl/blob/5ae9eb8dc2885b8624fe48952c4791977ee0fac5/src/MPI.jl#L85) for now.
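The linked workaround amounts to forcing the UCX settings from the Julia side before MPI.Init() brings up UCX. Roughly in this spirit (a sketch of the approach with a hypothetical wrapper module, not the exact code behind the link):

module MyMPIWrapper   # hypothetical module name, for illustration only

function __init__()
    # Make sure the UCX settings are in the process environment before MPI.Init()
    # creates the UCX context, so device buffers are not misclassified as host memory.
    haskey(ENV, "UCX_MEM_EVENTS")    || (ENV["UCX_MEM_EVENTS"] = "no")
    haskey(ENV, "UCX_MEMTYPE_CACHE") || (ENV["UCX_MEMTYPE_CACHE"] = "n")
end

end # module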

shamisp commented 4 years ago

@bureddy - it seems like the language is aware of the type of the pointer, but with MPI you cannot really pass this information through.

vchuravy commented 4 years ago

We would be happy to call a UCX function and inform you of the memory kind we are passing in, but yes, MPI is not the ideal programming interface for this. We are also looking into using UCX directly, like PyUCX.

shamisp commented 4 years ago

@vchuravy as of now UCX does not accept the memory type as a parameter, but this can be introduced. @bureddy if you let the user pass the datatype + stream, it will open up a lot of opportunities for non-MPI applications. It seems like the majority of apps are well aware of the origin of their memory.

bureddy commented 4 years ago

@shamisp yes, in the last f2f meeting we briefly discussed enhancing the UCX API to pass the memory type + stream. I think this will get enough priority going forward.