simonbyrne opened this issue 4 years ago
@simonbyrne can you check with option -x UCX_MEMTYPE_CACHE=n?
That runs correctly.
What is the difference between UCX_MEMTYPE_CACHE=n and UCX_MEM_EVENTS=n?
UCX intercepts cudaMalloc/cudaFree calls and records allocations in a cache so it knows where memory is located. These hooks are disabled with UCX_MEM_EVENTS=n, so UCX can't find the CUDA allocation in the cache. We will fix this behavior so that UCX_MEM_EVENTS=n disables the cache by default.
Is there a way that we could manually notify the cache when an array is on the device? If so, I could add these to the Julia MPI bindings, which would avoid the need to intercept cudaMalloc.
May I know the reason for running with UCX_MEM_EVENTS=n?
The cache is just an optimization; the library will call the CUDA API internally to detect the memory type if the cache is not enabled.
As a temporary workaround, is it acceptable to add UCX_MEMTYPE_CACHE=n along with UCX_MEM_EVENTS=n? In the next release, UCX will autodetect this and disable the cache internally.
May I know the reason for running with UCX_MEM_EVENTS=n?
We had just copied what Slurm did to fix another issue (#4001). Now that it has been fixed, I was looking to remove that option but then ran into this issue.
Should we set both UCX_MEMTYPE_CACHE=n and UCX_MEM_EVENTS=n, or is the former sufficient?
As issue #4001 is resolved, do you still need to set UCX_MEM_EVENTS=n? If not, can you remove both options and try?
I did, and I got the segfault that this issue is about.
To clarify:
- UCX_MEMTYPE_CACHE=n: set
- UCX_MEM_EVENTS=n: set, but it prints a warning: "UCX ERROR failed to set UCM memtype event handler: Unsupported operation"

Ah. Sorry, I misunderstood the issue. UCX_MEM_EVENTS=n actually disables the memtype cache because UCX can't intercept the CUDA calls. OK. May I know how CUDA memory is allocated in the application, and how it is linked to the CUDA runtime? Currently, we also have an issue with the memtype cache if the application is statically linked to the CUDA runtime.
That I'm not sure about, but @vchuravy might be able to provide more information.
We dlopen libcuda directly and use the driver API to allocate memory. We don't go through the runtime libcudart (see https://docs.nvidia.com/cuda/cuda-driver-api/driver-vs-runtime-api.html#driver-vs-runtime-api for the difference between the two). For memory allocation we directly call cuMemAlloc (or sometimes cuMemAllocPitch).
@vchuravy do you dlsym cuMemAlloc to get a function pointer and use it? This could be a problem. UCX can intercept the driver API functions cuMemAlloc and cuMemAllocPitch only if they are linked at runtime.
Yes, we use dlsym to obtain the pointer.
@vchuravy unfortunately, UCX can't handle this scenario. Please continue with the workaround (https://github.com/JuliaParallel/MPI.jl/blob/5ae9eb8dc2885b8624fe48952c4791977ee0fac5/src/MPI.jl#L85) for now.
@bureddy - it seems like the language is aware of the type of the pointer, but with MPI you cannot really pass this through.
We would be happy to call a UCX function and inform you of the memory kind we are passing in, but yes, MPI is not the ideal programming interface for this.
We are also looking into using UCX directly, like PyUCX.
@vchuravy as of now UCX does not accept the memory type as a parameter, but this can be introduced. @bureddy if you let the user pass datatype + stream, it will open a lot of opportunities for non-MPI applications. It seems like the majority of apps are well aware of the origin of their memory.
@shamisp yes. In the last f2f meeting, we briefly discussed enhancing the UCX API to pass mem type + stream. I think this will get enough priority going forward.
Describe the bug
Attempting to use UCX CUDA-aware MPI in Julia without UCX_MEM_EVENTS=no results in a segfault.

Downstream issue: https://github.com/JuliaParallel/MPI.jl/pull/370
Steps to Reproduce
Install Julia (https://julialang.org/downloads/) and run the following:
This results in an ERROR: LoadError: ReadOnlyMemoryError() (which is Julia's report of a SIGSEGV). Running it in gdb gives output suggesting that UCX thinks the pointer is on the host.
Run with UCX_ERROR_SIGNALS="SIGILL,SIGBUS,SIGFPE"
Setup and versions
OS version (e.g. Linux distro) + CPU architecture (x86_64/aarch64/ppc64le/...)
For GPU related issues:
Additional information (depending on the issue)
ucx_info -d to show transports and devices recognized by UCX