openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

Does CUDA memtype cache handle applications or other libraries that also intercept cu*alloc? #4540

Open · Akshay-Venkatesh opened this issue 4 years ago

Akshay-Venkatesh commented 4 years ago

@bureddy @yosefe

Users of applications that have their own memory pools (which internally intercept cudaMalloc, for example) have seen crashes with the memtype cache. Some examples include users of rapidsai/rmm and AMGX. I know the recent changes to the memtype cache are supposed to handle allocations made before ucp_init, but are interceptions by other agents also handled? (A sketch of the pool pattern in question is below.)

cc @abellina
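
For illustration, a minimal sketch of that pool pattern (hypothetical code, not RMM or AMGX): one real cudaMalloc up front, after which "allocations" are carved out of the slab and never reach the CUDA runtime, so an interception-based memtype cache cannot observe them.

```cpp
// Illustrative sub-allocating device pool (hypothetical, not RMM/AMGX code).
#include <cuda_runtime.h>
#include <cstddef>

class DevicePoolSketch {
public:
    explicit DevicePoolSketch(size_t capacity) : capacity_(capacity) {
        cudaMalloc(&base_, capacity_);   // the only call the runtime (and any interceptor) sees
    }
    ~DevicePoolSketch() { cudaFree(base_); }

    // Requests after construction are plain pointer arithmetic on the slab,
    // so no cu*alloc call is made that an interceptor could observe.
    void* allocate(size_t bytes) {
        size_t aligned = (bytes + 255) & ~size_t(255);
        if (offset_ + aligned > capacity_) return nullptr;
        void* p = static_cast<char*>(base_) + offset_;
        offset_ += aligned;
        return p;
    }

private:
    void*  base_   = nullptr;
    size_t offset_ = 0;
    size_t capacity_;
};
```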

bureddy commented 4 years ago

@Akshay-Venkatesh can you point me to the intercepting code in these libs? I could not find it.

bureddy commented 4 years ago

Probably the CUDA runtime function symbol names were mangled by the C++ compiler in these libs, so UCX might have failed to intercept the mangled symbol names with dlsym(). Can you check whether the cuda* symbol names are mangled in these libraries?
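
For reference, dynamic-linker interception of an unmangled C symbol generally looks like the sketch below (a simplified LD_PRELOAD-style interposer for illustration only, not UCX's actual hooking code). A C++-mangled wrapper symbol, or a statically linked runtime, would not be caught this way.

```cpp
// Hypothetical interposer, built as a shared library and preloaded.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   // for RTLD_NEXT
#endif
#include <dlfcn.h>
#include <cstdio>
#include <cstddef>

typedef int cudaError_t;  // stand-in for the real enum from cuda_runtime_api.h

extern "C" cudaError_t cudaMalloc(void** devPtr, size_t size) {
    using real_fn = cudaError_t (*)(void**, size_t);
    // Resolve the next definition of the symbol, i.e. the one in libcudart.so.
    // Error handling omitted for brevity.
    static real_fn real =
        reinterpret_cast<real_fn>(dlsym(RTLD_NEXT, "cudaMalloc"));

    cudaError_t err = real(devPtr, size);
    // A memtype-cache-style hook would record (*devPtr, size) here.
    std::fprintf(stderr, "intercepted cudaMalloc(%zu bytes)\n", size);
    return err;
}
```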

Akshay-Venkatesh commented 4 years ago

> @Akshay-Venkatesh can you point me to the intercepting code in these libs? I could not find it.

@marsaev can you point to the interception code?

Akshay-Venkatesh commented 4 years ago

> Probably the CUDA runtime function symbol names were mangled by the C++ compiler in these libs, so UCX might have failed to intercept the mangled symbol names with dlsym(). Can you check whether the cuda* symbol names are mangled in these libraries?

Will try to find it and reply back here.

bureddy commented 4 years ago

Can you also check whether these libs are statically linked against libcudart_static.a? I see the following in the instructions:

> Build and install librmm using cmake & make. CMake depends on the nvcc executable being on your path or defined in $CUDACXX.

AFAIK, nvcc statically links the CUDA runtime by default. If so, we have known issues with interception in these cases.

marsaev commented 4 years ago

@Akshay-Venkatesh https://github.com/NVIDIA/AMGX/blob/b3101ffaaddee71c32ad53a151cf0e87a31b59a8/base/include/global_thread_handle.h#L232 Just to be clear: we intercept calls at the source level in a separate namespace; internally we call the same cudaMalloc from the global namespace, which is intended to call the CUDA runtime (like this: https://github.com/NVIDIA/AMGX/blob/b3101ffaaddee71c32ad53a151cf0e87a31b59a8/base/src/global_thread_handle.cu#L892). We do not redefine symbols at the global level or use LD_PRELOAD. The static CUDA runtime is linked by default (handled by CMake).
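
In other words, the pattern is roughly the following (a minimal sketch with made-up names; the real code is at the links above):

```cpp
// Source-level interception inside the library's own namespace (sketch only).
#include <cuda_runtime.h>
#include <cstddef>

namespace pool_sketch {  // hypothetical namespace, not the real AMGX one

// Callers inside the library use pool_sketch::cudaMalloc. No global symbol is
// redefined and no LD_PRELOAD is involved; the wrapper simply forwards to
// ::cudaMalloc from the CUDA runtime. An external interceptor therefore only
// sees this forwarded call -- and only if the runtime is the shared
// libcudart.so rather than libcudart_static.a.
inline cudaError_t cudaMalloc(void** ptr, size_t size) {
    // Pool bookkeeping / reuse of cached blocks would go here.
    return ::cudaMalloc(ptr, size);
}

}  // namespace pool_sketch
```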

jirikraus commented 4 years ago

I tried to link AMGX with the shared CUDA Runtime and can confirm that this is indeed the issue. When the static CUDA Runtime is avoided, the UCX pointer cache no longer causes issues with AMGX.

Akshay-Venkatesh commented 4 years ago

> Can you also check whether these libs are statically linked against libcudart_static.a? I see the following in the instructions:
>
> > Build and install librmm using cmake & make. CMake depends on the nvcc executable being on your path or defined in $CUDACXX.
>
> AFAIK, nvcc statically links the CUDA runtime by default. If so, we have known issues with interception in these cases.

@bureddy As @jirikraus confirmed, this does fall into the known-issue case. Can you link that GitHub issue here so that we can close both when the issue gets fixed?

bureddy commented 4 years ago

related to https://github.com/openucx/ucx/issues/3210

shamisp commented 4 years ago

@bureddy please bring this up during the f2f. The issue of static builds is not going to disappear...