tkestack / vcuda-controller

Why are these symbols marked as "deprecated"? #37

Closed amet13x closed 1 year ago

amet13x commented 1 year ago

Although these symbols are rarely mentioned online, their existence in the original libcuda.so can still be confirmed with:

nm -D /lib64/libcuda.so | grep " T "

The driver version is 535.
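The nm check above can also be scripted. Below is a minimal Python sketch that assumes `nm` from binutils is on the PATH and that the library lives at the path used above; the helper names `exported_text_symbols` and `library_exports` are made up for illustration and are not part of this project.

```python
import subprocess


def exported_text_symbols(nm_output):
    """Return the names that `nm -D` lists with symbol type 'T'."""
    names = set()
    for line in nm_output.splitlines():
        parts = line.split()
        # Defined symbols print as: <address> <type> <name>.
        # Undefined ones ("U name") have only two fields and are skipped.
        if len(parts) == 3 and parts[1] == "T":
            # Drop a version suffix such as "@@VERSION" if nm printed one.
            names.add(parts[2].split("@")[0])
    return names


def library_exports(lib_path, wanted):
    """Run `nm -D` on lib_path and report which names in `wanted` are exported."""
    result = subprocess.run(
        ["nm", "-D", lib_path], capture_output=True, text=True, check=True
    )
    found = exported_text_symbols(result.stdout)
    return {name: name in found for name in wanted}


# Example call (needs the NVIDIA driver library to be installed):
# library_exports("/lib64/libcuda.so",
#                 ["cuMemGetAttribute", "cuMemGetAttribute_v2"])
```

This only shows that a symbol is exported; it says nothing about what the function does when called.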

Here are the deprecated symbols: https://github.com/tkestack/vcuda-controller/blob/72e0115d5884f22469de857271c002c84c0d0543/include/cuda-helper.h#L674-L676

mYmNeo commented 1 year ago

Because cuMemGetAttribute and cuMemGetAttribute_v2 always return 801, they are not useful functions.

amet13x commented 1 year ago

Several issues in other projects (like this) contain core-dump backtraces showing cuMemGetAttribute() in the middle of the call stack. It seems a little odd to me to describe functions that show up there as not useful. How did you obtain the return value? More specifically, do you recall what arguments were passed to these functions, and in what context the call happened, when you got 801? Thanks for your help.

mYmNeo commented 1 year ago

You can disassemble the function
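One possible way to do this, sketched in Python: run gdb in batch mode to dump the function, then look for an instruction that loads the constant into the return register. In cuda.h, 801 is CUDA_ERROR_NOT_SUPPORTED, and 801 is 0x321 in hex. The sketch assumes gdb is installed, that the driver is an x86-64 build, and that gdb prints AT&T syntax (its default); the helper names are hypothetical, and the pattern match is only a heuristic, not proof of what the function returns on every path.

```python
import re
import subprocess


def disassemble(lib_path, symbol):
    """Ask gdb (batch mode) for the disassembly of one exported function."""
    cmd = ["gdb", "-batch", "-ex", f"disassemble {symbol}", lib_path]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def loads_constant_into_eax(disassembly, value):
    """Heuristic: does the listing contain `mov $<value>,%eax` (AT&T syntax)?"""
    pattern = re.compile(r"\bmov[lq]?\s+\$0x([0-9a-f]+),\s*%eax")
    return any(int(m.group(1), 16) == value for m in pattern.finditer(disassembly))


# Example call (needs gdb and the NVIDIA driver library):
# text = disassemble("/lib64/libcuda.so", "cuMemGetAttribute_v2")
# loads_constant_into_eax(text, 801)
```

If the listing is only a few instructions long and consists of such a load followed by a return, the function ignores its arguments.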

amet13x commented 1 year ago

Thanks for the tip; it is especially helpful for a beginner like me. The disassembled function is very clear.

It is interesting, though, because I have read backtraces in discussions like this one, where the symbol cuMemGetAttribute_v2 appears as if it were an internal call.

When I tried to reproduce the backtrace using the code above (with some trivial corrections) on driver version 535 and CUDA 12.2 (yes, I know this project is not yet compatible with these), I found that internal CUDA-related symbols in the backtrace are no longer visible to end users. A typical backtrace looks like this, quoted from https://developer.nvidia.com/blog/cuda-toolkit-symbol-server/ :

Thread 1 "test_shared" received signal SIGSEGV, Segmentation fault
0x00007ffff65f9468 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
(gdb) bt
#0  0x00007ffff65f9468 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#1  0x00007ffff6657e1f in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#2  0x00007ffff6013845 in ?? () from /usr/local/cuda/lib64/libcudart.so.12
#3  0x00007ffff604e698 in cudaStreamDestroy () from /usr/local/cuda/lib64/libcudart.so.12
#4  0x00005555555554e3 in main ()

Of course, I also tried the unstrip solution offered there and got a similar result, with obfuscated symbols:

Thread 1 "test_shared" received signal SIGSEGV, Segmentation fault
0x00007ffff65f9468 in libcuda_8e2eae48ba8eb68460582f76460557784d48a71a () from /lib/x86_64-linux-gnu/libcuda.so.1
(gdb) bt
#0  0x00007ffff65f9468 in libcuda_8e2eae48ba8eb68460582f76460557784d48a71a () from /lib/x86_64-linux-gnu/libcuda.so.1
#1  0x00007ffff6657e1f in libcuda_10c0735c5053f532d0a8bdb0959e754c2e7a4e3d () from /lib/x86_64-linux-gnu/libcuda.so.1
#2  0x00007ffff6013845 in libcudart_43d9a0d553511aed66b6c644856e24b360d81d0c () from /usr/local/cuda/lib64/libcudart.so.12
#3  0x00007ffff604e698 in cudaStreamDestroy () from /usr/local/cuda/lib64/libcudart.so.12
#4  0x00005555555554e3 in main ()

What confuses me is why a backtrace could look like this (also pasted below), where the CUDA version is not mentioned but must be lower than CUDA 8.0, according to the CUDA release notes:

The following example from CUDA 8 or lower is quoted because it is the easiest way I found to reproduce such a trace. Some users on more recent CUDA versions, such as 10.2, have also reported similar backtraces.

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff501dd00 in cudbgGetAPIVersion () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
(cuda-gdb) backtrace
#0  0x00007ffff501dd00 in cudbgGetAPIVersion () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#1  0x00007ffff4efc68e in cuMemGetAttribute_v2 () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#2  0x00007ffff4f0cc7f in cuMemGetAttribute_v2 () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3  0x00007ffff4efd7f1 in cuMemGetAttribute_v2 () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4  0x00007ffff4e6b322 in cuMemGetAttribute_v2 () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#5  0x00007ffff4e74b38 in cuMemGetAttribute_v2 () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#6  0x00007ffff4e4d92a in cuMemcpy2DUnaligned_v2 () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#7  0x000000000045bc5d in cudart::driverHelper::memcpy2DPtr(char*, unsigned long, char const*, unsigned long, unsigned long, unsigned long, cudaMemcpyKind, CUstream_st*, bool, bool) ()
#8  0x0000000000435039 in cudart::cudaApiMemcpy2DCommon(void*, unsigned long, void const*, unsigned long, unsigned long, unsigned long, cudaMemcpyKind, bool) ()
#9  0x00000000004350f8 in cudart::cudaApiMemcpy2D(void*, unsigned long, void const*, unsigned long, unsigned long, unsigned long, cudaMemcpyKind) ()
#10 0x0000000000462073 in cudaMemcpy2D ()

I assume that the CUDA driver symbols shown above might be fake ones.
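A rough way to check this assumption is to measure how far apart the frames attributed to one symbol are. Frames #1 to #5 above are all labeled cuMemGetAttribute_v2 yet span about 646 KiB, which is implausibly large for a single function. A likely explanation (my understanding of debugger behavior, not something confirmed in this thread) is that for a stripped library the debugger labels each address with the nearest preceding exported symbol. The Python sketch below uses hypothetical helper names and an arbitrary 64 KiB threshold.

```python
import re
from collections import defaultdict

# Matches gdb frame lines such as:
#   #1  0x00007ffff4efc68e in cuMemGetAttribute_v2 () from /usr/lib/...
FRAME = re.compile(r"^#\d+\s+0x([0-9a-fA-F]+) in (\S+) \(")


def symbol_spans(backtrace):
    """Map each symbol to (frame count, highest address - lowest address)."""
    addrs = defaultdict(list)
    for line in backtrace.splitlines():
        m = FRAME.match(line.strip())
        if m:
            addrs[m.group(2)].append(int(m.group(1), 16))
    return {sym: (len(a), max(a) - min(a)) for sym, a in addrs.items()}


def suspicious_symbols(backtrace, max_span=64 * 1024):
    """Symbols whose frames lie further apart than max_span bytes (assumed limit)."""
    spans = symbol_spans(backtrace)
    return sorted(s for s, (n, span) in spans.items() if n > 1 and span > max_span)
```

Passing the cuda-gdb backtrace above as a string to `suspicious_symbols` flags cuMemGetAttribute_v2, which fits the idea that the name is only a label for nearby internal code.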

For now, I don't think this issue matters anymore. Thanks for your help again. @mYmNeo