Closed: t-arsicaud-catie closed this issue 10 months ago
@t-arsicaud-catie Can you re-run with the environment variable NVSHARE_DEBUG=1 set when launching both the client (PyTorch application) and the nvshare scheduler, and post the logs here?
(In other words, run with LD_PRELOAD=... NVSHARE_DEBUG=1 python3 ...)
Hi,
In fact, I don't get any debug output in the PyTorch app terminal with torch==2.1.0.
And in the nvshare-scheduler terminal, only the following:
[NVSHARE][INFO]: nvshare-scheduler started in debug mode
[NVSHARE][INFO]: nvshare-scheduler listening on /var/run/nvshare/scheduler.sock
Whereas when I run the code with torch==2.0.1, I get:
[NVSHARE][DEBUG]: Found NVML
[NVSHARE][DEBUG]: NVSHARE_POD_NAME = none
[NVSHARE][DEBUG]: NVSHARE_POD_NAMESPACE = none
[NVSHARE][DEBUG]: Sent REGISTER
[NVSHARE][DEBUG]: Received SCHED_ON
[NVSHARE][INFO]: Successfully initialized nvshare GPU
[NVSHARE][INFO]: Client ID = 126fa39e39707cfa
[NVSHARE][DEBUG]: real_cuMemGetInfo returned free=14807.56 MiB, total=14930.56 MiB
[NVSHARE][DEBUG]: nvshare's cuMemGetInfo returning free=13394.56 MiB, total=14930.56 MiB
[NVSHARE][DEBUG]: cuMemAlloc requested 2097152 bytes
[NVSHARE][DEBUG]: cuMemAllocManaged allocated 2097152 bytes at 0x7f862a000000
[NVSHARE][DEBUG]: Total allocated memory on GPU is 2.00 MiB
[NVSHARE][DEBUG]: Received LOCK_OK
[NVSHARE][DEBUG]: cuMemAlloc requested 1024 bytes
...
...
...
...in the app terminal, and:
[NVSHARE][INFO]: nvshare-scheduler started in debug mode
[NVSHARE][INFO]: nvshare-scheduler listening on /var/run/nvshare/scheduler.sock
[NVSHARE][INFO]: Received REGISTER
[NVSHARE][INFO]: Sent SCHED_ON to client 126fa39e39707cfa
[NVSHARE][INFO]: Registered client 126fa39e39707cfa with Pod name = none, Pod namespace = none
[NVSHARE][INFO]: Received REQ_LOCK from 126fa39e39707cfa
[NVSHARE][INFO]: Sent LOCK_OK to client 126fa39e39707cfa
[NVSHARE][DEBUG]: Client 126fa39e39707cfa has closed the connection
[NVSHARE][INFO]: Removing client 126fa39e39707cfa
[NVSHARE][DEBUG]: try_schedule() called with no pending requests
...in the nvshare-scheduler terminal.
In both cases, torch==2.0.1 and torch==2.1.0, nvidia-smi shows that the app accesses the GPU.
This is weird.
We need to verify if the PyTorch 2.1.0 application is indeed making the CUDA calls that nvshare hooks.
Can you run gdb python3 ... for the 2.1.0 application and add breakpoints for cuInit and cuMemAlloc?
You can do this with the break cuInit and break cuMemAlloc gdb commands.
Then, paste the logs here.
I am not used to using gdb, but I suppose this is what you're asking for (running the script with torch==2.1.0), with breakpoints on cuInit, cuMemAlloc and cudaMalloc:
[New Thread 0x7fff4bc39700 (LWP 24249)]
[New Thread 0x7fff4b438700 (LWP 24250)]
...
Thread 1 "python" hit Breakpoint 1, 0x00007fffba8e6660 in cuInit () from /lib/x86_64-linux-gnu/libcuda.so.1
Thread 1 "python" hit Breakpoint 3, 0x00007fffbc91c500 in cudaMalloc () from /home/username/.virtualenvs/torch2.1.0/lib/python3.10/site-packages/torch/lib/../../nvidia/cuda_runtime/lib/libcudart.so.12
Thread 1 "python" hit Breakpoint 1, 0x00007fffba8e6660 in cuInit () from /lib/x86_64-linux-gnu/libcuda.so.1
Thread 1 "python" hit Breakpoint 1, 0x00007fffba8e6660 in cuInit () from /lib/x86_64-linux-gnu/libcuda.so.1
Thread 1 "python" hit Breakpoint 3, 0x00007fffbc91c500 in cudaMalloc () from /home/username/.virtualenvs/torch2.1.0/lib/python3.10/site-packages/torch/lib/../../nvidia/cuda_runtime/lib/libcudart.so.12
...
So no hits to /usr/local/lib/libnvshare.so.
With previous versions of torch, calls to cuInit and cudaMalloc appear in both /usr/local/lib/libnvshare.so and /lib/x86_64-linux-gnu/libcuda.so.1.
For the tests, I just switch from one virtual environment to another, keeping the LD_PRELOAD and CUDA_VISIBLE_DEVICES environment variables.
Good job with gdb!
However, there is a little problem.
cuMemAlloc (i.e., the Driver API function) is the function we hook in nvshare.
You mistakenly added a breakpoint for cudaMalloc (i.e., the Runtime API function, which internally calls cuMemAlloc), so it is natural that we don't see a hit for libnvshare.so.
Could you rerun the test with a breakpoint for cuMemAlloc instead of cudaMalloc?
[...] With previous versions of torch, calls to cuInit and cudaMalloc appear in both /usr/local/lib/libnvshare.so
If you redo the initial test, you'll notice that cudaMalloc is not from libnvshare.so, only cuInit is.
Thank you for your answer, and sorry for the inconvenience.
Here is the output of gdb with torch==2.1.0 and only the cuInit and cuMemAlloc breakpoints:
(gdb) break cuInit
Breakpoint 1 at 0x7fffba8e6660 (2 locations)
(gdb) break cuMemAlloc
Breakpoint 2 at 0x7fffba93d8a0
(gdb) run
Starting program: /home/tarsicaud/.virtualenvs/torch2.1.0/bin/python pt1.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7fff4bc39700 (LWP 54157)]
[New Thread 0x7fff4b438700 (LWP 54158)]
[New Thread 0x7fff46c37700 (LWP 54159)]
[New Thread 0x7fff44436700 (LWP 54160)]
[New Thread 0x7fff43c35700 (LWP 54161)]
...
...
[New Thread 0x7ffecec07700 (LWP 54207)]
[New Thread 0x7ffecc406700 (LWP 54208)]
[New Thread 0x7ffec9c05700 (LWP 54209)]
[New Thread 0x7ffec7404700 (LWP 54210)]
[New Thread 0x7ffec4c03700 (LWP 54211)]
--Type <RET> for more, q to quit, c to continue without paging--c
Thread 1 "python" hit Breakpoint 1, 0x00007fffba8e6660 in cuInit () from /lib/x86_64-linux-gnu/libcuda.so.1
(gdb) c
Continuing.
[New Thread 0x7ffeb8773700 (LWP 54215)]
[New Thread 0x7ffeb6f8c700 (LWP 54216)]
Thread 1 "python" hit Breakpoint 1, 0x00007fffba8e6660 in cuInit () from /lib/x86_64-linux-gnu/libcuda.so.1
(gdb) c
Continuing.
Thread 1 "python" hit Breakpoint 1, 0x00007fffba8e6660 in cuInit () from /lib/x86_64-linux-gnu/libcuda.so.1
(gdb) c
Continuing.
[New Thread 0x7ffe97fff700 (LWP 54217)]
Finished
[Thread 0x7ffe97fff700 (LWP 54217) exited]
[Thread 0x7ffed8c0b700 (LWP 54203) exited]
[Thread 0x7ffec4c03700 (LWP 54211) exited]
[Thread 0x7ffec7404700 (LWP 54210) exited]
[Thread 0x7ffec9c05700 (LWP 54209) exited]
...
...
[Thread 0x7fff43c35700 (LWP 54161) exited]
[Thread 0x7fff44436700 (LWP 54160) exited]
[Thread 0x7fff46c37700 (LWP 54159) exited]
[Thread 0x7fff4b438700 (LWP 54158) exited]
[Thread 0x7fff4bc39700 (LWP 54157) exited]
--Type <RET> for more, q to quit, c to continue without paging--c
[Inferior 1 (process 54155) exited normally]
With the same breakpoints and torch==2.0.1, I get:
(gdb) break cuInit
Breakpoint 1 at 0x7fffc138b660 (2 locations)
(gdb) break cuMemAlloc
Breakpoint 2 at 0x7fffc13e28a0
(gdb) run
Starting program: /home/tarsicaud/.virtualenvs/torch2.0.1/bin/python pt1.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7fff5db35700 (LWP 53942)]
[New Thread 0x7fff5d334700 (LWP 53943)]
[New Thread 0x7fff5ab33700 (LWP 53944)]
[New Thread 0x7fff56332700 (LWP 53945)]
[New Thread 0x7fff55b31700 (LWP 53946)]
...
...
[New Thread 0x7ffee0b03700 (LWP 53992)]
[New Thread 0x7ffede302700 (LWP 53993)]
[New Thread 0x7ffedbb01700 (LWP 53994)]
[New Thread 0x7ffed9300700 (LWP 53995)]
[New Thread 0x7ffed6aff700 (LWP 53996)]
--Type <RET> for more, q to quit, c to continue without paging--c
Thread 1 "python" hit Breakpoint 1, 0x00007ffff7fc0d70 in cuInit () from /usr/local/lib/libnvshare.so
(gdb) c
Continuing.
[New Thread 0x7ffecf9c9700 (LWP 53997)]
[Switching to Thread 0x7ffecf9c9700 (LWP 53997)]
Thread 57 "python" hit Breakpoint 1, 0x00007fffc138b660 in cuInit () from /lib/x86_64-linux-gnu/libcuda.so.1
(gdb) c
Continuing.
[New Thread 0x7ffeceec7700 (LWP 54001)]
[NVSHARE][INFO]: Successfully initialized nvshare GPU
[NVSHARE][INFO]: Client ID = 8963afc6c067d18e
[New Thread 0x7ffece6c6700 (LWP 54002)]
[Switching to Thread 0x7ffff7be6b80 (LWP 53941)]
Thread 1 "python" hit Breakpoint 1, 0x00007fffc138b660 in cuInit () from /lib/x86_64-linux-gnu/libcuda.so.1
(gdb) c
Continuing.
[New Thread 0x7ffecdec5700 (LWP 54003)]
Thread 1 "python" hit Breakpoint 1, 0x00007ffff7fc0d70 in cuInit () from /usr/local/lib/libnvshare.so
(gdb) c
Continuing.
Thread 1 "python" hit Breakpoint 1, 0x00007fffc138b660 in cuInit () from /lib/x86_64-linux-gnu/libcuda.so.1
(gdb) c
Continuing.
Thread 1 "python" hit Breakpoint 1, 0x00007ffff7fc0d70 in cuInit () from /usr/local/lib/libnvshare.so
(gdb) c
Continuing.
Thread 1 "python" hit Breakpoint 1, 0x00007fffc138b660 in cuInit () from /lib/x86_64-linux-gnu/libcuda.so.1
(gdb) c
Continuing.
[New Thread 0x7ffec1fff700 (LWP 54009)]
Finished
[Thread 0x7ffec1fff700 (LWP 54009) exited]
[Thread 0x7fff01310700 (LWP 53979) exited]
[Thread 0x7ffef9b0d700 (LWP 53982) exited]
[Thread 0x7ffed6aff700 (LWP 53996) exited]
[Thread 0x7ffed9300700 (LWP 53995) exited]
...
...
[Thread 0x7fff55b31700 (LWP 53946) exited]
[Thread 0x7fff56332700 (LWP 53945) exited]
[Thread 0x7fff5ab33700 (LWP 53944) exited]
[Thread 0x7fff5d334700 (LWP 53943) exited]
[Thread 0x7fff5db35700 (LWP 53942) exited]
--Type <RET> for more, q to quit, c to continue without paging--c
[Inferior 1 (process 53941) exited normally]
And yes, of course you are right: when I put a breakpoint on cudaMalloc in the last test, cudaMalloc only hits libcudart.so.11.0, not libnvshare.so.
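Comparing the two gdb sessions by eye is error-prone, so one option is to tally the breakpoint hits per shared library. The following is a minimal sketch and not part of nvshare: tally_hits, HIT_RE, and sample are hypothetical names introduced for illustration, and the sample lines are copied from the torch==2.0.1 session above.

```python
import re
from collections import Counter

# Matches gdb lines such as:
#   Thread 1 "python" hit Breakpoint 1, 0x... in cuInit () from /lib/.../libcuda.so.1
HIT_RE = re.compile(r'hit Breakpoint \d+, 0x[0-9a-f]+ in (\w+) \(\) from (\S+)')


def tally_hits(gdb_output):
    """Count breakpoint hits per (function, shared library) pair."""
    counts = Counter()
    for line in gdb_output.splitlines():
        match = HIT_RE.search(line)
        if match:
            counts[match.groups()] += 1
    return counts


# Three hit lines taken from the torch==2.0.1 gdb session in this thread.
sample = '''\
Thread 1 "python" hit Breakpoint 1, 0x00007ffff7fc0d70 in cuInit () from /usr/local/lib/libnvshare.so
Thread 57 "python" hit Breakpoint 1, 0x00007fffc138b660 in cuInit () from /lib/x86_64-linux-gnu/libcuda.so.1
Thread 1 "python" hit Breakpoint 1, 0x00007fffc138b660 in cuInit () from /lib/x86_64-linux-gnu/libcuda.so.1
'''

for (func, lib), n in sorted(tally_hits(sample).items()):
    print(f"{func:12s} {n:3d}  {lib}")
```

Feeding it the saved output of each session makes the difference explicit: in the output shown above, the torch==2.1.0 run records cuInit hits only in libcuda.so.1, while the torch==2.0.1 run records them in both libnvshare.so and libcuda.so.1.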
Hmm, this is strange...
Let's take a step back and verify that the dynamic linker/loader indeed links libnvshare.so into the PyTorch 2.1.0 application.
Could you run LD_DEBUG=libs,symbols LD_PRELOAD=libnvshare.so python3 ... and paste the logs here?
LD_DEBUG is a special-purpose environment variable that ld.so (https://man7.org/linux/man-pages/man8/ld.so.8.html) reads; when it is set, ld.so prints additional debug information.
In this case we want to examine:
ld.so loads [...]
Hi,
Thank you for your answer.
I'm quite confused, as the output of LD_DEBUG=libs,symbols CUDA_VISIBLE_DEVICES=0 LD_PRELOAD=/usr/local/lib/libnvshare.so python torch_app.py &> log.txt gives a very long file (~900 MB).
The beginning of the log file contains:
183761: symbol=__vdso_clock_gettime; lookup in file=linux-vdso.so.1 [0]
183761: symbol=__vdso_gettimeofday; lookup in file=linux-vdso.so.1 [0]
183761: symbol=__vdso_time; lookup in file=linux-vdso.so.1 [0]
183761: symbol=__vdso_getcpu; lookup in file=linux-vdso.so.1 [0]
183761: symbol=__vdso_clock_getres; lookup in file=linux-vdso.so.1 [0]
183761: find library=libc.so.6 [0]; searching
183761: search cache=/etc/ld.so.cache
183761: trying file=/lib/x86_64-linux-gnu/libc.so.6
183761:
183761: find library=libpthread.so.0 [0]; searching
183761: search cache=/etc/ld.so.cache
183761: trying file=/lib/x86_64-linux-gnu/libpthread.so.0
183761:
183761: find library=libdl.so.2 [0]; searching
183761: search cache=/etc/ld.so.cache
183761: trying file=/lib/x86_64-linux-gnu/libdl.so.2
183761:
183761: find library=libutil.so.1 [0]; searching
183761: search cache=/etc/ld.so.cache
183761: trying file=/lib/x86_64-linux-gnu/libutil.so.1
183761:
183761: find library=libm.so.6 [0]; searching
183761: search cache=/etc/ld.so.cache
183761: trying file=/lib/x86_64-linux-gnu/libm.so.6
183761:
183761: find library=libexpat.so.1 [0]; searching
183761: search cache=/etc/ld.so.cache
183761: trying file=/lib/x86_64-linux-gnu/libexpat.so.1
183761:
183761: find library=libz.so.1 [0]; searching
183761: search cache=/etc/ld.so.cache
183761: trying file=/lib/x86_64-linux-gnu/libz.so.1
183761:
183761: symbol=_res; lookup in file=python [0]
183761: symbol=_res; lookup in file=/usr/local/lib/libnvshare.so [0]
183761: symbol=_res; lookup in file=/lib/x86_64-linux-gnu/libc.so.6 [0]
183761: symbol=stderr; lookup in file=python [0]
183761: symbol=error_one_per_line; lookup in file=python [0]
183761: symbol=error_one_per_line; lookup in file=/usr/local/lib/libnvshare.so [0]
183761: symbol=error_one_per_line; lookup in file=/lib/x86_64-linux-gnu/libc.so.6 [0]
183761: symbol=__morecore; lookup in file=python [0]
183761: symbol=__morecore; lookup in file=/usr/local/lib/libnvshare.so [0]
183761: symbol=__morecore; lookup in file=/lib/x86_64-linux-gnu/libc.so.6 [0]
183761: symbol=__key_encryptsession_pk_LOCAL; lookup in file=python [0]
183761: symbol=__key_encryptsession_pk_LOCAL; lookup in file=/usr/local/lib/libnvshare.so [0]
Is there a way to filter or otherwise extract the relevant information?
I've tried something like cat log.txt | grep 'cuInit', which gives only:
183761: symbol=real_cuInit; lookup in file=python [0]
183761: symbol=real_cuInit; lookup in file=/usr/local/lib/libnvshare.so [0]
183761: symbol=cuInit; lookup in file=python [0]
183761: symbol=cuInit; lookup in file=/usr/local/lib/libnvshare.so [0]
and cat log.txt | grep 'cuMemAlloc':
183761: symbol=real_cuMemAllocManaged; lookup in file=python [0]
183761: symbol=real_cuMemAllocManaged; lookup in file=/usr/local/lib/libnvshare.so [0]
183761: symbol=cuMemAlloc_v2; lookup in file=python [0]
183761: symbol=cuMemAlloc_v2; lookup in file=/usr/local/lib/libnvshare.so [0]
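One way to filter a log of this size is to stream it line by line and keep only exact symbol names, which avoids the substring matches that grep 'cuInit' produces (for example real_cuInit). This is a minimal sketch under that assumption; lookups_for, LOOKUP_RE, and sample are hypothetical names, and the sample lines are taken from the grep output above.

```python
import re
from collections import defaultdict

# Matches LD_DEBUG=symbols lines such as:
#   183761: symbol=cuInit; lookup in file=/usr/local/lib/libnvshare.so [0]
LOOKUP_RE = re.compile(r'symbol=(\S+);\s+lookup in file=(\S+)')


def lookups_for(lines, wanted):
    """Map each wanted symbol to the ordered list of files ld.so searched."""
    searched = defaultdict(list)
    for line in lines:
        match = LOOKUP_RE.search(line)
        if match and match.group(1) in wanted:
            searched[match.group(1)].append(match.group(2))
    return dict(searched)


# Lines copied from the grep output in this thread.
sample = [
    "183761: symbol=real_cuInit; lookup in file=python [0]",
    "183761: symbol=real_cuInit; lookup in file=/usr/local/lib/libnvshare.so [0]",
    "183761: symbol=cuInit; lookup in file=python [0]",
    "183761: symbol=cuInit; lookup in file=/usr/local/lib/libnvshare.so [0]",
    "183761: symbol=cuMemAlloc_v2; lookup in file=python [0]",
    "183761: symbol=cuMemAlloc_v2; lookup in file=/usr/local/lib/libnvshare.so [0]",
]

for symbol, files in lookups_for(sample, {"cuInit", "cuMemAlloc_v2"}).items():
    print(symbol, "->", files)
```

Because the function accepts any iterable of lines, an open file object can be passed directly (for instance open("log.txt", errors="replace")), so the ~900 MB log is never held in memory at once.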
Hmmm, I didn't predict it would be this big. The problem is with the symbols argument to LD_DEBUG. It should have been bindings instead.
To avoid having a single, huge log file, can you split the process in two steps?
For the PyTorch 2.1.0 and 2.0.1 applications:
1. LD_DEBUG=libs. The log should be small. We want to ensure that libnvshare.so is loaded.
2. LD_DEBUG=bindings. The log will be bigger. We want to find out the shared library to which cuInit and cuMemAlloc are bound. You can grep for cuInit and cuMemAlloc, as you did in your previous comment.
We're getting closer!
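The two runs described above can be scripted so that each LD_DEBUG category ends up in its own file. The sketch below is an illustration, not an nvshare tool: ld_debug_env and the ld-libs / ld-bindings output prefixes are hypothetical names. It relies on LD_DEBUG_OUTPUT, which the ld.so man page documents as redirecting the debug output from stderr to a file named after its value with ".<pid>" appended.

```python
import os


def ld_debug_env(category, preload, output_prefix, base_env=None):
    """Return an environment that enables one LD_DEBUG category.

    LD_DEBUG_OUTPUT makes ld.so write to '<output_prefix>.<pid>' instead of
    stderr, which keeps the application's own output separate.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["LD_PRELOAD"] = preload
    env["LD_DEBUG"] = category
    env["LD_DEBUG_OUTPUT"] = output_prefix
    return env


# Path taken from this thread; adjust it to the local installation.
PRELOAD = "/usr/local/lib/libnvshare.so"

for category in ("libs", "bindings"):
    env = ld_debug_env(category, PRELOAD, f"ld-{category}")
    print(category, "->", env["LD_DEBUG_OUTPUT"] + ".<pid>")
    # To actually launch the application (not executed in this sketch):
    # subprocess.run(["python3", "torch_app.py"], env=env, check=True)
```

Running the commented-out subprocess call once per virtual environment would give four small-to-medium files (two categories for each torch version) that can then be grepped separately.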
Hi,
The LD_DEBUG=libs... python torch_app_2.1.0.py run still generates quite a long log (~1100 lines), which I can send you by email if you wish.
Regarding libnvshare, cat libs-2.1.0.txt | grep 'nvshare' gives:
196531: calling init: /usr/local/lib/libnvshare.so
196531: calling fini: /usr/local/lib/libnvshare.so [0]
Also, cat libs-2.1.0.txt | grep 'libcuda' gives:
196531: find library=libcudart.so.12 [0]; searching
196531: trying file=/home/username/.virtualenvs/torch2.1.0/lib/python3.10/site-packages/torch/lib/../../nvidia/cublas/lib/libcudart.so.12
196531: trying file=/home/username/.virtualenvs/torch2.1.0/lib/python3.10/site-packages/torch/lib/../../nvidia/cuda_cupti/lib/libcudart.so.12
196531: trying file=/home/username/.virtualenvs/torch2.1.0/lib/python3.10/site-packages/torch/lib/../../nvidia/cuda_nvrtc/lib/libcudart.so.12
196531: trying file=/home/username/.virtualenvs/torch2.1.0/lib/python3.10/site-packages/torch/lib/../../nvidia/cuda_runtime/lib/libcudart.so.12
196531: calling init: /home/username/.virtualenvs/torch2.1.0/lib/python3.10/site-packages/torch/lib/../../nvidia/cuda_runtime/lib/libcudart.so.12
196531: find library=libcuda.so.1 [0]; searching
196531: trying file=/home/username/.virtualenvs/torch2.1.0/lib/python3.10/site-packages/torch/lib/../../nvidia/cublas/lib/libcuda.so.1
196531: trying file=/lib/x86_64-linux-gnu/libcuda.so.1
196531: calling init: /lib/x86_64-linux-gnu/libcuda.so.1
196531: calling fini: /lib/x86_64-linux-gnu/libcuda.so.1 [0]
196531: calling fini: /home/username/.virtualenvs/torch2.1.0/lib/python3.10/site-packages/torch/lib/../../nvidia/cuda_runtime/lib/libcudart.so.12 [0]
For bindings, cat bindings-2.1.0.txt | grep 'nvshare' outputs:
196669: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `sum_allocated'
196669: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `nvshare_size_mem_allocatable'
196669: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `cuda_allocation_list'
196669: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `enable_single_oversub'
196669: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `cuMemcpyAsync'
196669: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `client_fn'
196669: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `real_cuMemcpyDtoH'
196669: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `real_cuMemcpyDtoD'
196669: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `real_cuMemcpyDtoDAsync'
196669: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `release_early_check_interval'
196669: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `global_mutex'
196669: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `rsock'
196669: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `client_tid'
196669: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `own_lock'
196669: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `real_cuMemGetInfo'
196669: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `cuMemcpyDtoH_v2'
196669: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `kern_since_sync'
196669: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `nvml_ok'
196669: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `cuMemcpy'
196669: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `scheduler_on'
196669: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `cuLaunchKernel'
196669: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `release_early_cv'
196669: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `need_lock'
196669: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `cuMemFree_v2'
196669: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `message_type_string'
196669: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `real_cuLaunchKernel'
196669: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `cuMemcpyHtoDAsync_v2'
196669: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `nvshare_client_id'
196669: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `real_cuInit'
196669: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `cuMemGetInfo_v2'
196669: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `cuMemcpyHtoD_v2'
196669: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `own_lock_cv'
196669: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `cuda_ctx'
196669: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `real_cuGetErrorString'
196669: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `__debug'
196669: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `real_cuGetErrorName'
196669: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `real_nvmlInit'
196669: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `cuMemcpyDtoDAsync_v2'
196669: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `real_cuCtxGetCurrent'
196669: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `cuGetProcAddress'
196669: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `nvscheduler_socket_path'
196669: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `initialize_client'
196669: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `real_cuMemcpyAsync'
196669: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `kcount_mutex'
196669: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `pending_kernel_window'
196669: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `real_cuMemFree'
196669: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `real_cuMemcpy'
196669: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `real_cuCtxSetCurrent'
196669: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `real_nvmlDeviceGetUtilizationRates'
196669: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `did_work'
196669: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `release_early_fn'
196669: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `release_early_thread_tid'
196669: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `cuMemcpyDtoHAsync_v2'
196669: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `req_lock_msg'
196669: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `real_cuMemcpyHtoD'
196669: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `got_initial_sched_status'
196669: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `real_cuMemAllocManaged'
196669: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `real_cuMemcpyDtoHAsync'
196669: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `cuMemAlloc_v2'
196669: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `cuInit'
196669: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `real_nvmlDeviceGetHandleByIndex'
196669: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `real_cuCtxSynchronize'
196669: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `cuMemcpyDtoD_v2'
196669: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `real_cuGetProcAddress'
196669: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `__cxa_finalize' [GLIBC_2.2.5]
196669: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `real_cuMemcpyHtoDAsync'
196669: binding file /usr/local/lib/libnvshare.so [0] to python [0]: normal symbol `stderr' [GLIBC_2.2.5]
196669: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `getenv' [GLIBC_2.2.5]
196669: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `free' [GLIBC_2.2.5]
196669: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libpthread.so.0 [0]: normal symbol `pthread_create' [GLIBC_2.2.5]
196669: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libpthread.so.0 [0]: normal symbol `pthread_sigmask' [GLIBC_2.2.5]
196669: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `__errno_location' [GLIBC_2.2.5]
196669: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `unlink' [GLIBC_2.2.5]
196669: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `pthread_cond_broadcast' [GLIBC_2.3.2]
196669: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `clock_gettime' [GLIBC_2.17]
196669: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `write' [GLIBC_2.2.5]
196669: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `pthread_cond_wait' [GLIBC_2.3.2]
196669: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libpthread.so.0 [0]: normal symbol `pthread_once' [GLIBC_2.2.5]
196669: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `write_whole'
196669: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `fclose' [GLIBC_2.2.5]
196669: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `__stack_chk_fail' [GLIBC_2.4]
196669: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `accept4' [GLIBC_2.10]
196669: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `snprintf' [GLIBC_2.2.5]
196669: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `cuda_driver_check_error'
196669: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `memset' [GLIBC_2.2.5]
196669: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `close' [GLIBC_2.2.5]
196669: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `read' [GLIBC_2.2.5]
196669: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `fgets' [GLIBC_2.2.5]
196669: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `strcmp' [GLIBC_2.2.5]
196669: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libdl.so.2 [0]: normal symbol `dlvsym' [GLIBC_2.2.5]
196669: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `read_whole'
196669: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libpthread.so.0 [0]: normal symbol `sem_wait' [GLIBC_2.2.5]
196669: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `sigfillset' [GLIBC_2.2.5]
196669: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `pthread_cond_init' [GLIBC_2.3.2]
196669: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libdl.so.2 [0]: normal symbol `dlopen' [GLIBC_2.2.5]
196669: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `pthread_mutex_unlock' [GLIBC_2.2.5]
196669: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `malloc' [GLIBC_2.2.5]
196669: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `__isoc99_sscanf' [GLIBC_2.7]
196669: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `listen' [GLIBC_2.2.5]
196669: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `strlcpy'
196669: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libpthread.so.0 [0]: normal symbol `sem_post' [GLIBC_2.2.5]
196669: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `continue_with_lock'
196669: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `bind' [GLIBC_2.2.5]
196669: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `pthread_cond_timedwait' [GLIBC_2.3.2]
196669: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libpthread.so.0 [0]: normal symbol `sem_init' [GLIBC_2.2.5]
196669: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `fopen' [GLIBC_2.2.5]
196669: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `nvshare_get_scheduler_path'
196669: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `nvshare_connect'
196669: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `exit' [GLIBC_2.2.5]
196669: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `connect' [GLIBC_2.2.5]
196669: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `fwrite' [GLIBC_2.2.5]
196669: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `__fprintf_chk' [GLIBC_2.3.4]
196669: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `nvshare_receive_block'
196669: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `strerror' [GLIBC_2.2.5]
196669: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `pthread_mutex_init' [GLIBC_2.2.5]
196669: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `pthread_mutex_lock' [GLIBC_2.2.5]
196669: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `rand' [GLIBC_2.2.5]
196669: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libdl.so.2 [0]: normal symbol `dlerror' [GLIBC_2.2.5]
196669: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `socket' [GLIBC_2.2.5]
196669: calling init: /usr/local/lib/libnvshare.so
196669: binding file /usr/lib/python3.10/lib-dynload/_ctypes.cpython-310-x86_64-linux-gnu.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `dlsym' [GLIBC_2.2.5]
196669: binding file python [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `dlsym' [GLIBC_2.2.5]
196669: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libdl.so.2 [0]: normal symbol `dlsym' [GLIBC_2.2.5]
196669: binding file /home/username/.virtualenvs/torch2.1.0/lib/python3.10/site-packages/torch/lib/../../nvidia/nvtx/lib/libnvToolsExt.so.1 [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `dlsym' [GLIBC_2.2.5]
196669: binding file /home/username/.virtualenvs/torch2.1.0/lib/python3.10/site-packages/torch/lib/../../nvidia/cuda_runtime/lib/libcudart.so.12 [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `dlsym' [GLIBC_2.2.5]
196669: binding file /home/username/.virtualenvs/torch2.1.0/lib/python3.10/site-packages/torch/lib/../../nvidia/cublas/lib/libcublasLt.so.12 [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `dlsym' [GLIBC_2.2.5]
196669: binding file /home/username/.virtualenvs/torch2.1.0/lib/python3.10/site-packages/torch/lib/../../nvidia/cublas/lib/libcublas.so.12 [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `dlsym' [GLIBC_2.2.5]
196669: binding file /home/username/.virtualenvs/torch2.1.0/lib/python3.10/site-packages/torch/lib/../../nvidia/curand/lib/libcurand.so.10 [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `dlsym' [GLIBC_2.2.5]
196669: binding file /home/username/.virtualenvs/torch2.1.0/lib/python3.10/site-packages/torch/lib/../../nvidia/cufft/lib/libcufft.so.11 [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `dlsym' [GLIBC_2.2.5]
196669: binding file /home/username/.virtualenvs/torch2.1.0/lib/python3.10/site-packages/torch/lib/../../nvidia/cusparse/lib/../../nvjitlink/lib/libnvJitLink.so.12 [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `dlsym' [GLIBC_2.2.5]
196669: binding file /home/username/.virtualenvs/torch2.1.0/lib/python3.10/site-packages/torch/lib/../../nvidia/nccl/lib/libnccl.so.2 [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `dlsym' [GLIBC_2.2.5]
196669: binding file /home/username/.virtualenvs/torch2.1.0/lib/python3.10/site-packages/torch/lib/../../nvidia/cusparse/lib/libcusparse.so.12 [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `dlsym' [GLIBC_2.2.5]
196669: binding file /home/username/.virtualenvs/torch2.1.0/lib/python3.10/site-packages/torch/lib/../../nvidia/cuda_cupti/lib/libcupti.so.12 [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `dlsym' [GLIBC_2.2.5]
196669: binding file /home/username/.virtualenvs/torch2.1.0/lib/python3.10/site-packages/torch/lib/../../nvidia/cudnn/lib/libcudnn.so.8 [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `dlsym' [GLIBC_2.2.5]
196669: binding file /home/username/.virtualenvs/torch2.1.0/lib/python3.10/site-packages/torch/lib/libc10_cuda.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `dlsym' [GLIBC_2.2.5]
196669: binding file /home/username/.virtualenvs/torch2.1.0/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `dlsym' [GLIBC_2.2.5]
196669: binding file /home/username/.virtualenvs/torch2.1.0/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `dlsym' [GLIBC_2.2.5]
196669: binding file /home/username/.virtualenvs/torch2.1.0/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `dlsym' [GLIBC_2.2.5]
196669: binding file /home/username/.virtualenvs/torch2.1.0/lib/python3.10/site-packages/torch/lib/../../nvidia/cuda_nvrtc/lib/libnvrtc.so.12 [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `dlsym' [GLIBC_2.2.5]
196669: binding file /lib/x86_64-linux-gnu/libcrypto.so.1.1 [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `dlsym' [GLIBC_2.2.5]
196669: binding file /lib/x86_64-linux-gnu/libcuda.so.1 [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `dlsym' [GLIBC_2.2.5]
196669: binding file /lib/x86_64-linux-gnu/libnvidia-ml.so.1 [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `dlsym' [GLIBC_2.2.5]
196669: calling fini: /usr/local/lib/libnvshare.so [0]
cat bindings-2.1.0.txt | grep 'cuInit' gives:
196669: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `real_cuInit'
196669: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `cuInit'
and cat bindings-2.1.0.txt | grep 'cuMemAlloc' gives:
196669: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `real_cuMemAllocManaged'
196669: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `cuMemAlloc_v2'
For comparison, when doing the same tests with the torch 2.0.1 app, I get:
cat libs-2.0.1.txt | grep 'nvshare':
196934: calling init: /usr/local/lib/libnvshare.so
[NVSHARE][INFO]: Successfully initialized nvshare GPU
196934: calling fini: /usr/local/lib/libnvshare.so [0]
cat libs-2.0.1.txt | grep 'libcuda':
196934: find library=libcudart.so.11.0 [0]; searching
196934: trying file=/home/username/.virtualenvs/torch2.0.1/lib/python3.10/site-packages/torch/lib/../../nvidia/cublas/lib/libcudart.so.11.0
196934: trying file=/home/username/.virtualenvs/torch2.0.1/lib/python3.10/site-packages/torch/lib/../../nvidia/cuda_cupti/lib/libcudart.so.11.0
196934: trying file=/home/username/.virtualenvs/torch2.0.1/lib/python3.10/site-packages/torch/lib/../../nvidia/cuda_nvrtc/lib/libcudart.so.11.0
196934: trying file=/home/username/.virtualenvs/torch2.0.1/lib/python3.10/site-packages/torch/lib/../../nvidia/cuda_runtime/lib/libcudart.so.11.0
196934: calling init: /home/username/.virtualenvs/torch2.0.1/lib/python3.10/site-packages/torch/lib/../../nvidia/cuda_runtime/lib/libcudart.so.11.0
196934: find library=libcuda.so.1 [0]; searching
196934: trying file=/home/username/.virtualenvs/torch2.0.1/lib/python3.10/site-packages/torch/lib/../../nvidia/cublas/lib/libcuda.so.1
196934: trying file=/lib/x86_64-linux-gnu/libcuda.so.1
196934: calling init: /lib/x86_64-linux-gnu/libcuda.so.1
196934: find library=libcuda.so [0]; searching
196934: trying file=/lib/x86_64-linux-gnu/libcuda.so
196934: calling fini: /lib/x86_64-linux-gnu/libcuda.so.1 [0]
196934: calling fini: /home/username/.virtualenvs/torch2.0.1/lib/python3.10/site-packages/torch/lib/../../nvidia/cuda_runtime/lib/libcudart.so.11.0 [0]
cat bindings-2.0.1.txt | grep 'nvshare':
196870: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `sum_allocated'
196870: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `nvshare_size_mem_allocatable'
196870: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `cuda_allocation_list'
196870: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `enable_single_oversub'
196870: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `cuMemcpyAsync'
196870: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `client_fn'
196870: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `real_cuMemcpyDtoH'
196870: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `real_cuMemcpyDtoD'
196870: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `real_cuMemcpyDtoDAsync'
196870: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `release_early_check_interval'
196870: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `global_mutex'
196870: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `rsock'
196870: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `client_tid'
196870: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `own_lock'
196870: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `real_cuMemGetInfo'
196870: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `cuMemcpyDtoH_v2'
196870: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `kern_since_sync'
196870: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `nvml_ok'
196870: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `cuMemcpy'
196870: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `scheduler_on'
196870: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `cuLaunchKernel'
196870: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `release_early_cv'
196870: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `need_lock'
196870: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `cuMemFree_v2'
196870: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `message_type_string'
196870: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `real_cuLaunchKernel'
196870: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `cuMemcpyHtoDAsync_v2'
196870: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `nvshare_client_id'
196870: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `real_cuInit'
196870: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `cuMemGetInfo_v2'
196870: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `cuMemcpyHtoD_v2'
196870: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `own_lock_cv'
196870: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `cuda_ctx'
196870: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `real_cuGetErrorString'
196870: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `__debug'
196870: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `real_cuGetErrorName'
196870: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `real_nvmlInit'
196870: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `cuMemcpyDtoDAsync_v2'
196870: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `real_cuCtxGetCurrent'
196870: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `cuGetProcAddress'
196870: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `nvscheduler_socket_path'
196870: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `initialize_client'
196870: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `real_cuMemcpyAsync'
196870: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `kcount_mutex'
196870: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `pending_kernel_window'
196870: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `real_cuMemFree'
196870: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `real_cuMemcpy'
196870: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `real_cuCtxSetCurrent'
196870: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `real_nvmlDeviceGetUtilizationRates'
196870: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `did_work'
196870: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `release_early_fn'
196870: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `release_early_thread_tid'
196870: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `cuMemcpyDtoHAsync_v2'
196870: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `req_lock_msg'
196870: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `real_cuMemcpyHtoD'
196870: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `got_initial_sched_status'
196870: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `real_cuMemAllocManaged'
196870: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `real_cuMemcpyDtoHAsync'
196870: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `cuMemAlloc_v2'
196870: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `cuInit'
196870: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `real_nvmlDeviceGetHandleByIndex'
196870: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `real_cuCtxSynchronize'
196870: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `cuMemcpyDtoD_v2'
196870: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `real_cuGetProcAddress'
196870: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `__cxa_finalize' [GLIBC_2.2.5]
196870: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `real_cuMemcpyHtoDAsync'
196870: binding file /usr/local/lib/libnvshare.so [0] to python [0]: normal symbol `stderr' [GLIBC_2.2.5]
196870: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `getenv' [GLIBC_2.2.5]
196870: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `free' [GLIBC_2.2.5]
196870: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libpthread.so.0 [0]: normal symbol `pthread_create' [GLIBC_2.2.5]
196870: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libpthread.so.0 [0]: normal symbol `pthread_sigmask' [GLIBC_2.2.5]
196870: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `__errno_location' [GLIBC_2.2.5]
196870: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `unlink' [GLIBC_2.2.5]
196870: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `pthread_cond_broadcast' [GLIBC_2.3.2]
196870: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `clock_gettime' [GLIBC_2.17]
196870: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `write' [GLIBC_2.2.5]
196870: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `pthread_cond_wait' [GLIBC_2.3.2]
196870: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libpthread.so.0 [0]: normal symbol `pthread_once' [GLIBC_2.2.5]
196870: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `write_whole'
196870: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `fclose' [GLIBC_2.2.5]
196870: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `__stack_chk_fail' [GLIBC_2.4]
196870: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `accept4' [GLIBC_2.10]
196870: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `snprintf' [GLIBC_2.2.5]
196870: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `cuda_driver_check_error'
196870: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `memset' [GLIBC_2.2.5]
196870: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `close' [GLIBC_2.2.5]
196870: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `read' [GLIBC_2.2.5]
196870: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `fgets' [GLIBC_2.2.5]
196870: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `strcmp' [GLIBC_2.2.5]
196870: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libdl.so.2 [0]: normal symbol `dlvsym' [GLIBC_2.2.5]
196870: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `read_whole'
196870: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libpthread.so.0 [0]: normal symbol `sem_wait' [GLIBC_2.2.5]
196870: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `sigfillset' [GLIBC_2.2.5]
196870: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `pthread_cond_init' [GLIBC_2.3.2]
196870: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libdl.so.2 [0]: normal symbol `dlopen' [GLIBC_2.2.5]
196870: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `pthread_mutex_unlock' [GLIBC_2.2.5]
196870: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `malloc' [GLIBC_2.2.5]
196870: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `__isoc99_sscanf' [GLIBC_2.7]
196870: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `listen' [GLIBC_2.2.5]
196870: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `strlcpy'
196870: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libpthread.so.0 [0]: normal symbol `sem_post' [GLIBC_2.2.5]
196870: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `continue_with_lock'
196870: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `bind' [GLIBC_2.2.5]
196870: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `pthread_cond_timedwait' [GLIBC_2.3.2]
196870: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libpthread.so.0 [0]: normal symbol `sem_init' [GLIBC_2.2.5]
196870: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `fopen' [GLIBC_2.2.5]
196870: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `nvshare_get_scheduler_path'
196870: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `nvshare_connect'
196870: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `exit' [GLIBC_2.2.5]
196870: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `connect' [GLIBC_2.2.5]
196870: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `fwrite' [GLIBC_2.2.5]
196870: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `__fprintf_chk' [GLIBC_2.3.4]
196870: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `nvshare_receive_block'
196870: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `strerror' [GLIBC_2.2.5]
196870: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `pthread_mutex_init' [GLIBC_2.2.5]
196870: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `pthread_mutex_lock' [GLIBC_2.2.5]
196870: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `rand' [GLIBC_2.2.5]
196870: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libdl.so.2 [0]: normal symbol `dlerror' [GLIBC_2.2.5]
196870: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `socket' [GLIBC_2.2.5]
196870: calling init: /usr/local/lib/libnvshare.so
196870: binding file /usr/lib/python3.10/lib-dynload/_ctypes.cpython-310-x86_64-linux-gnu.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `dlsym' [GLIBC_2.2.5]
196870: binding file python [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `dlsym' [GLIBC_2.2.5]
196870: binding file /usr/local/lib/libnvshare.so [0] to /lib/x86_64-linux-gnu/libdl.so.2 [0]: normal symbol `dlsym' [GLIBC_2.2.5]
196870: binding file /home/username/.virtualenvs/torch2.0.1/lib/python3.10/site-packages/torch/lib/../../nvidia/nvtx/lib/libnvToolsExt.so.1 [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `dlsym' [GLIBC_2.2.5]
196870: binding file /home/username/.virtualenvs/torch2.0.1/lib/python3.10/site-packages/torch/lib/../../nvidia/cuda_runtime/lib/libcudart.so.11.0 [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `dlsym' [GLIBC_2.2.5]
196870: binding file /home/username/.virtualenvs/torch2.0.1/lib/python3.10/site-packages/torch/lib/../../nvidia/cublas/lib/libcublasLt.so.11 [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `dlsym' [GLIBC_2.2.5]
196870: binding file /home/username/.virtualenvs/torch2.0.1/lib/python3.10/site-packages/torch/lib/../../nvidia/cublas/lib/libcublas.so.11 [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `dlsym' [GLIBC_2.2.5]
196870: binding file /home/username/.virtualenvs/torch2.0.1/lib/python3.10/site-packages/torch/lib/../../nvidia/cufft/lib/libcufft.so.10 [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `dlsym' [GLIBC_2.2.5]
196870: binding file /home/username/.virtualenvs/torch2.0.1/lib/python3.10/site-packages/torch/lib/../../nvidia/curand/lib/libcurand.so.10 [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `dlsym' [GLIBC_2.2.5]
196870: binding file /home/username/.virtualenvs/torch2.0.1/lib/python3.10/site-packages/torch/lib/../../nvidia/nccl/lib/libnccl.so.2 [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `dlsym' [GLIBC_2.2.5]
196870: binding file /home/username/.virtualenvs/torch2.0.1/lib/python3.10/site-packages/torch/lib/../../nvidia/cusparse/lib/libcusparse.so.11 [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `dlsym' [GLIBC_2.2.5]
196870: binding file /home/username/.virtualenvs/torch2.0.1/lib/python3.10/site-packages/torch/lib/../../nvidia/cuda_cupti/lib/libcupti.so.11.7 [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `dlsym' [GLIBC_2.2.5]
196870: binding file /home/username/.virtualenvs/torch2.0.1/lib/python3.10/site-packages/torch/lib/../../nvidia/cudnn/lib/libcudnn.so.8 [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `dlsym' [GLIBC_2.2.5]
196870: binding file /home/username/.virtualenvs/torch2.0.1/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `dlsym' [GLIBC_2.2.5]
196870: binding file /home/username/.virtualenvs/torch2.0.1/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `dlsym' [GLIBC_2.2.5]
196870: binding file /home/username/.virtualenvs/torch2.0.1/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `dlsym' [GLIBC_2.2.5]
196870: binding file /home/username/.virtualenvs/torch2.0.1/lib/python3.10/site-packages/torch/lib/../../nvidia/cuda_nvrtc/lib/libnvrtc.so.11.2 [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `dlsym' [GLIBC_2.2.5]
196870: binding file /lib/x86_64-linux-gnu/libcrypto.so.1.1 [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `dlsym' [GLIBC_2.2.5]
196870: binding file /lib/x86_64-linux-gnu/libcuda.so.1 [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `dlsym' [GLIBC_2.2.5]
[NVSHARE][INFO]: Successfully initialized nvshare GPU
196870: calling fini: /usr/local/lib/libnvshare.so [0]
cat bindings-2.0.1.txt | grep 'cuInit':
196870: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `real_cuInit'
196870: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `cuInit'
196870: binding file /lib/x86_64-linux-gnu/libcuda.so.1 [0] to /lib/x86_64-linux-gnu/libcuda.so.1 [0]: normal symbol `cuInit'
cat bindings-2.0.1.txt | grep 'cuMemAlloc':
196870: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `real_cuMemAllocManaged'
196870: binding file /usr/local/lib/libnvshare.so [0] to /usr/local/lib/libnvshare.so [0]: normal symbol `cuMemAlloc_v2'
196870: binding file /lib/x86_64-linux-gnu/libcuda.so.1 [0] to /lib/x86_64-linux-gnu/libcuda.so.1 [0]: normal symbol `cuMemAllocManaged'
196870: binding file /lib/x86_64-linux-gnu/libcuda.so.1 [0] to /lib/x86_64-linux-gnu/libcuda.so.1 [0]: normal symbol `cuMemAllocPitch_v2'
196870: binding file /lib/x86_64-linux-gnu/libcuda.so.1 [0] to /lib/x86_64-linux-gnu/libcuda.so.1 [0]: normal symbol `cuMemAllocAsync'
196870: binding file /lib/x86_64-linux-gnu/libcuda.so.1 [0] to /lib/x86_64-linux-gnu/libcuda.so.1 [0]: normal symbol `cuMemAllocAsync_ptsz'
196870: binding file /lib/x86_64-linux-gnu/libcuda.so.1 [0] to /lib/x86_64-linux-gnu/libcuda.so.1 [0]: normal symbol `cuMemAllocFromPoolAsync'
196870: binding file /lib/x86_64-linux-gnu/libcuda.so.1 [0] to /lib/x86_64-linux-gnu/libcuda.so.1 [0]: normal symbol `cuMemAllocFromPoolAsync_ptsz'
196870: binding file /lib/x86_64-linux-gnu/libcuda.so.1 [0] to /lib/x86_64-linux-gnu/libcuda.so.1 [0]: normal symbol `cuMemAllocManaged'
Thanks for taking the time to run these tests.
I'd like to take a look at the full logs (both for libs and bindings), so if you could mail them to me, or upload them to a public place, I'll happily take a look.
Also, I'd like you to rerun the gdb tests on 2.1.0 and 2.0.1 with the following breakpoints set:
cuInit
cuMemAlloc
cuMemAlloc_v2
cuGetProcAddress
cuGetProcAddress_v2
I noticed that PyTorch 2.1.0 (from PyPI, the one you have installed) comes with CUDA 12.x, while PyTorch 2.0.1 comes with CUDA 11.x. CUDA 12.0 introduced a new function, cuGetProcAddress_v2, which we don't hook in nvshare. We only hook the plain cuGetProcAddress from CUDA 11.
To verify this, could you uninstall Pytorch 2.1.0 and re-install it with CUDA 11.8, following the official instructions [1]? (If you are using Conda, there are also instructions for that in the same link.)
Then, rerun the Pytorch 2.1.0 example and my prediction is that it will work.
[1] https://pytorch.org/get-started/previous-versions/#linux-and-windows-1
Yes, you are right!
In a cuda 12.2 environment, with torch installed with pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu118, the nvshare manager is triggered as expected.
In the same cuda 12.2 environment, it is not triggered when torch has been installed with only pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0.
I've collected the full logs you requested for the cuda 12.2 / cuda 12.2 scenario; I'll send them to you by email.
For the gdb part, the symbols called in this scenario are, as you expected, cuInit, cuGetProcAddress_v2 and cuMemAlloc_v2.
Great job!
In order to support CUDA >=12 applications, we must also hook cuGetProcAddress_v2().
I will prepare (and merge) a PR tackling this when I get some time.
In the meantime, you can use the cu118 (CUDA 11.8) variant for PyTorch 2.1.0.
@grgalex
cuGetProcAddress should serve as an entrypoint for the hook lib, implying that both initialize_libnvshare() and initialize_client() should be executed the first time the hooked cuGetProcAddress is called.
Meanwhile, the definition of cuGetProcAddress is different in CUDA 11 and CUDA 12, and some tricks may be needed to ensure compatibility.
@pokerfaceSad
Currently, we use cuInit() as the trigger for initializing libnvshare.
According to the CUDA documentation, it is the only function that applications must necessarily call before using a GPU. For applications that use the CUDA Runtime API, the runtime calls cuInit() internally for them.
cuGetProcAddress and the _v2 variant are only called in apps that use the CUDA Runtime API, as the runtime uses these functions to obtain the Driver API symbols.
Therefore, cuInit() serves as a better entrypoint imo, and we are keeping it as such.
Regarding the differences between cuGetProcAddress and the _v2 variant, I've taken a quick look at the docs and indeed the function prototype and usage are a bit different.
Do you want to point out something specific about the approach we should take regarding the last argument of _v2?
Perhaps you can experiment a bit with a CUDA 12.x Runtime API application and see how it uses the function.
@grgalex
When the CUDA 11.4+ Runtime API is used for the first time in a user program, it calls cuGetProcAddress() to get the function pointers for cuInit() and other driver API functions. cuInit() is then called through the pointer obtained previously.
This means that cuGetProcAddress() is called before cuInit().
Therefore, cuGetProcAddress() should also serve as an entrypoint. Otherwise, real_cuGetProcAddress may be a NULL pointer when it is called.
cuGetProcAddress_v2 should also be defined in libnvshare, with the additional argument, for compatibility.
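To make the ordering problem concrete, here is a minimal sketch in plain C that needs neither CUDA nor nvshare. The driver_* functions are stand-ins for libcuda, the two-argument cuGetProcAddress signature is deliberately simplified, and the hook_* names and function bodies are assumptions made for illustration, not nvshare's actual code. Only the names init_libnvshare_done, initialize_libnvshare, real_cuGetProcAddress and real_cuInit are borrowed from this thread and the logs above.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

typedef int cu_result; /* stand-in for CUresult; 0 plays the role of CUDA_SUCCESS */
typedef cu_result (*cu_init_fn)(unsigned int);
typedef cu_result (*get_proc_fn)(const char *, void **);

/* Stand-ins for the real driver (libcuda.so.1). */
static int driver_init_calls = 0;

static cu_result driver_cuInit(unsigned int flags)
{
    (void)flags;
    driver_init_calls++;
    return 0;
}

static cu_result driver_cuGetProcAddress(const char *symbol, void **pfn)
{
    if (strcmp(symbol, "cuInit") == 0) {
        *pfn = (void *)driver_cuInit;
        return 0;
    }
    *pfn = NULL;
    return 1;
}

/* Hook library state: the real_* pointers stay NULL until initialization. */
static pthread_once_t init_libnvshare_done = PTHREAD_ONCE_INIT;
static get_proc_fn real_cuGetProcAddress = NULL;
static cu_init_fn real_cuInit = NULL;
static int hooked_init_calls = 0;

static void initialize_libnvshare(void)
{
    /* A real hook library would dlopen() the driver and dlsym() these. */
    real_cuGetProcAddress = driver_cuGetProcAddress;
    real_cuInit = driver_cuInit;
}

static cu_result hook_cuInit(unsigned int flags)
{
    pthread_once(&init_libnvshare_done, initialize_libnvshare);
    hooked_init_calls++;
    return real_cuInit(flags);
}

static cu_result hook_cuGetProcAddress(const char *symbol, void **pfn)
{
    /* Without this call, real_cuGetProcAddress would still be NULL when the
     * runtime looks up symbols before it has called cuInit(). */
    pthread_once(&init_libnvshare_done, initialize_libnvshare);
    if (strcmp(symbol, "cuInit") == 0) {
        *pfn = (void *)hook_cuInit; /* hand back our wrapper, not the driver's */
        return 0;
    }
    return real_cuGetProcAddress(symbol, pfn);
}

/* Mimics the call order described above: look up cuInit first, then call it. */
int runtime_startup_demo(void)
{
    void *pfn = NULL;

    assert(hook_cuGetProcAddress("cuInit", &pfn) == 0);
    assert(pfn != NULL);
    assert(((cu_init_fn)pfn)(0) == 0);
    assert(hooked_init_calls == 1); /* the wrapper ran ... */
    assert(driver_init_calls == 1); /* ... and forwarded to the driver */
    return 0;
}
```

Because both entry points go through the same pthread_once guard, it does not matter which one the application reaches first.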
@pokerfaceSad
You are right. I had missed that!
By the way, do you want to prepare and send a PR for this?
I'm kinda busy at the moment, so I would really appreciate any help!
The suggested changes are (correct me if I'm wrong):
- Hook cuGetProcAddress_v2. No #define, as it's a distinct symbol we want to hook.
- Set symbolStatus to 0 (or define the enum in our header file and set it to CU_GET_PROC_ADDRESS_SUCCESS).
- Otherwise, real_cuGetProcAddress_v2 will handle it.
- Add the true_or_exit(pthread_once(&init_libnvshare_done, initialize_libnvshare) == 0); call to both cuGetProcAddress and *_v2.
@grgalex
OK, I will submit a PR for this :)
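For readers following along, the shape of the _v2 change discussed above can be sketched without CUDA headers. This is an illustration under stated assumptions, not the code from the eventual PR: cu_result and query_result are local stand-ins for CUresult and CUdriverProcAddressQueryResult, the driver_* functions and this real_cuGetProcAddress_v2 body fake the driver, and hook_cuMemAlloc is a hypothetical wrapper. The argument order (symbol, pfn, cudaVersion, flags, symbolStatus) follows the CUDA 12 driver API. The NULL check on symbolStatus is a defensive assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef int cu_result; /* stand-in for CUresult; 0 plays the role of CUDA_SUCCESS */

/* Stand-in for CUdriverProcAddressQueryResult (CUDA 12 driver API). */
typedef enum {
    QUERY_SUCCESS = 0,         /* CU_GET_PROC_ADDRESS_SUCCESS */
    QUERY_SYMBOL_NOT_FOUND = 1 /* CU_GET_PROC_ADDRESS_SYMBOL_NOT_FOUND */
} query_result;

/* Each stub returns a tag, so the caller can tell who served the symbol. */
enum { FROM_DRIVER = 1, FROM_HOOK = 2 };
typedef int (*stub_fn)(void);

static int driver_cuMemAlloc(void) { return FROM_DRIVER; }
static int driver_cuCtxSynchronize(void) { return FROM_DRIVER; }
static int hook_cuMemAlloc(void) { return FROM_HOOK; } /* hypothetical wrapper */

/* Fake of the driver's own cuGetProcAddress_v2. */
static cu_result real_cuGetProcAddress_v2(const char *symbol, void **pfn,
                                          int cuda_version, uint64_t flags,
                                          query_result *symbol_status)
{
    stub_fn found = NULL;

    (void)cuda_version;
    (void)flags;
    if (strcmp(symbol, "cuMemAlloc") == 0)
        found = driver_cuMemAlloc;
    else if (strcmp(symbol, "cuCtxSynchronize") == 0)
        found = driver_cuCtxSynchronize;

    *pfn = (void *)found;
    if (symbol_status != NULL)
        *symbol_status = found ? QUERY_SUCCESS : QUERY_SYMBOL_NOT_FOUND;
    return 0;
}

/* The hook: answer for symbols we wrap, forward everything else unchanged. */
cu_result hook_cuGetProcAddress_v2(const char *symbol, void **pfn,
                                   int cuda_version, uint64_t flags,
                                   query_result *symbol_status)
{
    if (strcmp(symbol, "cuMemAlloc") == 0) {
        *pfn = (void *)hook_cuMemAlloc;
        if (symbol_status != NULL)
            *symbol_status = QUERY_SUCCESS; /* we did supply a pointer */
        return 0;
    }
    return real_cuGetProcAddress_v2(symbol, pfn, cuda_version, flags,
                                    symbol_status);
}

int v2_hook_demo(void)
{
    void *pfn = NULL;
    query_result status = QUERY_SYMBOL_NOT_FOUND;
    const int version = 12000; /* arbitrary value for the demo */

    /* A wrapped symbol is served by the hook. */
    assert(hook_cuGetProcAddress_v2("cuMemAlloc", &pfn, version, 0, &status) == 0);
    assert(status == QUERY_SUCCESS && ((stub_fn)pfn)() == FROM_HOOK);

    /* An unwrapped symbol falls through to the (fake) driver. */
    assert(hook_cuGetProcAddress_v2("cuCtxSynchronize", &pfn, version, 0, &status) == 0);
    assert(status == QUERY_SUCCESS && ((stub_fn)pfn)() == FROM_DRIVER);

    /* Unknown symbols keep whatever status the driver reports. */
    assert(hook_cuGetProcAddress_v2("noSuchSymbol", &pfn, version, 0, &status) == 0);
    assert(status == QUERY_SYMBOL_NOT_FOUND && pfn == NULL);
    return 0;
}
```

In a real hook library the function would be exported under the exact name cuGetProcAddress_v2, and it would call the pthread_once guard from the list above before touching real_cuGetProcAddress_v2.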
@t-arsicaud-catie
We just merged support for CUDA 12.
Feel free to deploy from the main branch and hopefully it will work out of the box :)
Hi,
Sorry, I was unavailable for a while and could not test until now.
It's done, and I confirm that it works well, at least with the latest versions of pytorch / cuda.
Thank you both for your work and for improving nvshare!
Hi,
I recently discovered that pytorch code such as the following :
which is, at execution time, registered and managed by nvshare with torch==1.13.1 and torch==2.0.1, is not with torch==2.1.
The code runs as expected, accessing the GPU defined in CUDA_VISIBLE_DEVICES, but directly, bypassing the controls made by nvshare.
My test environment is the following:
- nvshare compiled and installed following the recommendations in the README
- CUDA_VISIBLE_DEVICES and LD_PRELOAD correctly set
Any idea why, and is there a way to prevent this when CUDA_VISIBLE_DEVICES and LD_PRELOAD are correctly set (in nvshare or the pytorch code)?