tkestack / vcuda-controller

need update for cuda11.4? #17

Closed difenbei closed 1 year ago

difenbei commented 2 years ago

I tried to use vcuda with driver version 470.57.02, and the program can fail without any warning. Does vcuda need to be updated for CUDA 11.4? Thanks!

mYmNeo commented 2 years ago

Please provide the vcuda-controller log. For how to dump the log, see the gpu-manager FAQ.

hzliangbin commented 1 year ago

@mYmNeo Hi, I followed the FAQ to set the environment variable, but I still don't get the vcuda-controller log. Should the environment variable be set in the pod that uses the GPU card?

hzliangbin commented 1 year ago

@difenbei I ran into the same problem. Did you solve it? The logs are below.

/tmp/cuda-control/src/loader.c:1102 config file: /etc/vcuda/7eadf10c1933050f72f33123c4013720907258d292e4695bbcc0732b2afa2405/vcuda.config
/tmp/cuda-control/src/loader.c:1103 pid file: /etc/vcuda/7eadf10c1933050f72f33123c4013720907258d292e4695bbcc0732b2afa2405/pids.config
/tmp/cuda-control/src/loader.c:1107 register to remote: pod uid: ad51fa3f-4b64-11ed-98e3-00163e144b97, cont id: 7eadf10c1933050f72f33123c4013720907258d292e4695bbcc0732b2afa2405
/tmp/cuda-control/src/loader.c:1205 pod uid : ad51fa3f-4b64-11ed-98e3-00163e144b97
/tmp/cuda-control/src/loader.c:1206 limit : 0
/tmp/cuda-control/src/loader.c:1207 container name : tensorflow-test
/tmp/cuda-control/src/loader.c:1208 total utilization: 30
/tmp/cuda-control/src/loader.c:1209 total gpu memory : 4294967296
/tmp/cuda-control/src/loader.c:1210 driver version : 470.57.02
/tmp/cuda-control/src/loader.c:1211 hard limit mode : 1
/tmp/cuda-control/src/loader.c:1212 enable mode : 1
/tmp/cuda-control/src/loader.c:913 Start hijacking
/tmp/cuda-control/src/loader.c:929 can't find function cuEGLInit in libcuda.so.470.57.02
/tmp/cuda-control/src/loader.c:876 can't find function nvmlDeviceGetBusType in libnvidia-ml.so.470.57.02
/tmp/cuda-control/src/loader.c:876 can't find function nvmlDeviceGetIrqNum in libnvidia-ml.so.470.57.02
/tmp/cuda-control/src/loader.c:876 can't find function nvmlVgpuInstanceGetLicenseInfo in libnvidia-ml.so.470.57.02
/tmp/cuda-control/src/loader.c:883 Hijacking nvmlInit

/tmp/cuda-control/src/hijack_call.c:466 cuInit error unknown error

However, it tested OK with driver version 460.32.03.
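
For anyone triaging a similar log, here is a minimal sketch that collects the "can't find function" entries per library, so the unresolved symbols are easy to compare between driver versions. The function name `missing_symbols` and the regular expression are illustrative assumptions based only on the log format shown in this thread; they are not part of vcuda-controller.

```python
import re

# Matches entries such as:
#   /tmp/cuda-control/src/loader.c:929 can't find function cuEGLInit in libcuda.so.470.57.02
# The pattern is an assumption derived from the log lines quoted in this thread.
MISSING_RE = re.compile(r"can't find function (\S+) in (\S+)")


def missing_symbols(log_text):
    """Return {library: [symbol, ...]} for every 'can't find function' entry."""
    found = {}
    for symbol, library in MISSING_RE.findall(log_text):
        found.setdefault(library, []).append(symbol)
    return found


# Two entries copied from the log above; works on one-line or multi-line logs.
sample = (
    "/tmp/cuda-control/src/loader.c:929 can't find function cuEGLInit "
    "in libcuda.so.470.57.02 "
    "/tmp/cuda-control/src/loader.c:876 can't find function nvmlDeviceGetBusType "
    "in libnvidia-ml.so.470.57.02"
)

for library, symbols in missing_symbols(sample).items():
    print(library, symbols)
```

Note that these entries may well be harmless on their own; the line that actually reports a failure in the log above is the `cuInit error unknown error` one.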

mYmNeo commented 1 year ago

Did you reboot your machine after upgrading your driver?

hzliangbin commented 1 year ago

> Did you reboot your machine after upgrading your driver?

Thanks, that was the issue. After I rebooted the machine, it works.
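
A possible explanation, which is an assumption and not something confirmed in this thread: after a driver upgrade without a reboot, the old kernel module can stay loaded while the new user-space libraries (for example `libcuda.so.470.57.02`) are already installed, and a mismatch like that can make `cuInit` fail. The sketch below compares the version reported in `/proc/driver/nvidia/version` with the version suffix of a library file name. The helper names are hypothetical, and the parsing assumes the usual `NVRM version:` line format.

```python
import re
from pathlib import Path

# Driver versions look like 470.57.02 or 535.54; this pattern is an assumption.
VERSION_RE = re.compile(r"\b(\d+\.\d+(?:\.\d+)?)\b")


def kernel_module_version(proc_text):
    """Extract the driver version from the text of /proc/driver/nvidia/version."""
    for line in proc_text.splitlines():
        if line.startswith("NVRM version:"):
            match = VERSION_RE.search(line)
            if match:
                return match.group(1)
    return None


def library_version(filename):
    """Extract the version suffix from a name such as libcuda.so.470.57.02."""
    match = re.search(r"\.so\.(\d+\.\d+(?:\.\d+)?)$", filename)
    return match.group(1) if match else None


def versions_match(proc_text, library_filename):
    """True only if both versions could be parsed and they are equal."""
    kernel = kernel_module_version(proc_text)
    return kernel is not None and kernel == library_version(library_filename)


if __name__ == "__main__":
    proc_file = Path("/proc/driver/nvidia/version")
    if proc_file.exists():
        # Library name taken from the log in this thread; adjust for your host.
        print(versions_match(proc_file.read_text(), "libcuda.so.470.57.02"))
```

If the two versions differ, rebooting (or otherwise reloading the kernel module) is the usual way to bring them back in sync.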

mYmNeo commented 1 year ago

30