tkestack / gpu-manager

`nvidia-smi` cannot find `libnvidia-ml.so` #101

Open cailun01 opened 3 years ago

cailun01 commented 3 years ago

After creating the vcuda pod, running nvidia-smi inside it fails with an error saying libnvidia-ml.so cannot be found:

NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.

However, I can find libnvidia-ml.so inside the pod:

find / -name libnvidia-ml.so
/usr/local/nvidia/lib/libnvidia-ml.so
/usr/local/nvidia/lib64/libnvidia-ml.so

Running ldconfig.real prints the following:

ldconfig.real: Can't link /usr/local/nvidia/lib/libnvidia-ml.so.1 to libnvidia-ml.so.450.66
ldconfig.real: Can't link /usr/local/nvidia/lib/libcuda.so.1 to libcuda.so.450.66
ldconfig.real: Can't link /usr/local/nvidia/lib64/libcuda.so.1 to libcuda.so.450.66
ldconfig.real: Can't link /usr/local/nvidia/lib64/libnvidia-ml.so.1 to libnvidia-ml.so.450.66

Searching online suggests this could be a GPU driver problem, but nvidia-smi runs fine on the host:

Fri Jun 11 14:34:34 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.66       Driver Version: 450.66       CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:04:00.0 Off |                  N/A |
| 23%   40C    P8    17W / 250W |      0MiB / 11176MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Uninstalling and reinstalling the driver on the host did not help either.

mYmNeo commented 3 years ago
  1. Check LD_LIBRARY_PATH.
  2. nvidia-docker is not supported, only runc.

Please confirm these two points.
cailun01 commented 3 years ago
  1. LD_LIBRARY_PATH does include the directories that contain libnvidia-ml.so. Output of echo $LD_LIBRARY_PATH:

    /usr/local/nvidia/lib64:/usr/local/cuda/lib64/stubs:/usr/local/nvidia/lib

    libnvidia-ml.so really is in those directories, but nvidia-smi still cannot find it. Output of find / -name libnvidia-ml.so:

    /usr/local/cuda-10.1/targets/x86_64-linux/lib/stubs/libnvidia-ml.so
    /usr/local/nvidia/lib/libnvidia-ml.so
    /usr/local/nvidia/lib64/libnvidia-ml.so
  2. I am not using nvidia-docker. My daemon.json is:

    
    {
      "log-level": "debug",
      "live-restore": true,
      "icc": false,
      "storage-driver": "overlay",
      "insecure-registries": ["qce-reg.nucpoc.com"],
      "live-restore": true,
      "log-driver": "json-file",
      "log-opts": {
        "max-size": "512m",
        "max-file": "3"
      }
    }
mYmNeo commented 3 years ago

Your last step, invoking ldconfig, may have broken the library path. Please post the output of ldconfig -p and check whether /usr/local/nvidia/lib64 appears in it.
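
A minimal sketch of that check, assuming a POSIX shell inside the pod. The helper name cache_has_dir is hypothetical and not part of gpu-manager; it only scans the text that ldconfig -p prints.

```shell
#!/bin/sh
# cache_has_dir DIR
# Reads `ldconfig -p`-style output on stdin and succeeds (exit 0) if at
# least one cached library resolves to a file directly under DIR.
cache_has_dir() {
    dir=${1%/}                      # drop a trailing slash, if any
    grep -F -q -- "=> $dir/"        # fixed-string match on the "=> path" column
}

# Usage inside the pod:
#   ldconfig -p | cache_has_dir /usr/local/nvidia/lib64 \
#       && echo "in cache" || echo "not in cache"
```

Note that a directory missing from the cache can still be searched through LD_LIBRARY_PATH, so "not in cache" on its own does not prove the loader cannot reach it.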

cailun01 commented 3 years ago

The output of ldconfig -p is below. It lists only libraries under /usr/lib64/ and /usr/local/cuda-10.1/targets/x86_64-linux/lib/; nothing from /usr/local/nvidia/lib64 appears.

241 libs found in cache `/etc/ld.so.cache'
    p11-kit-trust.so (libc6,x86-64) => /lib64/p11-kit-trust.so
    libz.so.1 (libc6,x86-64) => /lib64/libz.so.1
    libxml2.so.2 (libc6,x86-64) => /lib64/libxml2.so.2
    libverto.so.1 (libc6,x86-64) => /lib64/libverto.so.1
    libuuid.so.1 (libc6,x86-64) => /lib64/libuuid.so.1
    libutil.so.1 (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libutil.so.1
    libutil.so (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libutil.so
    libutempter.so.0 (libc6,x86-64) => /lib64/libutempter.so.0
    libustr-1.0.so.1 (libc6,x86-64) => /lib64/libustr-1.0.so.1
    libuser.so.1 (libc6,x86-64) => /lib64/libuser.so.1
    libudev.so.1 (libc6,x86-64) => /lib64/libudev.so.1
    libtinfo.so.5 (libc6,x86-64) => /lib64/libtinfo.so.5
    libtic.so.5 (libc6,x86-64) => /lib64/libtic.so.5
    libthread_db.so.1 (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libthread_db.so.1
    libthread_db.so (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libthread_db.so
    libtasn1.so.6 (libc6,x86-64) => /lib64/libtasn1.so.6
    libsystemd.so.0 (libc6,x86-64) => /lib64/libsystemd.so.0
    libsystemd-login.so.0 (libc6,x86-64) => /lib64/libsystemd-login.so.0
    libsystemd-journal.so.0 (libc6,x86-64) => /lib64/libsystemd-journal.so.0
    libsystemd-id128.so.0 (libc6,x86-64) => /lib64/libsystemd-id128.so.0
    libsystemd-daemon.so.0 (libc6,x86-64) => /lib64/libsystemd-daemon.so.0
    libstdc++.so.6 (libc6,x86-64) => /lib64/libstdc++.so.6
    libssl3.so (libc6,x86-64) => /lib64/libssl3.so
    libssl.so.10 (libc6,x86-64) => /lib64/libssl.so.10
    libssh2.so.1 (libc6,x86-64) => /lib64/libssh2.so.1
    libsqlite3.so.0 (libc6,x86-64) => /lib64/libsqlite3.so.0
    libsoftokn3.so (libc6,x86-64) => /lib64/libsoftokn3.so
    libsmime3.so (libc6,x86-64) => /lib64/libsmime3.so
    libsmartcols.so.1 (libc6,x86-64) => /lib64/libsmartcols.so.1
    libslapi-2.4.so.2 (libc6,x86-64) => /lib64/libslapi-2.4.so.2
    libsepol.so.1 (libc6,x86-64) => /lib64/libsepol.so.1
    libsemanage.so.1 (libc6,x86-64) => /lib64/libsemanage.so.1
    libselinux.so.1 (libc6,x86-64) => /lib64/libselinux.so.1
    libsasl2.so.3 (libc6,x86-64) => /lib64/libsasl2.so.3
    librt.so.1 (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/librt.so.1
    librt.so (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/librt.so
    librpmsign.so.1 (libc6,x86-64) => /lib64/librpmsign.so.1
    librpmio.so.3 (libc6,x86-64) => /lib64/librpmio.so.3
    librpmbuild.so.3 (libc6,x86-64) => /lib64/librpmbuild.so.3
    librpm.so.3 (libc6,x86-64) => /lib64/librpm.so.3
    libresolv.so.2 (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libresolv.so.2
    libresolv.so (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libresolv.so
    libreadline.so.6 (libc6,x86-64) => /lib64/libreadline.so.6
    libqrencode.so.3 (libc6,x86-64) => /lib64/libqrencode.so.3
    libp11-kit.so.0 (libc6,x86-64) => /lib64/libp11-kit.so.0
    libpython2.7.so.1.0 (libc6,x86-64) => /lib64/libpython2.7.so.1.0
    libpwquality.so.1 (libc6,x86-64) => /lib64/libpwquality.so.1
    libpthread.so.0 (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libpthread.so.0
    libpth.so.20 (libc6,x86-64) => /lib64/libpth.so.20
    libprocps.so.4 (libc6,x86-64) => /lib64/libprocps.so.4
    libpopt.so.0 (libc6,x86-64) => /lib64/libpopt.so.0
    libplds4.so (libc6,x86-64) => /lib64/libplds4.so
    libplc4.so (libc6,x86-64) => /lib64/libplc4.so
    libpcre32.so.0 (libc6,x86-64) => /lib64/libpcre32.so.0
    libpcre16.so.0 (libc6,x86-64) => /lib64/libpcre16.so.0
    libpcreposix.so.0 (libc6,x86-64) => /lib64/libpcreposix.so.0
    libpcrecpp.so.0 (libc6,x86-64) => /lib64/libpcrecpp.so.0
    libpcre.so.1 (libc6,x86-64) => /lib64/libpcre.so.1
    libpcprofile.so (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libpcprofile.so
    libpanelw.so.5 (libc6,x86-64) => /lib64/libpanelw.so.5
    libpanel.so.5 (libc6,x86-64) => /lib64/libpanel.so.5
    libpamc.so.0 (libc6,x86-64) => /lib64/libpamc.so.0
    libpam_misc.so.0 (libc6,x86-64) => /lib64/libpam_misc.so.0
    libpam.so.0 (libc6,x86-64) => /lib64/libpam.so.0
    libopcodes-2.27-44.base.el7.so (libc6,x86-64) => /lib64/libopcodes-2.27-44.base.el7.so
    libnvrtc.so.10.1 (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libnvrtc.so.10.1
    libnvrtc.so (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libnvrtc.so
    libnvrtc-builtins.so.10.1 (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libnvrtc-builtins.so.10.1
    libnvrtc-builtins.so (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libnvrtc-builtins.so
    libnvjpeg.so.10 (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libnvjpeg.so.10
    libnvjpeg.so (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libnvjpeg.so
    libnvgraph.so.10 (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libnvgraph.so.10
    libnvgraph.so (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libnvgraph.so
    libnvblas.so.10 (libc6,x86-64) => /lib64/libnvblas.so.10
    libnvblas.so (libc6,x86-64) => /lib64/libnvblas.so
    libnvToolsExt.so.1 (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libnvToolsExt.so.1
    libnvToolsExt.so (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libnvToolsExt.so
    libnss3.so (libc6,x86-64) => /lib64/libnss3.so
    libnssutil3.so (libc6,x86-64) => /lib64/libnssutil3.so
    libnsssysinit.so (libc6,x86-64) => /lib64/libnsssysinit.so
    libnsspem.so (libc6,x86-64) => /lib64/libnsspem.so
    libnssdbm3.so (libc6,x86-64) => /lib64/libnssdbm3.so
    libnss_nisplus.so.2 (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libnss_nisplus.so.2
    libnss_nisplus.so (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libnss_nisplus.so
    libnss_nis.so.2 (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libnss_nis.so.2
    libnss_nis.so (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libnss_nis.so
    libnss_mymachines.so.2 (libc6,x86-64) => /lib64/libnss_mymachines.so.2
    libnss_myhostname.so.2 (libc6,x86-64) => /lib64/libnss_myhostname.so.2
    libnss_hesiod.so.2 (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libnss_hesiod.so.2
    libnss_hesiod.so (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libnss_hesiod.so
    libnss_files.so.2 (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libnss_files.so.2
    libnss_files.so (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libnss_files.so
    libnss_dns.so.2 (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libnss_dns.so.2
    libnss_dns.so (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libnss_dns.so
    libnss_db.so.2 (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libnss_db.so.2
    libnss_db.so (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libnss_db.so
    libnss_compat.so.2 (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libnss_compat.so.2
    libnss_compat.so (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libnss_compat.so
    libnspr4.so (libc6,x86-64) => /lib64/libnspr4.so
    libnsl.so.1 (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libnsl.so.1
    libnsl.so (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libnsl.so
    libnpps.so.10 (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libnpps.so.10
    libnpps.so (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libnpps.so
    libnppitc.so.10 (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libnppitc.so.10
    libnppitc.so (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libnppitc.so
    libnppisu.so.10 (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libnppisu.so.10
    libnppisu.so (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libnppisu.so
    libnppist.so.10 (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libnppist.so.10
    libnppist.so (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libnppist.so
    libnppim.so.10 (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libnppim.so.10
    libnppim.so (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libnppim.so
    libnppig.so.10 (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libnppig.so.10
    libnppig.so (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libnppig.so
    libnppif.so.10 (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libnppif.so.10
    libnppif.so (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libnppif.so
    libnppidei.so.10 (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libnppidei.so.10
    libnppidei.so (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libnppidei.so
    libnppicom.so.10 (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libnppicom.so.10
    libnppicom.so (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libnppicom.so
    libnppicc.so.10 (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libnppicc.so.10
    libnppicc.so (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libnppicc.so
    libnppial.so.10 (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libnppial.so.10
    libnppial.so (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libnppial.so
    libnppc.so.10 (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libnppc.so.10
    libnppc.so (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libnppc.so
    libncursesw.so.5 (libc6,x86-64) => /lib64/libncursesw.so.5
    libncurses.so.5 (libc6,x86-64) => /lib64/libncurses.so.5
    libncurses++w.so.5 (libc6,x86-64) => /lib64/libncurses++w.so.5
    libncurses++.so.5 (libc6,x86-64) => /lib64/libncurses++.so.5
    libnccl.so.2 (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libnccl.so.2
    libnccl.so (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libnccl.so
    libmpfr.so.4 (libc6,x86-64) => /lib64/libmpfr.so.4
    libmpc.so.3 (libc6,x86-64) => /lib64/libmpc.so.3
    libmount.so.1 (libc6,x86-64) => /lib64/libmount.so.1
    libmenuw.so.5 (libc6,x86-64) => /lib64/libmenuw.so.5
    libmenu.so.5 (libc6,x86-64) => /lib64/libmenu.so.5
    libmemusage.so (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libmemusage.so
    libmagic.so.1 (libc6,x86-64) => /lib64/libmagic.so.1
    libm.so.6 (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libm.so.6
    libm.so (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libm.so
    liblz4.so.1 (libc6,x86-64) => /lib64/liblz4.so.1
    liblzma.so.5 (libc6,x86-64) => /lib64/liblzma.so.5
    liblua-5.1.so (libc6,x86-64) => /lib64/liblua-5.1.so
    libldap_r-2.4.so.2 (libc6,x86-64) => /lib64/libldap_r-2.4.so.2
    libldap-2.4.so.2 (libc6,x86-64) => /lib64/libldap-2.4.so.2
    liblber-2.4.so.2 (libc6,x86-64) => /lib64/liblber-2.4.so.2
    libk5crypto.so.3 (libc6,x86-64) => /lib64/libk5crypto.so.3
    libkrb5support.so.0 (libc6,x86-64) => /lib64/libkrb5support.so.0
    libkrb5.so.3 (libc6,x86-64) => /lib64/libkrb5.so.3
    libkrad.so.0 (libc6,x86-64) => /lib64/libkrad.so.0
    libkmod.so.2 (libc6,x86-64) => /lib64/libkmod.so.2
    libkeyutils.so.1 (libc6,x86-64) => /lib64/libkeyutils.so.1
    libkdb5.so.8 (libc6,x86-64) => /lib64/libkdb5.so.8
    libjson.so.0 (libc6,x86-64) => /lib64/libjson.so.0
    libjson-c.so.2 (libc6,x86-64) => /lib64/libjson-c.so.2
    libidn.so.11 (libc6,x86-64) => /lib64/libidn.so.11
    libhistory.so.6 (libc6,x86-64) => /lib64/libhistory.so.6
    libgthread-2.0.so.0 (libc6,x86-64) => /lib64/libgthread-2.0.so.0
    libgssrpc.so.4 (libc6,x86-64) => /lib64/libgssrpc.so.4
    libgssapi_krb5.so.2 (libc6,x86-64) => /lib64/libgssapi_krb5.so.2
    libgpgme.so.11 (libc6,x86-64) => /lib64/libgpgme.so.11
    libgpgme-pthread.so.11 (libc6,x86-64) => /lib64/libgpgme-pthread.so.11
    libgpg-error.so.0 (libc6,x86-64) => /lib64/libgpg-error.so.0
    libgomp.so.1 (libc6,x86-64) => /lib64/libgomp.so.1
    libgobject-2.0.so.0 (libc6,x86-64) => /lib64/libgobject-2.0.so.0
    libgmpxx.so.4 (libc6,x86-64) => /lib64/libgmpxx.so.4
    libgmp.so.10 (libc6,x86-64) => /lib64/libgmp.so.10
    libgmodule-2.0.so.0 (libc6,x86-64) => /lib64/libgmodule-2.0.so.0
    libglib-2.0.so.0 (libc6,x86-64) => /lib64/libglib-2.0.so.0
    libgirepository-1.0.so.1 (libc6,x86-64) => /lib64/libgirepository-1.0.so.1
    libgio-2.0.so.0 (libc6,x86-64) => /lib64/libgio-2.0.so.0
    libgdbm_compat.so.4 (libc6,x86-64) => /lib64/libgdbm_compat.so.4
    libgdbm.so.4 (libc6,x86-64) => /lib64/libgdbm.so.4
    libgcrypt.so.11 (libc6,x86-64) => /lib64/libgcrypt.so.11
    libgcc_s.so.1 (libc6,x86-64) => /lib64/libgcc_s.so.1
    libfreebl3.so (libc6,x86-64) => /lib64/libfreebl3.so
    libfreeblpriv3.so (libc6,x86-64) => /lib64/libfreeblpriv3.so
    libformw.so.5 (libc6,x86-64) => /lib64/libformw.so.5
    libform.so.5 (libc6,x86-64) => /lib64/libform.so.5
    libffi.so.6 (libc6,x86-64) => /lib64/libffi.so.6
    libexpat.so.1 (libc6,x86-64) => /lib64/libexpat.so.1
    libelf.so.1 (libc6,x86-64) => /lib64/libelf.so.1
    libdw.so.1 (libc6,x86-64) => /lib64/libdw.so.1
    libdl.so.2 (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libdl.so.2
    libdl.so (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libdl.so
    libdevmapper.so.1.02 (libc6,x86-64) => /lib64/libdevmapper.so.1.02
    libdbus-1.so.3 (libc6,x86-64) => /lib64/libdbus-1.so.3
    libdbus-glib-1.so.2 (libc6,x86-64) => /lib64/libdbus-glib-1.so.2
    libdb-5.3.so (libc6,x86-64) => /lib64/libdb-5.3.so
    libcusparse.so.10 (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libcusparse.so.10
    libcusparse.so (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libcusparse.so
    libcusolverMg.so.10 (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libcusolverMg.so.10
    libcusolverMg.so (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libcusolverMg.so
    libcusolver.so.10 (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libcusolver.so.10
    libcusolver.so (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libcusolver.so
    libcurl.so.4 (libc6,x86-64) => /lib64/libcurl.so.4
    libcurand.so.10 (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libcurand.so.10
    libcurand.so (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libcurand.so
    libcuinj64.so.10.1 (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libcuinj64.so.10.1
    libcuinj64.so (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libcuinj64.so
    libcufftw.so.10 (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libcufftw.so.10
    libcufftw.so (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libcufftw.so
    libcufft.so.10 (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libcufft.so.10
    libcufft.so (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libcufft.so
    libcudart.so.10.1 (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libcudart.so.10.1
    libcudart.so (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libcudart.so
    libcublasLt.so.10 (libc6,x86-64) => /lib64/libcublasLt.so.10
    libcublasLt.so (libc6,x86-64) => /lib64/libcublasLt.so
    libcublas.so.10 (libc6,x86-64) => /lib64/libcublas.so.10
    libcublas.so (libc6,x86-64) => /lib64/libcublas.so
    libcryptsetup.so.12 (libc6,x86-64) => /lib64/libcryptsetup.so.12
    libcryptsetup.so.4 (libc6,x86-64) => /lib64/libcryptsetup.so.4
    libcrypto.so.10 (libc6,x86-64) => /lib64/libcrypto.so.10
    libcrypt.so.1 (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libcrypt.so.1
    libcrypt.so (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libcrypt.so
    libcrack.so.2 (libc6,x86-64) => /lib64/libcrack.so.2
    libcom_err.so.2 (libc6,x86-64) => /lib64/libcom_err.so.2
    libcidn.so.1 (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libcidn.so.1
    libcidn.so (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libcidn.so
    libcap.so.2 (libc6,x86-64) => /lib64/libcap.so.2
    libcap-ng.so.0 (libc6,x86-64) => /lib64/libcap-ng.so.0
    libc.so.6 (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libc.so.6
    libbz2.so.1 (libc6,x86-64) => /lib64/libbz2.so.1
    libblkid.so.1 (libc6,x86-64) => /lib64/libblkid.so.1
    libbfd-2.27-44.base.el7.so (libc6,x86-64) => /lib64/libbfd-2.27-44.base.el7.so
    libauparse.so.0 (libc6,x86-64) => /lib64/libauparse.so.0
    libaudit.so.1 (libc6,x86-64) => /lib64/libaudit.so.1
    libattr.so.1 (libc6,x86-64) => /lib64/libattr.so.1
    libassuan.so.0 (libc6,x86-64) => /lib64/libassuan.so.0
    libasm.so.1 (libc6,x86-64) => /lib64/libasm.so.1
    libanl.so.1 (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libanl.so.1
    libanl.so (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libanl.so
    libacl.so.1 (libc6,x86-64) => /lib64/libacl.so.1
    libaccinj64.so.10.1 (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libaccinj64.so.10.1
    libaccinj64.so (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libaccinj64.so
    libSegFault.so (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libSegFault.so
    libOpenCL.so.1 (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libOpenCL.so.1
    libOpenCL.so (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libOpenCL.so
    libBrokenLocale.so.1 (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libBrokenLocale.so.1
    libBrokenLocale.so (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libBrokenLocale.so
    ld-linux-x86-64.so.2 (libc6,x86-64) => /lib64/ld-linux-x86-64.so.2
mYmNeo commented 3 years ago

Is there a dangling symbolic link named libnvidia-ml.so or libnvidia-ml.so.1 in your image? If so, remove it.
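
One way to look for such links, sketched under the assumption that find and test are available in the image. dangling_links is a hypothetical helper name used only for illustration.

```shell
#!/bin/sh
# dangling_links DIR...
# Prints every symlink named libnvidia-ml.so* under the given directories
# whose target does not exist. `test -e` follows the link, so it fails
# exactly when the link is broken.
dangling_links() {
    find "$@" -type l -name 'libnvidia-ml.so*' \
        ! -exec test -e {} \; -print 2>/dev/null
}

# Usage inside the container:
#   dangling_links /usr/local/nvidia /usr/lib64
```

No output means no broken link with that name was found under the directories searched.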

cailun01 commented 3 years ago

I ran find . -xtype l and symlinks . and neither reported any broken symbolic links.

I did notice one thing. In the driver install directory on the host, /usr/lib64, searching for libnvidia-ml shows two links (libnvidia-ml.so, libnvidia-ml.so.1) and one shared library (libnvidia-ml.so.450.66):

lrwxrwxrwx   1 root root          17 Jun 11 10:29 libnvidia-ml.so -> libnvidia-ml.so.1
lrwxrwxrwx   1 root root          22 Jun 11 10:29 libnvidia-ml.so.1 -> libnvidia-ml.so.450.66
-rwxr-xr-x   1 root root     1905848 Jun 11 10:29 libnvidia-ml.so.450.66

In gpu-manager's directory /etc/gpu-manager/vdriver/nvidia/lib64, however, there is no libnvidia-ml.so.1. There is only the libnvidia-ml.so link and two shared libraries (libnvidia-ml.so.450.66 and libnvidia-ml.so.440.36):

lrwxrwxrwx 1 root root       22 Jun 15 14:21 libnvidia-ml.so -> libnvidia-ml.so.450.66
-rwxr-xr-x 1 root root  1465752 Jun  2 14:15 libnvidia-ml.so.440.36
-rwxr-xr-x 1 root root  1905848 Jun 15 14:21 libnvidia-ml.so.450.66

libnvidia-ml.so.440.36 belongs to an older driver version that should already have been removed from my node.

Could the missing libnvidia-ml.so.1 be the cause?

cailun01 commented 3 years ago

Also, there is a workaround: preloading the shared library manually through the LD_PRELOAD environment variable makes nvidia-smi work:

# LD_PRELOAD=/usr/local/nvidia/lib64/libnvidia-ml.so nvidia-smi

Thu Jun 17 06:46:25 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.66       Driver Version: 450.66       CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:04:00.0 Off |                  N/A |
| 23%   40C    P8    17W / 250W |      0MiB / 11176MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

So the shared library itself appears to be valid. I just don't know why nvidia-smi cannot find it.

mYmNeo commented 3 years ago

nvidia-smi tries to dlopen libnvidia-ml.so.1. What version of gpu-manager are you running?
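
Since the name being opened is libnvidia-ml.so.1 rather than libnvidia-ml.so, a quick way to see whether that exact name is reachable is to look for it in each LD_LIBRARY_PATH directory. A minimal sketch, assuming a POSIX shell; find_soname is a hypothetical helper, not something shipped with gpu-manager or the driver.

```shell
#!/bin/sh
# find_soname NAME PATHLIST
# Prints each directory in the colon-separated PATHLIST that contains NAME
# as an existing file. Symlinks are followed, so a dangling link does not count.
find_soname() {
    name=$1
    rest=$2:                        # trailing colon simplifies the loop
    while [ -n "$rest" ]; do
        d=${rest%%:*}               # first remaining directory
        rest=${rest#*:}             # everything after it
        if [ -n "$d" ] && [ -e "$d/$name" ]; then
            printf '%s\n' "$d"
        fi
    done
}

# Usage inside the pod:
#   find_soname libnvidia-ml.so.1 "$LD_LIBRARY_PATH"
```

An empty result for libnvidia-ml.so.1 alongside a hit for libnvidia-ml.so would be consistent with the directory listing reported earlier in the thread, where the gpu-manager directory had no libnvidia-ml.so.1 link.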

cailun01 commented 3 years ago

I built from the master branch, and the resulting gpu-manager image is 1.1.4:

REPOSITORY                       TAG                  IMAGE ID            CREATED             SIZE
tkestack/gpu-manager             1.1.4                0a74a803da06        2 days ago          10.8 GB
mYmNeo commented 3 years ago

Please provide the log lines that contain "Mirror %s to %s" and "Vcuda %s to %s".

cailun01 commented 3 years ago

Which log is this, and how do I get it?

swartz-k commented 3 years ago

You could check the ldconfig value in /etc/nvidia-container-runtime/config.toml. The default is "@/sbin/ldconfig"; try changing it to "/sbin/ldconfig".

cailun01 commented 3 years ago

Thanks for the reply! gpu-manager does not use nvidia-docker, so checking /etc/nvidia-container-runtime/config.toml has no effect.

hzliangbin commented 2 years ago

Those are gpu-manager's own logs. Raise the log level and you will see them.
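
Once those logs are available, the requested lines can be picked out with a simple filter. This is only a sketch: filter_mirror_logs is a hypothetical helper, the pattern is derived from the two format strings quoted in this thread, and the namespace and pod name in the usage comment are placeholders.

```shell
#!/bin/sh
# filter_mirror_logs
# Keeps only the stdin lines that look like "Mirror X to Y" or "Vcuda X to Y".
filter_mirror_logs() {
    grep -E '(Mirror|Vcuda) .+ to .+'
}

# Usage (replace the placeholders with your own namespace and pod name):
#   kubectl logs -n <namespace> <gpu-manager-pod> | filter_mirror_logs
```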

jxfruit commented 11 months ago

@mYmNeo, a question: my situation is similar to cailun01's, and in addition my pod has no nvidia devices under /dev/. I am using the latest gpu-manager image.

jxfruit commented 11 months ago

Does anyone else know how to solve this?