Open cailun01 opened 3 years ago
- Check LD_LIBRARY_PATH
- nvidia-docker is not supported, only runc
Please confirm these two points.
LD_LIBRARY_PATH does include the directories that contain libnvidia-ml.so. Output of echo $LD_LIBRARY_PATH:
/usr/local/nvidia/lib64:/usr/local/cuda/lib64/stubs:/usr/local/nvidia/lib
libnvidia-ml.so really is in the paths above, but nvidia-smi still cannot find it.
Output of find / -name libnvidia-ml.so:
/usr/local/cuda-10.1/targets/x86_64-linux/lib/stubs/libnvidia-ml.so
/usr/local/nvidia/lib/libnvidia-ml.so
/usr/local/nvidia/lib64/libnvidia-ml.so
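The manual check above (compare LD_LIBRARY_PATH against the find output) can be sketched in Python. This is only an illustration: the function name dirs_containing is made up here, and the second name libnvidia-ml.so.1 is included because a later reply in this thread says that is the name nvidia-smi opens.

```python
import os


def dirs_containing(search_path, filename):
    """Return the entries of a colon-separated search path that contain filename."""
    hits = []
    for entry in search_path.split(":"):
        # Empty entries are skipped here for simplicity.
        if entry and os.path.exists(os.path.join(entry, filename)):
            hits.append(entry)
    return hits


if __name__ == "__main__":
    path = os.environ.get("LD_LIBRARY_PATH", "")
    # Check both the unversioned name and the versioned name.
    for name in ("libnvidia-ml.so", "libnvidia-ml.so.1"):
        print(name, "->", dirs_containing(path, name) or "not found")
```

Run inside the container, this lists which LD_LIBRARY_PATH entries hold each name, which may show a name that is present under one spelling but absent under the other.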
I am not using nvidia-docker. My daemon.json is as follows:
{
  "log-level": "debug",
  "live-restore": true,
  "icc": false,
  "storage-driver": "overlay",
  "insecure-registries": ["qce-reg.nucpoc.com"],
  "live-restore": true,
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "512m",
    "max-file": "3"
  }
}
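As a side note, the daemon.json above lists "live-restore" twice. Both entries have the same value, so this is probably harmless, but a repeated key is easy to miss by eye. A minimal sketch for spotting such keys with Python's standard json module follows; the helper name duplicate_keys is hypothetical.

```python
import json


def duplicate_keys(text):
    """Return keys that occur more than once within any single JSON object."""
    dupes = []

    def check_pairs(pairs):
        # pairs holds every (key, value) of one object, repeats included.
        seen = set()
        for key, _ in pairs:
            if key in seen:
                dupes.append(key)
            seen.add(key)
        return dict(pairs)

    json.loads(text, object_pairs_hook=check_pairs)
    return dupes


if __name__ == "__main__":
    sample = '{"live-restore": true, "icc": false, "live-restore": true}'
    print(duplicate_keys(sample))
```

The object_pairs_hook is needed because json.loads otherwise keeps only the last value for a repeated key and reports nothing.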
Your last step, invoking ldconfig, may have broken the library path. Please post the output of ldconfig -p and check whether /usr/local/nvidia/lib64 appears in the result.
The output of ldconfig -p is below. It only lists libraries under /usr/lib64/ and /usr/local/cuda-10.1/targets/x86_64-linux/lib/; no library from /usr/local/nvidia/lib64 appears in the result.
241 libs found in cache `/etc/ld.so.cache'
p11-kit-trust.so (libc6,x86-64) => /lib64/p11-kit-trust.so
libz.so.1 (libc6,x86-64) => /lib64/libz.so.1
libxml2.so.2 (libc6,x86-64) => /lib64/libxml2.so.2
libverto.so.1 (libc6,x86-64) => /lib64/libverto.so.1
libuuid.so.1 (libc6,x86-64) => /lib64/libuuid.so.1
libutil.so.1 (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libutil.so.1
libutil.so (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libutil.so
libutempter.so.0 (libc6,x86-64) => /lib64/libutempter.so.0
libustr-1.0.so.1 (libc6,x86-64) => /lib64/libustr-1.0.so.1
libuser.so.1 (libc6,x86-64) => /lib64/libuser.so.1
libudev.so.1 (libc6,x86-64) => /lib64/libudev.so.1
libtinfo.so.5 (libc6,x86-64) => /lib64/libtinfo.so.5
libtic.so.5 (libc6,x86-64) => /lib64/libtic.so.5
libthread_db.so.1 (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libthread_db.so.1
libthread_db.so (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libthread_db.so
libtasn1.so.6 (libc6,x86-64) => /lib64/libtasn1.so.6
libsystemd.so.0 (libc6,x86-64) => /lib64/libsystemd.so.0
libsystemd-login.so.0 (libc6,x86-64) => /lib64/libsystemd-login.so.0
libsystemd-journal.so.0 (libc6,x86-64) => /lib64/libsystemd-journal.so.0
libsystemd-id128.so.0 (libc6,x86-64) => /lib64/libsystemd-id128.so.0
libsystemd-daemon.so.0 (libc6,x86-64) => /lib64/libsystemd-daemon.so.0
libstdc++.so.6 (libc6,x86-64) => /lib64/libstdc++.so.6
libssl3.so (libc6,x86-64) => /lib64/libssl3.so
libssl.so.10 (libc6,x86-64) => /lib64/libssl.so.10
libssh2.so.1 (libc6,x86-64) => /lib64/libssh2.so.1
libsqlite3.so.0 (libc6,x86-64) => /lib64/libsqlite3.so.0
libsoftokn3.so (libc6,x86-64) => /lib64/libsoftokn3.so
libsmime3.so (libc6,x86-64) => /lib64/libsmime3.so
libsmartcols.so.1 (libc6,x86-64) => /lib64/libsmartcols.so.1
libslapi-2.4.so.2 (libc6,x86-64) => /lib64/libslapi-2.4.so.2
libsepol.so.1 (libc6,x86-64) => /lib64/libsepol.so.1
libsemanage.so.1 (libc6,x86-64) => /lib64/libsemanage.so.1
libselinux.so.1 (libc6,x86-64) => /lib64/libselinux.so.1
libsasl2.so.3 (libc6,x86-64) => /lib64/libsasl2.so.3
librt.so.1 (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/librt.so.1
librt.so (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/librt.so
librpmsign.so.1 (libc6,x86-64) => /lib64/librpmsign.so.1
librpmio.so.3 (libc6,x86-64) => /lib64/librpmio.so.3
librpmbuild.so.3 (libc6,x86-64) => /lib64/librpmbuild.so.3
librpm.so.3 (libc6,x86-64) => /lib64/librpm.so.3
libresolv.so.2 (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libresolv.so.2
libresolv.so (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libresolv.so
libreadline.so.6 (libc6,x86-64) => /lib64/libreadline.so.6
libqrencode.so.3 (libc6,x86-64) => /lib64/libqrencode.so.3
libp11-kit.so.0 (libc6,x86-64) => /lib64/libp11-kit.so.0
libpython2.7.so.1.0 (libc6,x86-64) => /lib64/libpython2.7.so.1.0
libpwquality.so.1 (libc6,x86-64) => /lib64/libpwquality.so.1
libpthread.so.0 (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libpthread.so.0
libpth.so.20 (libc6,x86-64) => /lib64/libpth.so.20
libprocps.so.4 (libc6,x86-64) => /lib64/libprocps.so.4
libpopt.so.0 (libc6,x86-64) => /lib64/libpopt.so.0
libplds4.so (libc6,x86-64) => /lib64/libplds4.so
libplc4.so (libc6,x86-64) => /lib64/libplc4.so
libpcre32.so.0 (libc6,x86-64) => /lib64/libpcre32.so.0
libpcre16.so.0 (libc6,x86-64) => /lib64/libpcre16.so.0
libpcreposix.so.0 (libc6,x86-64) => /lib64/libpcreposix.so.0
libpcrecpp.so.0 (libc6,x86-64) => /lib64/libpcrecpp.so.0
libpcre.so.1 (libc6,x86-64) => /lib64/libpcre.so.1
libpcprofile.so (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libpcprofile.so
libpanelw.so.5 (libc6,x86-64) => /lib64/libpanelw.so.5
libpanel.so.5 (libc6,x86-64) => /lib64/libpanel.so.5
libpamc.so.0 (libc6,x86-64) => /lib64/libpamc.so.0
libpam_misc.so.0 (libc6,x86-64) => /lib64/libpam_misc.so.0
libpam.so.0 (libc6,x86-64) => /lib64/libpam.so.0
libopcodes-2.27-44.base.el7.so (libc6,x86-64) => /lib64/libopcodes-2.27-44.base.el7.so
libnvrtc.so.10.1 (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libnvrtc.so.10.1
libnvrtc.so (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libnvrtc.so
libnvrtc-builtins.so.10.1 (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libnvrtc-builtins.so.10.1
libnvrtc-builtins.so (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libnvrtc-builtins.so
libnvjpeg.so.10 (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libnvjpeg.so.10
libnvjpeg.so (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libnvjpeg.so
libnvgraph.so.10 (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libnvgraph.so.10
libnvgraph.so (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libnvgraph.so
libnvblas.so.10 (libc6,x86-64) => /lib64/libnvblas.so.10
libnvblas.so (libc6,x86-64) => /lib64/libnvblas.so
libnvToolsExt.so.1 (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libnvToolsExt.so.1
libnvToolsExt.so (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libnvToolsExt.so
libnss3.so (libc6,x86-64) => /lib64/libnss3.so
libnssutil3.so (libc6,x86-64) => /lib64/libnssutil3.so
libnsssysinit.so (libc6,x86-64) => /lib64/libnsssysinit.so
libnsspem.so (libc6,x86-64) => /lib64/libnsspem.so
libnssdbm3.so (libc6,x86-64) => /lib64/libnssdbm3.so
libnss_nisplus.so.2 (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libnss_nisplus.so.2
libnss_nisplus.so (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libnss_nisplus.so
libnss_nis.so.2 (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libnss_nis.so.2
libnss_nis.so (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libnss_nis.so
libnss_mymachines.so.2 (libc6,x86-64) => /lib64/libnss_mymachines.so.2
libnss_myhostname.so.2 (libc6,x86-64) => /lib64/libnss_myhostname.so.2
libnss_hesiod.so.2 (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libnss_hesiod.so.2
libnss_hesiod.so (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libnss_hesiod.so
libnss_files.so.2 (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libnss_files.so.2
libnss_files.so (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libnss_files.so
libnss_dns.so.2 (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libnss_dns.so.2
libnss_dns.so (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libnss_dns.so
libnss_db.so.2 (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libnss_db.so.2
libnss_db.so (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libnss_db.so
libnss_compat.so.2 (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libnss_compat.so.2
libnss_compat.so (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libnss_compat.so
libnspr4.so (libc6,x86-64) => /lib64/libnspr4.so
libnsl.so.1 (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libnsl.so.1
libnsl.so (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libnsl.so
libnpps.so.10 (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libnpps.so.10
libnpps.so (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libnpps.so
libnppitc.so.10 (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libnppitc.so.10
libnppitc.so (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libnppitc.so
libnppisu.so.10 (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libnppisu.so.10
libnppisu.so (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libnppisu.so
libnppist.so.10 (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libnppist.so.10
libnppist.so (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libnppist.so
libnppim.so.10 (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libnppim.so.10
libnppim.so (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libnppim.so
libnppig.so.10 (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libnppig.so.10
libnppig.so (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libnppig.so
libnppif.so.10 (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libnppif.so.10
libnppif.so (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libnppif.so
libnppidei.so.10 (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libnppidei.so.10
libnppidei.so (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libnppidei.so
libnppicom.so.10 (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libnppicom.so.10
libnppicom.so (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libnppicom.so
libnppicc.so.10 (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libnppicc.so.10
libnppicc.so (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libnppicc.so
libnppial.so.10 (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libnppial.so.10
libnppial.so (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libnppial.so
libnppc.so.10 (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libnppc.so.10
libnppc.so (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libnppc.so
libncursesw.so.5 (libc6,x86-64) => /lib64/libncursesw.so.5
libncurses.so.5 (libc6,x86-64) => /lib64/libncurses.so.5
libncurses++w.so.5 (libc6,x86-64) => /lib64/libncurses++w.so.5
libncurses++.so.5 (libc6,x86-64) => /lib64/libncurses++.so.5
libnccl.so.2 (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libnccl.so.2
libnccl.so (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libnccl.so
libmpfr.so.4 (libc6,x86-64) => /lib64/libmpfr.so.4
libmpc.so.3 (libc6,x86-64) => /lib64/libmpc.so.3
libmount.so.1 (libc6,x86-64) => /lib64/libmount.so.1
libmenuw.so.5 (libc6,x86-64) => /lib64/libmenuw.so.5
libmenu.so.5 (libc6,x86-64) => /lib64/libmenu.so.5
libmemusage.so (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libmemusage.so
libmagic.so.1 (libc6,x86-64) => /lib64/libmagic.so.1
libm.so.6 (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libm.so.6
libm.so (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libm.so
liblz4.so.1 (libc6,x86-64) => /lib64/liblz4.so.1
liblzma.so.5 (libc6,x86-64) => /lib64/liblzma.so.5
liblua-5.1.so (libc6,x86-64) => /lib64/liblua-5.1.so
libldap_r-2.4.so.2 (libc6,x86-64) => /lib64/libldap_r-2.4.so.2
libldap-2.4.so.2 (libc6,x86-64) => /lib64/libldap-2.4.so.2
liblber-2.4.so.2 (libc6,x86-64) => /lib64/liblber-2.4.so.2
libk5crypto.so.3 (libc6,x86-64) => /lib64/libk5crypto.so.3
libkrb5support.so.0 (libc6,x86-64) => /lib64/libkrb5support.so.0
libkrb5.so.3 (libc6,x86-64) => /lib64/libkrb5.so.3
libkrad.so.0 (libc6,x86-64) => /lib64/libkrad.so.0
libkmod.so.2 (libc6,x86-64) => /lib64/libkmod.so.2
libkeyutils.so.1 (libc6,x86-64) => /lib64/libkeyutils.so.1
libkdb5.so.8 (libc6,x86-64) => /lib64/libkdb5.so.8
libjson.so.0 (libc6,x86-64) => /lib64/libjson.so.0
libjson-c.so.2 (libc6,x86-64) => /lib64/libjson-c.so.2
libidn.so.11 (libc6,x86-64) => /lib64/libidn.so.11
libhistory.so.6 (libc6,x86-64) => /lib64/libhistory.so.6
libgthread-2.0.so.0 (libc6,x86-64) => /lib64/libgthread-2.0.so.0
libgssrpc.so.4 (libc6,x86-64) => /lib64/libgssrpc.so.4
libgssapi_krb5.so.2 (libc6,x86-64) => /lib64/libgssapi_krb5.so.2
libgpgme.so.11 (libc6,x86-64) => /lib64/libgpgme.so.11
libgpgme-pthread.so.11 (libc6,x86-64) => /lib64/libgpgme-pthread.so.11
libgpg-error.so.0 (libc6,x86-64) => /lib64/libgpg-error.so.0
libgomp.so.1 (libc6,x86-64) => /lib64/libgomp.so.1
libgobject-2.0.so.0 (libc6,x86-64) => /lib64/libgobject-2.0.so.0
libgmpxx.so.4 (libc6,x86-64) => /lib64/libgmpxx.so.4
libgmp.so.10 (libc6,x86-64) => /lib64/libgmp.so.10
libgmodule-2.0.so.0 (libc6,x86-64) => /lib64/libgmodule-2.0.so.0
libglib-2.0.so.0 (libc6,x86-64) => /lib64/libglib-2.0.so.0
libgirepository-1.0.so.1 (libc6,x86-64) => /lib64/libgirepository-1.0.so.1
libgio-2.0.so.0 (libc6,x86-64) => /lib64/libgio-2.0.so.0
libgdbm_compat.so.4 (libc6,x86-64) => /lib64/libgdbm_compat.so.4
libgdbm.so.4 (libc6,x86-64) => /lib64/libgdbm.so.4
libgcrypt.so.11 (libc6,x86-64) => /lib64/libgcrypt.so.11
libgcc_s.so.1 (libc6,x86-64) => /lib64/libgcc_s.so.1
libfreebl3.so (libc6,x86-64) => /lib64/libfreebl3.so
libfreeblpriv3.so (libc6,x86-64) => /lib64/libfreeblpriv3.so
libformw.so.5 (libc6,x86-64) => /lib64/libformw.so.5
libform.so.5 (libc6,x86-64) => /lib64/libform.so.5
libffi.so.6 (libc6,x86-64) => /lib64/libffi.so.6
libexpat.so.1 (libc6,x86-64) => /lib64/libexpat.so.1
libelf.so.1 (libc6,x86-64) => /lib64/libelf.so.1
libdw.so.1 (libc6,x86-64) => /lib64/libdw.so.1
libdl.so.2 (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libdl.so.2
libdl.so (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libdl.so
libdevmapper.so.1.02 (libc6,x86-64) => /lib64/libdevmapper.so.1.02
libdbus-1.so.3 (libc6,x86-64) => /lib64/libdbus-1.so.3
libdbus-glib-1.so.2 (libc6,x86-64) => /lib64/libdbus-glib-1.so.2
libdb-5.3.so (libc6,x86-64) => /lib64/libdb-5.3.so
libcusparse.so.10 (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libcusparse.so.10
libcusparse.so (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libcusparse.so
libcusolverMg.so.10 (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libcusolverMg.so.10
libcusolverMg.so (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libcusolverMg.so
libcusolver.so.10 (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libcusolver.so.10
libcusolver.so (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libcusolver.so
libcurl.so.4 (libc6,x86-64) => /lib64/libcurl.so.4
libcurand.so.10 (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libcurand.so.10
libcurand.so (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libcurand.so
libcuinj64.so.10.1 (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libcuinj64.so.10.1
libcuinj64.so (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libcuinj64.so
libcufftw.so.10 (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libcufftw.so.10
libcufftw.so (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libcufftw.so
libcufft.so.10 (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libcufft.so.10
libcufft.so (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libcufft.so
libcudart.so.10.1 (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libcudart.so.10.1
libcudart.so (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libcudart.so
libcublasLt.so.10 (libc6,x86-64) => /lib64/libcublasLt.so.10
libcublasLt.so (libc6,x86-64) => /lib64/libcublasLt.so
libcublas.so.10 (libc6,x86-64) => /lib64/libcublas.so.10
libcublas.so (libc6,x86-64) => /lib64/libcublas.so
libcryptsetup.so.12 (libc6,x86-64) => /lib64/libcryptsetup.so.12
libcryptsetup.so.4 (libc6,x86-64) => /lib64/libcryptsetup.so.4
libcrypto.so.10 (libc6,x86-64) => /lib64/libcrypto.so.10
libcrypt.so.1 (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libcrypt.so.1
libcrypt.so (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libcrypt.so
libcrack.so.2 (libc6,x86-64) => /lib64/libcrack.so.2
libcom_err.so.2 (libc6,x86-64) => /lib64/libcom_err.so.2
libcidn.so.1 (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libcidn.so.1
libcidn.so (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libcidn.so
libcap.so.2 (libc6,x86-64) => /lib64/libcap.so.2
libcap-ng.so.0 (libc6,x86-64) => /lib64/libcap-ng.so.0
libc.so.6 (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libc.so.6
libbz2.so.1 (libc6,x86-64) => /lib64/libbz2.so.1
libblkid.so.1 (libc6,x86-64) => /lib64/libblkid.so.1
libbfd-2.27-44.base.el7.so (libc6,x86-64) => /lib64/libbfd-2.27-44.base.el7.so
libauparse.so.0 (libc6,x86-64) => /lib64/libauparse.so.0
libaudit.so.1 (libc6,x86-64) => /lib64/libaudit.so.1
libattr.so.1 (libc6,x86-64) => /lib64/libattr.so.1
libassuan.so.0 (libc6,x86-64) => /lib64/libassuan.so.0
libasm.so.1 (libc6,x86-64) => /lib64/libasm.so.1
libanl.so.1 (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libanl.so.1
libanl.so (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libanl.so
libacl.so.1 (libc6,x86-64) => /lib64/libacl.so.1
libaccinj64.so.10.1 (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libaccinj64.so.10.1
libaccinj64.so (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libaccinj64.so
libSegFault.so (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libSegFault.so
libOpenCL.so.1 (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libOpenCL.so.1
libOpenCL.so (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libOpenCL.so
libBrokenLocale.so.1 (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libBrokenLocale.so.1
libBrokenLocale.so (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libBrokenLocale.so
ld-linux-x86-64.so.2 (libc6,x86-64) => /lib64/ld-linux-x86-64.so.2
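Scanning 241 cache entries by eye is error-prone, so here is a rough Python sketch that parses text in the ldconfig -p format shown above and reports which libraries resolve under a given directory. The function names are invented for illustration, and the sample lines are copied from the listing above.

```python
import re

# Matches lines such as:
#   libz.so.1 (libc6,x86-64) => /lib64/libz.so.1
ENTRY = re.compile(r"^\s*(\S+)\s+\((.*?)\)\s+=>\s+(\S+)\s*$")


def parse_ldconfig(text):
    """Map each library name in `ldconfig -p` output to its list of paths."""
    cache = {}
    for line in text.splitlines():
        m = ENTRY.match(line)
        if m:  # the "N libs found in cache" header line is skipped
            cache.setdefault(m.group(1), []).append(m.group(3))
    return cache


def libs_under(cache, directory):
    """Return the sorted names of cached libraries located under directory."""
    prefix = directory.rstrip("/") + "/"
    return sorted(name for name, paths in cache.items()
                  if any(p.startswith(prefix) for p in paths))


if __name__ == "__main__":
    sample = """241 libs found in cache `/etc/ld.so.cache'
libz.so.1 (libc6,x86-64) => /lib64/libz.so.1
libutil.so.1 (libc6,x86-64, OS ABI: Linux 2.6.32) => /lib64/libutil.so.1
libnvrtc.so (libc6,x86-64) => /usr/local/cuda-10.1/targets/x86_64-linux/lib/libnvrtc.so
"""
    cache = parse_ldconfig(sample)
    print("libnvidia-ml.so.1 cached:", "libnvidia-ml.so.1" in cache)
    print("under /usr/local/nvidia/lib64:", libs_under(cache, "/usr/local/nvidia/lib64"))
```

To check a real system, the sample string could be replaced with the captured output of ldconfig -p.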
Is there a dangling symbolic link named libnvidia-ml.so or libnvidia-ml.so.1 in your image? If so, remove it.
I ran both find . -xtype l and symlinks . and neither reported any broken symbolic links.
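For reference, the same check can be written in a few lines of Python. This sketch is meant to behave like find -xtype l for the common case of a link whose target is missing; the function name is made up for this example.

```python
import os
import sys


def dangling_symlinks(root):
    """Yield symlinks under root whose target does not exist."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # islink() inspects the link itself; exists() follows it.
            if os.path.islink(path) and not os.path.exists(path):
                yield path


if __name__ == "__main__":
    start = sys.argv[1] if len(sys.argv) > 1 else "."
    for path in dangling_symlinks(start):
        print(path)
```

An empty result only says that no link is broken. It does not say that every name the loader asks for is present.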
I noticed something. On the host, searching the driver installation directory /usr/lib64 for libnvidia-ml shows two links (libnvidia-ml.so, libnvidia-ml.so.1) and one shared library (libnvidia-ml.so.450.66):
lrwxrwxrwx 1 root root 17 Jun 11 10:29 libnvidia-ml.so -> libnvidia-ml.so.1
lrwxrwxrwx 1 root root 22 Jun 11 10:29 libnvidia-ml.so.1 -> libnvidia-ml.so.450.66
-rwxr-xr-x 1 root root 1905848 Jun 11 10:29 libnvidia-ml.so.450.66
In gpu-manager's directory /etc/gpu-manager/vdriver/nvidia/lib64, however, searching for libnvidia-ml shows no libnvidia-ml.so.1. There is only the libnvidia-ml.so link and two shared libraries (libnvidia-ml.so.450.66 and libnvidia-ml.so.440.36):
lrwxrwxrwx 1 root root 22 Jun 15 14:21 libnvidia-ml.so -> libnvidia-ml.so.450.66
-rwxr-xr-x 1 root root 1465752 Jun 2 14:15 libnvidia-ml.so.440.36
-rwxr-xr-x 1 root root 1905848 Jun 15 14:21 libnvidia-ml.so.450.66
The libnvidia-ml.so.440.36 library belongs to an older driver version that should already have been removed from my node.
Could the missing libnvidia-ml.so.1 be the cause?
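The comparison between the two directories can be sketched as a small Python helper that reports where each expected name resolves. The helper name link_report is hypothetical, and whether the missing libnvidia-ml.so.1 link is the actual cause is still only a guess at this point in the thread.

```python
import os


def link_report(directory, names=("libnvidia-ml.so", "libnvidia-ml.so.1")):
    """Map each name to its resolved real path, or None if absent or broken."""
    report = {}
    for name in names:
        path = os.path.join(directory, name)
        # exists() follows symlinks, so a dangling link also yields None.
        report[name] = os.path.realpath(path) if os.path.exists(path) else None
    return report


if __name__ == "__main__":
    for d in ("/usr/lib64", "/etc/gpu-manager/vdriver/nvidia/lib64"):
        print(d, link_report(d))
```

Given the listings above, the second directory would be expected to report None for libnvidia-ml.so.1.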
There is also a workaround: loading the shared library manually through the LD_PRELOAD environment variable solves the problem:
# LD_PRELOAD=/usr/local/nvidia/lib64/libnvidia-ml.so nvidia-smi
Thu Jun 17 06:46:25 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.66       Driver Version: 450.66       CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:04:00.0 Off |                  N/A |
| 23%   40C    P8    17W / 250W |      0MiB / 11176MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
So the shared library itself should be valid; I just do not know why the nvidia-smi command cannot find it.
nvidia-smi tries to dlopen libnvidia-ml.so.1. What version of gpu-manager are you using?
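To reproduce that lookup without nvidia-smi, one option is to ask the dynamic loader for the same name from Python. On Linux, ctypes.CDLL passes the name to dlopen, so the usual search rules (LD_LIBRARY_PATH, the ld.so cache, default directories) should apply. The wrapper name try_dlopen is invented for this sketch.

```python
import ctypes


def try_dlopen(soname):
    """Attempt to load a shared library by name; return (ok, error_message)."""
    try:
        ctypes.CDLL(soname)
        return True, ""
    except OSError as exc:
        return False, str(exc)


if __name__ == "__main__":
    # The versioned name is the one nvidia-smi is said to open.
    for name in ("libnvidia-ml.so.1", "libnvidia-ml.so"):
        ok, err = try_dlopen(name)
        print(name, "loaded" if ok else "failed: " + err)
```

If the unversioned name loads while libnvidia-ml.so.1 fails, that would be consistent with the missing link described earlier in the thread.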
I pulled the code from the master branch; the gpu-manager image I built is 1.1.4:
REPOSITORY TAG IMAGE ID CREATED SIZE
tkestack/gpu-manager 1.1.4 0a74a803da06 2 days ago 10.8 GB
Please provide the logs that contain Mirror %s to %s and Vcuda %s to %s.
What logs are these? How do I get them?
You could check the ldconfig value in /etc/nvidia-container-runtime/config.toml. The default is "@/sbin/ldconfig"; try changing it to "/sbin/ldconfig".
Thanks for the reply! gpu-manager does not use nvidia-docker, so checking /etc/nvidia-container-runtime/config.toml has no effect.
They are gpu-manager's own logs; raise the log level and you will see them.
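Once the logs are saved to a file, the two requested kinds of lines can be pulled out with a small filter. This is a sketch under assumptions: the pattern is built only from the format strings quoted in this thread ("Mirror %s to %s" and "Vcuda %s to %s"), it assumes the substituted values contain no whitespace, and the surrounding log line layout is unknown.

```python
import re
import sys

# Built from the two format strings quoted in the thread.
PATTERN = re.compile(r"\b(Mirror|Vcuda) (\S+) to (\S+)")


def mirror_entries(lines):
    """Return (kind, source, destination) tuples for matching log lines."""
    found = []
    for line in lines:
        m = PATTERN.search(line)
        if m:
            found.append(m.groups())
    return found


if __name__ == "__main__":
    # Pass the path of a saved log file as the first argument.
    if len(sys.argv) > 1:
        with open(sys.argv[1], errors="replace") as fh:
            for kind, src, dst in mirror_entries(fh):
                print(kind, src, "->", dst)
```

Searching the result for libnvidia-ml should show whether any libnvidia-ml.so.1 entry was handled at all.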
@mYmNeo My situation is similar to cailun01's, and in addition my pod has no NVIDIA devices under /dev/. I am using the latest version of gpu-manager.
Does anyone else know how to solve this?
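After creating the vcuda pod, running nvidia-smi reports an error that libnvidia-ml.so cannot be found:
However, I can find libnvidia-ml.so inside the pod:
Running ldconfig.real prints the following log:
From what I found online, this might be a GPU driver problem, but nvidia-smi runs normally on the host.
I also uninstalled the driver on the host and reinstalled it, and it still does not work.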