Open flesicek opened 7 years ago
+1
+1
I am working on supporting the NVIDIA CUDA drivers in RancherOS. The problem right now is that the OS kernel does not seem to identify the hardware device, so the driver cannot be installed.
The following contrasts the device listing on RancherOS and Ubuntu 16.04:
RancherOS: 00:1e.0 Class 0302: 10de:102d
Ubuntu16.04: 00:1e.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
Drivers identify the hardware by its PCI ID; 10de:102d is the ID of that card. RancherOS does not load a database of ID->human friendly name mappings, but this shouldn't affect the driver detecting the card.
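To illustrate the point about PCI IDs, here is a minimal sketch that checks a numeric device listing line for NVIDIA's vendor ID (10de). The helper name is_nvidia_pci_line is hypothetical, and it assumes the "Class xxxx: vendor:device" line format shown in the RancherOS output above.

```shell
# Hypothetical helper: succeeds if a numeric PCI listing line carries the
# NVIDIA vendor ID (10de), regardless of whether a name database is present.
is_nvidia_pci_line() {
  case "$1" in
    *" 10de:"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example with the line reported on RancherOS in this thread.
if is_nvidia_pci_line "00:1e.0 Class 0302: 10de:102d"; then
  echo "NVIDIA device present"
fi
```

The missing human-friendly name only affects how the device is displayed; the vendor:device pair is what matters for matching.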
@vincent99 Yes, drivers identify hardware by PCI IDs. The installation error was caused by something else. Thank you very much.
Tested with rancheros v1.4.0-rc1.
@wchao1241 I tried to verify this issue by following https://github.com/rancher/os-services/tree/master/n, but running /var/lib/rancher/nvidia/build.sh produced errors that left me no way to continue.
The error output is below:
ERROR: An error occurred while performing the step: "Building kernel modules". See /var/log/nvidia-installer.log for details.
ERROR: An error occurred while performing the step: "Checking to see whether the nvidia kernel module was successfully built". See /var/log/nvidia-installer.log for details.
ERROR: The nvidia kernel module was not created.
ERROR: Installation has failed. Please see the file '/var/log/nvidia-installer.log' for details. You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.
nvidia-docker does not work with our new kernel, and we have asked that community for help. We will keep this issue open until we find a solution, and we are removing this feature from the v1.4.0 milestone.
Does this work now?
We have fixed the kernel issue. I think we can add nvidia-docker support in the next release.
Awesome. Do you know when the next release might be for RancherOS?
Hi there! Any update on this?
It can work in the Ubuntu console, but we want to support it in the default console. We are making the final effort.
@niusmallnan What is the current status for the nVidia CUDA integration? Is it possible to deploy it in some way?
@niusmallnan When will version 1.5.1 be released?
@tech98321469320842 At the end of Feb.
I want to integrate nvidia-docker2, currently mainly related to these projects.
They provide almost only deb and rpm packages, and installing from binaries seems difficult. So at this stage I don't plan to support it in the default console; I will give priority to supporting it in the ubuntu console.
Usually we would just add the apt source and install the corresponding deb, but that is a problem on ROS: nvidia-docker2 depends on the docker deb packages, and ROS does not use deb packages to manage docker.
# https://github.com/NVIDIA/nvidia-docker/blob/master/debian/control
Package: nvidia-docker2
Architecture: all
Breaks: nvidia-docker (<< 2.0.0)
Replaces: nvidia-docker (<< 2.0.0)
Depends: ${misc:Depends}, nvidia-container-runtime (= @RUNTIME_VERSION@), @DOCKER_VERSION@
So I can build a custom nvidia-docker2 that simply drops this dependency.
Boot a VM (Ubuntu 18.04) and build the package after applying this patch:
diff --git a/debian/control b/debian/control
index d06d85f..86d2023 100644
--- a/debian/control
+++ b/debian/control
@@ -12,7 +12,7 @@ Package: nvidia-docker2
Architecture: all
Breaks: nvidia-docker (<< 2.0.0)
Replaces: nvidia-docker (<< 2.0.0)
-Depends: ${misc:Depends}, nvidia-container-runtime (= @RUNTIME_VERSION@), @DOCKER_VERSION@
+Depends: ${misc:Depends}, nvidia-container-runtime (= @RUNTIME_VERSION@)
Description: nvidia-docker CLI wrapper
Replaces nvidia-docker with a new implementation based on
nvidia-container-runtime
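For anyone who prefers to script the change rather than apply the patch by hand, the same edit can be sketched with sed. This is only an illustration: the helper name strip_docker_dep is hypothetical, and it assumes @DOCKER_VERSION@ is the last entry on the Depends line, as in the control file quoted above.

```shell
# Hypothetical helper: print a control file with the trailing
# @DOCKER_VERSION@ entry removed from the Depends line.
strip_docker_dep() {
  sed '/^Depends:/ s/,[[:space:]]*@DOCKER_VERSION@[[:space:]]*$//' "$1"
}

# Example usage (assumes the current directory is an nvidia-docker checkout):
# strip_docker_dep debian/control > debian/control.new &&
#   mv debian/control.new debian/control
```

Writing to a temporary file first keeps the original control file intact if the substitution does not do what you expect.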
Just run make 18.06.1-ubuntu18.04; you can replace 18.06.1 if you want to support another docker version.
Boot a ROS (v1.5.0) instance and add the nvidia-docker repo. This requires the ubuntu console:
ros console switch ubuntu
apt update && apt install gnupg
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | \
sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/ubuntu18.04/nvidia-docker.list | \
sudo tee /etc/apt/sources.list.d/nvidia-docker.list
apt update
Install packages
apt install nvidia-container-runtime=2.0.0+docker18.06.1-1
# install your custom nvidia-docker
dpkg -i nvidia-docker2_2.0.3+docker18.06.1-1_all.deb
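After installing the packages, one way to sanity-check the result is to look for the nvidia runtime in the output of docker info. The sketch below assumes docker info prints a "Runtimes:" line listing the registered runtimes; the helper name has_nvidia_runtime is hypothetical, and whether the runtime gets registered automatically on ROS is not confirmed in this thread.

```shell
# Hypothetical helper: reads `docker info` output on stdin and succeeds
# if the "Runtimes:" line lists a runtime named "nvidia".
has_nvidia_runtime() {
  grep -Eq '^[[:space:]]*Runtimes:.*[[:space:]]nvidia([[:space:]]|$)'
}

# Example usage on a host with docker available:
# docker info 2>/dev/null | has_nvidia_runtime && echo "nvidia runtime registered"
```

If the runtime is not listed, docker run --runtime=nvidia will fail before the container starts.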
Tesla K80 installed
lspci | grep NVIDIA
04:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
05:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
Is there a special way of installing the NVIDIA driver on RancherOS 1.5.0?
docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi
reports an error: docker: Error response from daemon: OCI runtime create failed: container_linux.go:348: starting container process caused "process_linux.go:402: container init caused \"process_linux.go:385: running prestart hook 1 caused \\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig.real --device=all --compute --utility --require=cuda>=10.0 brand=tesla,driver>=384,driver<385 --pid=8281 /var/lib/docker/overlay2/9b72f828525e4a83bd6084006f7caf8af91ad99b8249cc50e89de7053f24462e/merged]\\nnvidia-container-cli: initialization error: cuda error: no cuda-capable device is detected\\n\\"\"": unknown.
I got a similar error
lspci | grep NVIDIA
01:00.0 3D controller: NVIDIA Corporation GK106M [GeForce GTX 765M] (rev a1)
docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi
docker: Error response from daemon: OCI runtime create failed: container_linux.go:344: starting container process caused "process_linux.go:424: container init caused \"process_linux.go:407: running prestart hook 1 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig.real --device=all --compute --utility --require=cuda>=9.0 --pid=8168 /var/lib/docker/overlay2/514c92b8bb41862f9810364512638163ada28875a193b4f4200e7d6563ee15ac/merged]\\\\nnvidia-container-cli: initialization error: driver error: failed to process request\\\\n\\\"\"": unknown.
although the end of the error is different: initialization error: driver error: failed to process request
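Since the useful part of these messages is buried at the end of a long escaped string, a small filter can pull it out for comparison. This is a sketch under the assumption that the relevant text starts at "nvidia-container-cli: " and runs up to the next backslash, as in the two errors above; the helper name extract_nvidia_cli_error is hypothetical.

```shell
# Hypothetical helper: reads a docker error message on stdin and prints only
# the trailing "nvidia-container-cli: ..." diagnostic, up to the next backslash.
extract_nvidia_cli_error() {
  sed -n 's/.*\(nvidia-container-cli: [^\\]*\).*/\1/p'
}

# Example usage on a host with docker and the nvidia runtime installed:
# docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi 2>&1 | extract_nvidia_cli_error
```

For the two reports above this would leave "initialization error: cuda error: no cuda-capable device is detected" and "initialization error: driver error: failed to process request", which makes the difference easier to see.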
EDIT:
It turned out the NVIDIA driver was not correctly installed for me. I can't install it correctly because the nouveau kernel module gets loaded on every boot even though it is blacklisted by /etc/modprobe.d/nvidia-installer-disable-nouveau.conf
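A quick way to confirm whether nouveau is actually loaded after a reboot is to look for it in the kernel's module list. The sketch below assumes the /proc/modules format, where the module name is the first field on each line; the helper name is_module_loaded is hypothetical.

```shell
# Hypothetical helper: reads a module list in /proc/modules format on stdin
# and succeeds if the named module appears as the first field of a line.
is_module_loaded() {
  awk -v mod="$1" '$1 == mod { found = 1 } END { exit !found }'
}

# Example usage on a Linux host:
# is_module_loaded nouveau < /proc/modules && echo "nouveau is still loaded"
```

This only reports the current state; it does not explain why the blacklist file is being ignored.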
@niusmallnan So it's fixed now?
I went through the steps outlined by @niusmallnan above and kept running into the following error:
nvidia-container-cli: ldcache error: process /sbin/ldconfig.real failed with error code: 1
After a little digging, I found that it's failing when trying to pivot_root here:
https://github.com/NVIDIA/libnvidia-container/blob/deccb2801502675bd283c6936861814dbca99ecd/src/nvc_ldcache.c#L117
I'm not sure why it's failing there or how to fix it, but thought I'd share my findings in case it helps someone else narrow down the issue.
@davidhyman did you get this to work on rancher OS using the patched package?
Is this issue still being worked on and if so, any update on the status?
+1 interested in the status of this issue
+1 interested in the status of this issue
+1 interested in the status of this issue
+1 interested in the status of this issue
Looking through the rancher docs I found this page that talks about scheduling pods to nodes with gpus for what it's worth.
+1 interested in the status of this issue
+1 interested in the status of this issue
Could you please consider supporting the NVIDIA CUDA drivers in RancherOS?
NVIDIA already provides docker support here: https://github.com/NVIDIA/nvidia-docker