rancher / os

Tiny Linux distro that runs the entire OS as Docker containers
https://rancher.com/docs/os/v1.x/en/
Apache License 2.0

NVidia cuda support #1637

Open flesicek opened 7 years ago

flesicek commented 7 years ago

Could you please consider supporting the NVIDIA CUDA drivers in RancherOS?

NVIDIA already provides Docker support here: https://github.com/NVIDIA/nvidia-docker

doprdele commented 6 years ago

+1

lost-carrier commented 6 years ago

+1

wchao1241 commented 6 years ago

I am working on support for the NVIDIA CUDA drivers in RancherOS. But there is currently a problem: hardware devices cannot be identified in the OS kernel, so the driver cannot be installed.

The following compares the device listing on RancherOS and Ubuntu 16.04:

RancherOS: 00:1e.0 Class 0302: 10de:102d

Ubuntu 16.04: 00:1e.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)

vincent99 commented 6 years ago

Drivers identify the hardware by their PCI IDs; 10de:102d is the ID of that card. RancherOS does not load a database mapping IDs to human-friendly names, but this shouldn't affect the driver's ability to detect the card.
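
As a rough illustration of matching on the ID rather than the name, here is a small shell sketch. The function names pci_id and is_nvidia are made up for this example, and it assumes a numeric listing line shaped like the RancherOS one quoted above; 10de is NVIDIA's PCI vendor ID.

```shell
#!/bin/sh
# Extract the vendor:device pair (four hex digits, a colon, four hex digits)
# from a numeric device listing line such as "00:1e.0 Class 0302: 10de:102d".
# pci_id is a hypothetical helper name used only in this sketch.
pci_id() {
    printf '%s\n' "$1" | sed -n 's/.* \([0-9a-f]\{4\}:[0-9a-f]\{4\}\).*/\1/p'
}

# Succeed if the line describes an NVIDIA device (PCI vendor ID 10de).
is_nvidia() {
    case "$(pci_id "$1")" in
        10de:*) return 0 ;;
        *)      return 1 ;;
    esac
}

line='00:1e.0 Class 0302: 10de:102d'
if is_nvidia "$line"; then
    echo "NVIDIA device found: $(pci_id "$line")"
fi
```

The missing name database only changes what is printed for the device; the ID itself is still there to match on.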

wchao1241 commented 6 years ago

@vincent99 Yes, drivers identify hardware by PCI IDs. The installation error is caused by something else. Thank you very much.

kingsd041 commented 6 years ago

Tested with RancherOS v1.4.0-rc1. @wchao1241 I tried to verify this issue following https://github.com/rancher/os-services/tree/master/n, but I hit errors while running /var/lib/rancher/nvidia/build.sh that left me unable to continue. The error output is below:

  ERROR: An error occurred while performing the step: "Building kernel modules". See /var/log/nvidia-installer.log for details.

  ERROR: An error occurred while performing the step: "Checking to see whether the nvidia kernel module was successfully built". See
         /var/log/nvidia-installer.log for details.

  ERROR: The nvidia kernel module was not created.

  ERROR: Installation has failed.  Please see the file '/var/log/nvidia-installer.log' for details.  You may find suggestions on fixing installation
         problems in the README available on the Linux driver download page at www.nvidia.com.

niusmallnan commented 6 years ago

nvidia-docker does not work with our new kernel; we have asked that community for help. We will keep this issue open until we find a solution, and remove this feature from the v1.4.0 milestone.

doprdele commented 6 years ago

Does this work now?

niusmallnan commented 6 years ago

We have fixed the kernel issue; I think we can add nvidia-docker support in the next release.

doprdele commented 6 years ago

Awesome. Do you know when the next release might be for RancherOS?


mcapuccini commented 5 years ago

Hi there! Any update on this?

niusmallnan commented 5 years ago

It can work in the Ubuntu console, but we want to support it in the default console. We are making the final effort.

tech98321469320842 commented 5 years ago

@niusmallnan What is the current status for the nVidia CUDA integration? Is it possible to deploy it in some way?

tech98321469320842 commented 5 years ago

@niusmallnan When will version 1.5.1 be released?

niusmallnan commented 5 years ago

@tech98321469320842 At the end of Feb.

niusmallnan commented 5 years ago

I want to integrate nvidia-docker2, which is currently mainly related to these projects.

They provide almost exclusively deb and rpm packages, and installing from binaries seems difficult. So at this stage I don't plan to support it in the default console; I will give priority to supporting it in the Ubuntu console.

Usually we just need to add the apt source and install the corresponding deb, but that is a problem in ROS: nvidia-docker2 depends on the Docker deb packages, and ROS does not use debs to manage Docker.

# https://github.com/NVIDIA/nvidia-docker/blob/master/debian/control
Package: nvidia-docker2
Architecture: all
Breaks: nvidia-docker (<< 2.0.0)
Replaces: nvidia-docker (<< 2.0.0)
Depends: ${misc:Depends}, nvidia-container-runtime (= @RUNTIME_VERSION@), @DOCKER_VERSION@

So I can build a custom nvidia-docker2 package that simply removes this dependency.

Boot a VM (Ubuntu 18.04) and build the package after applying this patch:

diff --git a/debian/control b/debian/control
index d06d85f..86d2023 100644
--- a/debian/control
+++ b/debian/control
@@ -12,7 +12,7 @@ Package: nvidia-docker2
 Architecture: all
 Breaks: nvidia-docker (<< 2.0.0)
 Replaces: nvidia-docker (<< 2.0.0)
-Depends: ${misc:Depends}, nvidia-container-runtime (= @RUNTIME_VERSION@), @DOCKER_VERSION@
+Depends: ${misc:Depends}, nvidia-container-runtime (= @RUNTIME_VERSION@)
 Description: nvidia-docker CLI wrapper
  Replaces nvidia-docker with a new implementation based on
  nvidia-container-runtime
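
If carrying a patch file is inconvenient, the same one-line change could probably be made with sed. This is only a sketch: strip_docker_dep is a hypothetical helper name, and it assumes the Depends line ends with ", @DOCKER_VERSION@" exactly as in the patch above.

```shell
#!/bin/sh
# Drop the trailing ", @DOCKER_VERSION@" entry from the Depends line and leave
# every other line untouched. Reads a control file on stdin, writes to stdout.
# strip_docker_dep is a hypothetical helper name used only in this sketch.
strip_docker_dep() {
    sed '/^Depends:/ s/, @DOCKER_VERSION@$//'
}

# Demonstrate on the Depends line quoted in the patch (single quotes keep the
# shell from expanding ${misc:Depends}).
printf '%s\n' 'Depends: ${misc:Depends}, nvidia-container-runtime (= @RUNTIME_VERSION@), @DOCKER_VERSION@' \
    | strip_docker_dep
```

Against a checkout, the function would be fed debian/control on stdin and its output written to a new file that then replaces the original.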

Then run make 18.06.1-ubuntu18.04; replace 18.06.1 if you want to target a different Docker version.

Boot a ROS (v1.5.0) instance and add the nvidia-docker repo. This has to be done from the Ubuntu console:

ros console switch ubuntu

apt update && apt install gnupg

curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | \
  sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/ubuntu18.04/nvidia-docker.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-docker.list
apt update

Install packages

apt install nvidia-container-runtime=2.0.0+docker18.06.1-1

# install your custom nvidia-docker2 package
dpkg -i nvidia-docker2_2.0.3+docker18.06.1-1_all.deb

rkdgo commented 5 years ago

Tesla K80 installed

lspci | grep NVIDIA
04:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
05:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)

Is there a special way of installing the NVIDIA driver on RancherOS 1.5.0?

docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi

reports an error: docker: Error response from daemon: OCI runtime create failed: container_linux.go:348: starting container process caused "process_linux.go:402: container init caused \"process_linux.go:385: running prestart hook 1 caused \\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig.real --device=all --compute --utility --require=cuda>=10.0 brand=tesla,driver>=384,driver<385 --pid=8281 /var/lib/docker/overlay2/9b72f828525e4a83bd6084006f7caf8af91ad99b8249cc50e89de7053f24462e/merged]\\nnvidia-container-cli: initialization error: cuda error: no cuda-capable device is detected\\n\\"\"": unknown.

mathieupost commented 5 years ago

I got a similar error

lspci | grep NVIDIA
01:00.0 3D controller: NVIDIA Corporation GK106M [GeForce GTX 765M] (rev a1)
docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi
docker: Error response from daemon: OCI runtime create failed: container_linux.go:344: starting container process caused "process_linux.go:424: container init caused \"process_linux.go:407: running prestart hook 1 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig.real --device=all --compute --utility --require=cuda>=9.0 --pid=8168 /var/lib/docker/overlay2/514c92b8bb41862f9810364512638163ada28875a193b4f4200e7d6563ee15ac/merged]\\\\nnvidia-container-cli: initialization error: driver error: failed to process request\\\\n\\\"\"": unknown.

although the end of the error is different: initialization error: driver error: failed to process request

EDIT: It turned out the NVIDIA driver was not correctly installed for me. I can't install it correctly because the nouveau kernel module gets loaded on every boot, even though it is blacklisted by /etc/modprobe.d/nvidia-installer-disable-nouveau.conf
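
For anyone wanting to confirm the same symptom, a quick check of whether nouveau is actually loaded after boot might look like the sketch below. It assumes /proc/modules is readable and lists the module name as the first field of each line; module_loaded is a hypothetical helper name.

```shell
#!/bin/sh
# Succeed if the named module appears in a /proc/modules-style listing read
# from stdin (the module name is the first whitespace-separated field).
# module_loaded is a hypothetical helper name used only in this sketch.
module_loaded() {
    awk -v mod="$1" '$1 == mod { found = 1 } END { exit !found }'
}

if [ -r /proc/modules ] && module_loaded nouveau < /proc/modules; then
    echo "nouveau is loaded"
else
    echo "nouveau is not loaded"
fi
```

This only reports the current state; it does not explain why the blacklist file is being ignored.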

lygstate commented 5 years ago

@niusmallnan So it's fixed now?

turley commented 5 years ago

I went through the steps outlined by @niusmallnan above and kept running into the following error:

nvidia-container-cli: ldcache error: process /sbin/ldconfig.real failed with error code: 1

After a little digging, I found that it's failing when trying to pivot_root here: https://github.com/NVIDIA/libnvidia-container/blob/deccb2801502675bd283c6936861814dbca99ecd/src/nvc_ldcache.c#L117

I'm not sure why it's failing there or how to fix it, but thought I'd share my findings in case it helps someone else narrow down the issue.

kidhasmoxy commented 5 years ago

@davidhyman did you get this to work on rancher OS using the patched package?

tobylo commented 5 years ago

Is this issue still being worked on and if so, any update on the status?

Confusingboat commented 5 years ago

+1 interested in the status of this issue

stlaurentc commented 4 years ago

+1 interested in the status of this issue

NM4 commented 4 years ago

+1 interested in the status of this issue

andrew-mcgrath commented 4 years ago

+1 interested in the status of this issue

user-name-is-taken commented 4 years ago

Looking through the Rancher docs, I found this page that talks about scheduling pods to nodes with GPUs, for what it's worth.

piersdd commented 4 years ago

+1 interested in the status of this issue

redbaron-gt commented 4 years ago

+1 interested in the status of this issue