livebook-dev / livebook

Automate code & data workflows with interactive Elixir notebooks
https://livebook.dev
Apache License 2.0

livebook:latest-cuda12.1 Docker image showing warning missing NVIDIA driver #2359

Closed jonastemplestein closed 10 months ago

jonastemplestein commented 11 months ago

I'm trying to deploy the livebook:latest-cuda12.1 image to a fly.io host with an A100 GPU.

When I boot up the container, I get this big warning banner saying CUDA support will not be available (and the image is deprecated but that doesn't matter):

2023-11-16T12:53:07.064 app[328749d1b59085] ams [info] ==========
2023-11-16T12:53:07.064 app[328749d1b59085] ams [info] == CUDA ==
2023-11-16T12:53:07.064 app[328749d1b59085] ams [info] ==========
2023-11-16T12:53:07.071 app[328749d1b59085] ams [info] CUDA Version 12.1.0
2023-11-16T12:53:07.073 app[328749d1b59085] ams [info] Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
2023-11-16T12:53:07.074 app[328749d1b59085] ams [info] This container image and its contents are governed by the NVIDIA Deep Learning Container License.
2023-11-16T12:53:07.074 app[328749d1b59085] ams [info] By pulling and using the container, you accept the terms and conditions of this license:
2023-11-16T12:53:07.074 app[328749d1b59085] ams [info] https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
2023-11-16T12:53:07.074 app[328749d1b59085] ams [info] A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
2023-11-16T12:53:07.087 app[328749d1b59085] ams [info] WARNING: The NVIDIA Driver was not detected. GPU functionality will not be available.
2023-11-16T12:53:07.088 app[328749d1b59085] ams [info] Use the NVIDIA Container Toolkit to start this container with GPU support; see
2023-11-16T12:53:07.088 app[328749d1b59085] ams [info] https://docs.nvidia.com/datacenter/cloud-native/ .
2023-11-16T12:53:07.088 app[328749d1b59085] ams [info] *************************
2023-11-16T12:53:07.088 app[328749d1b59085] ams [info] ** DEPRECATION NOTICE! **
2023-11-16T12:53:07.088 app[328749d1b59085] ams [info] *************************
2023-11-16T12:53:07.088 app[328749d1b59085] ams [info] THIS IMAGE IS DEPRECATED and is scheduled for DELETION.
2023-11-16T12:53:07.088 app[328749d1b59085] ams [info] https://gitlab.com/nvidia/container-images/cuda/blob/master/doc/support-policy.md

Is this expected? If I open a shell in the running container and run the following command, it looks like the driver might be working as intended:

root@328749d1b59085:/data# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0

Is the message in the banner on startup expected?

jonatanklosko commented 11 months ago

> The NVIDIA Driver was not detected.

This usually pops up if the GPU is not detected. To double-check, you can open IEx, run Mix.install([:exla]); EXLA.Client.default_name(), and see whether it returns :cuda or :host.
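The same check can be scripted from a regular shell inside the container. This is a minimal sketch, not an official Livebook command: it assumes elixir is on the PATH and that the container can download :exla from Hex, and the helper names exla_default_client and interpret_exla_client are made up for illustration.

```shell
# Hypothetical helper: print EXLA's default client without opening IEx.
# Mix may print install/compile logs first, so keep only the last line.
exla_default_client() {
  elixir -e 'Mix.install([:exla]); IO.inspect(EXLA.Client.default_name())' | tail -n 1
}

# Hypothetical helper: translate the client name into a plain-language verdict.
interpret_exla_client() {
  case "$1" in
    :cuda) echo "GPU detected: EXLA defaults to the CUDA client" ;;
    :host) echo "GPU not detected: EXLA falls back to the CPU (host) client" ;;
    *)     echo "Unexpected client: $1" ;;
  esac
}
```

Running interpret_exla_client "$(exla_default_client)" would then print one of the verdicts above.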

If it's not detected, you can also try the 11.8 image, just in case the driver does not support the latest CUDA. I'm mostly guessing though; the issue may lie somewhere at a lower level.
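One way to look at that lower level is to check whether the NVIDIA driver itself is visible inside the container. Note that nvcc --version only reports the CUDA toolkit bundled in the image, so it says nothing about the driver. Below is a rough sketch, assuming a Linux container. check_nvidia_driver is a hypothetical name, and depending on how the host exposes the GPU, nvidia-smi may be missing even when a driver is loaded.

```shell
# Hypothetical helper: report whether the NVIDIA driver (not just the CUDA
# toolkit) is visible in the current environment.
check_nvidia_driver() {
  if command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi >/dev/null 2>&1; then
    # nvidia-smi exists and can talk to the driver.
    echo "driver: visible (nvidia-smi works)"
  elif [ -r /proc/driver/nvidia/version ]; then
    # The kernel module is loaded, but the user-space tool is unavailable.
    echo "driver: kernel module loaded, but nvidia-smi is unavailable"
  else
    echo "driver: not visible in this environment"
  fi
}

check_nvidia_driver
```

If this prints "not visible", the startup warning is probably accurate and the container was started without GPU access.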

> and the image is deprecated but that doesn't matter

That's OK. NVIDIA now has a policy of deprecating images whenever they build a new version, and they remove the deprecated images after 6 months. The removal is not an issue, because the Livebook CUDA image is not going to disappear.

josevalim commented 10 months ago

Closing this. If it persists, please let us know!

PranavRam commented 6 months ago

Hello, I'm using CUDA 12.4 and cuDNN 8.9 with WSL on Windows and the latest-cuda11.8 image.

With the 12.1 image, I don't see cuda in the list of supported platforms when running the command above, but with 11.8, I do get this:

iex(2)> EXLA.Client.get_supported_platforms()
%{host: 12, cuda: -1, interpreter: 1}

iex(3)> EXLA.Client.default_name()
:host

I assume that the Stable Diffusion job won't run on the GPU, given that the default client is :host instead of :cuda? I originally tried the latest version of cuDNN (9.0), but then I didn't even see cuda: -1 in the list of supported platforms. I've also run the NVIDIA benchmark checks in Docker (WSL with Ubuntu), and they do show my GPU.

Should I be trying another version of CUDA or cuDNN?

jonatanklosko commented 6 months ago

@PranavRam are there any relevant warnings in the logs?

> With the 12.1 image, I don't see cuda in the list of supported platforms

This is weird. Are you sure you set XLA_TARGET=cuda120? You can run Mix.install([:exla], force: true) and double-check in the logs that it downloads the CUDA XLA archive.
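A quick way to confirm the variable before reinstalling is sketched below. It assumes a POSIX shell inside the container. describe_xla_target is a hypothetical helper, and the statement that an unset XLA_TARGET means a CPU build reflects the xla package's documented default rather than anything stated in this thread.

```shell
# Hypothetical helper: describe what the current XLA_TARGET value asks for.
describe_xla_target() {
  case "${XLA_TARGET:-}" in
    cuda*) echo "XLA_TARGET=${XLA_TARGET}: a CUDA build of XLA is requested" ;;
    "")    echo "XLA_TARGET is unset: the CPU build of XLA is used" ;;
    *)     echo "XLA_TARGET=${XLA_TARGET}: not a CUDA target" ;;
  esac
}

describe_xla_target
```

If the output does not mention a CUDA build, exporting XLA_TARGET=cuda120 before the forced reinstall is the first thing to try.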

> Should I be trying another version of CUDA or cuDNN?

When running the Docker container, both CUDA and cuDNN are already installed in the container, so it doesn't matter which versions you have installed on the host system. Only the host drivers must be compatible with the CUDA version in the image.
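To see which driver version the container is actually getting, something like the following could help. It is only a sketch: driver_major_at_least is a hypothetical helper, the 525 threshold is an illustrative value for CUDA 12.x on Linux that should be checked against NVIDIA's CUDA release notes, and it assumes nvidia-smi is available in the container.

```shell
# Hypothetical helper: succeed if the driver's major version is >= a minimum.
# $1 = driver version string such as "535.104.05", $2 = minimum major version.
driver_major_at_least() {
  major=${1%%.*}   # keep everything before the first dot
  [ "$major" -ge "$2" ] 2>/dev/null
}

# Query the driver version only when nvidia-smi is actually present.
if command -v nvidia-smi >/dev/null 2>&1; then
  version=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
  # 525 is an assumed threshold, not a value taken from this thread.
  if driver_major_at_least "$version" 525; then
    echo "driver $version meets the assumed minimum"
  else
    echo "driver $version is older than the assumed minimum"
  fi
else
  echo "nvidia-smi not found; cannot read the driver version"
fi
```

Comparing only the major version is a simplification, but it is usually enough to spot a driver that is clearly too old for the image.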

Have you tried running IEx in WSL directly, rather than in the Docker image? Did that behave differently?