livebook-dev / livebook

Automate code & data workflows with interactive Elixir notebooks
https://livebook.dev
Apache License 2.0

Unable to run recent CUDA-enabled docker image against GPU #1584

Closed: xrd closed this issue 1 year ago

xrd commented 1 year ago

I'm trying to use the latest Docker image to run a neural network example on the GPU via CUDA.

Environment

- Docker image: livebook/livebook:latest (digest sha256:a61ce1bfa5fb17b43b77590af2c77a7207c337f2a267fe4c5c95379b24299d08)
- Elixir 1.14.2 / ERTS 12.3.2.2 (from the install cache paths below)
- Host GPU: NVIDIA GeForce GTX 1060 6GB, driver 520.56.06, CUDA 11.8 (see nvidia-smi output below)

Current behavior

Start the Docker container with GPU access enabled:

sudo docker run -p 8080:8080 -p 8081:8081 --gpus all --pull always -e LIVEBOOK_PASSWORD="securesecret" livebook/livebook
latest: Pulling from livebook/livebook
Digest: sha256:a61ce1bfa5fb17b43b77590af2c77a7207c337f2a267fe4c5c95379b24299d08
Status: Image is up to date for livebook/livebook:latest
[Livebook] Application running at http://0.0.0.0:8080/

Go to http://localhost:8080.

Go to the Livebook settings and set XLA_TARGET to cuda118.
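
Once a notebook is open, the variable can be sanity-checked from a code cell (my own check; it assumes Livebook exports the setting to the notebook runtime):

# Should return "cuda118" if the setting propagated to the runtime.
System.get_env("XLA_TARGET")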

Create a new notebook. Add the Mix install code:

Mix.install(
  [
    {:kino_bumblebee, "~> 0.1.0"},
    {:exla, "~> 0.4.1"}
  ],
  config: [nx: [default_backend: EXLA.Backend, client: :cuda]]
)
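
I'm not sure the client: :cuda option is picked up where I placed it; a variant that I believe matches the Nx convention (untested) passes it as a backend option in the backend tuple:

# Untested variant: attach the client option to the backend tuple.
Mix.install(
  [
    {:kino_bumblebee, "~> 0.1.0"},
    {:exla, "~> 0.4.1"}
  ],
  config: [nx: [default_backend: {EXLA.Backend, client: :cuda}]]
)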

The debug output shows it downloading the CUDA-enabled version of XLA for Linux:

...
Generated tokenizers app
==> complex
Compiling 2 files (.ex)
Generated complex app
==> nx
Compiling 29 files (.ex)
Generated nx app

19:28:42.502 [info] Found a matching archive (xla_extension-x86_64-linux-cuda118.tar.gz), going to download it

19:28:56.426 [info] Successfully downloaded the XLA archive
==> exla
Unpacking /home/livebook/.cache/xla/0.4.1/cache/download/xla_extension-x86_64-linux-cuda118.tar.gz into /home/livebook/.cache/mix/installs/elixir-1.14.2-erts-12.3.2.2/ebbef9fe980d37896f70eb44794d54a7/deps/exla/cache
g++ -fPIC -I/usr/local/lib/erlang/erts-12.3.2.2/include -Icache/xla_extension/include -O3 -Wall -Wno-sign-compare -Wno-unused-parameter -Wno-missing-field-initializers -Wno-comment -shared -std=c++17 -w -DLLVM_ON_UNIX=1 c_src/exla/exla.cc c_src/exla/exla_nif_util.cc c_src/exla/exla_client.cc -o cache/libexla.so -Lcache/xla_extension/lib -lxla_extension -Wl,-rpath,'$ORIGIN/lib'
Compiling 21 files (.ex)
Generated exla app
==> kino
Compiling 37 files (.ex)
...

Once :ok is displayed, add a "Neural Network" smart cell, select txt2image, and click Evaluate.
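
Before evaluating, the effective backend can be double-checked from a code cell (again my own addition, not part of the smart cell flow):

# Reports the backend new tensors will use; with the config above it
# should mention EXLA.Backend.
Nx.default_backend()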

EXLA sees the GPU, but then fails with:

18:01:22.741 [info] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero

18:01:22.741 [info] XLA service 0x7fa250017de0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:

18:01:22.741 [info]   StreamExecutor device (0): NVIDIA GeForce GTX 1060 6GB, Compute Capability 6.1

18:01:22.741 [info] Using BFC allocator.

18:01:22.741 [info] XLA backend allocating 5659833139 bytes on device 0 for BFCAllocator.

18:01:23.571 [warn] Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice. This may result in compilation or runtime failures, if the program we try to run uses routines from libdevice.
Searched for CUDA in the following directories:
  ./cuda_sdk_lib
  /usr/local/cuda-11.8
  /usr/local/cuda
  .
You can choose the search directory by setting xla_gpu_cuda_data_dir in HloModule's DebugOptions.  For most apps, setting the environment variable XLA_FLAGS=--xla_gpu_cuda_data_dir=/path/to/cuda will work.

On the console we see this:

17:47:46.749 [debug] Downloading NIF from https://github.com/elixir-nx/tokenizers/releases/download/v0.2.0/libex_tokenizers-v0.2.0-nif-2.16-x86_64-unknown-linux-gnu.so.tar.gz

17:47:47.206 [debug] NIF cached at /home/livebook/.cache/rustler_precompiled/precompiled_nifs/libex_tokenizers-v0.2.0-nif-2.16-x86_64-unknown-linux-gnu.so.tar.gz and extracted to /home/livebook/.cache/mix/installs/elixir-1.14.2-erts-12.3.2.2/c25e041b25205711d20d9e43d9305779/_build/dev/lib/tokenizers/priv/native/libex_tokenizers-v0.2.0-nif-2.16-x86_64-unknown-linux-gnu.so

17:47:50.313 [info] Found a matching archive (xla_extension-x86_64-linux-cuda118.tar.gz), going to download it
--2022-12-13 17:47:50--  https://github.com/elixir-nx/xla/releases/download/v0.4.1/xla_extension-x86_64-linux-cuda118.tar.gz
Resolving github.com (github.com)... 140.82.112.4
Connecting to github.com (github.com)|140.82.112.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://objects.githubusercontent.com/github-production-release-asset-2e65be/404832884/dbb643d0-4c9d-4f21-beba-85d4d39c63b0?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20221213%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20221213T174750Z&X-Amz-Expires=300&X-Amz-Signature=9e1175537694e7126aec7c055bb527074e146feb35b611dbf363e27173a3ba69&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=404832884&response-content-disposition=attachment%3B%20filename%3Dxla_extension-x86_64-linux-cuda118.tar.gz&response-content-type=application%2Foctet-stream [following]
--2022-12-13 17:47:50--  https://objects.githubusercontent.com/github-production-release-asset-2e65be/404832884/dbb643d0-4c9d-4f21-beba-85d4d39c63b0?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20221213%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20221213T174750Z&X-Amz-Expires=300&X-Amz-Signature=9e1175537694e7126aec7c055bb527074e146feb35b611dbf363e27173a3ba69&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=404832884&response-content-disposition=attachment%3B%20filename%3Dxla_extension-x86_64-linux-cuda118.tar.gz&response-content-type=application%2Foctet-stream
Resolving objects.githubusercontent.com (objects.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.110.133, ...
Connecting to objects.githubusercontent.com (objects.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 190627018 (182M) [application/octet-stream]
Saving to: ‘/home/livebook/.cache/xla/0.4.1/cache/download/xla_extension-x86_64-linux-cuda118.tar.gz’

0K .......... .......... .......... .......... ..........  0% 3.64M 50s
...
186150K .........                                             100%  646M=5.6s

2022-12-13 17:47:56 (32.4 MB/s) - ‘/home/livebook/.cache/xla/0.4.1/cache/download/xla_extension-x86_64-linux-cuda118.tar.gz’ saved [190627018/190627018]

17:47:56.424 [info] Successfully downloaded the XLA archive
2022-12-13 17:48:27.986283: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-12-13 17:48:27.986309: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.

17:49:22.248 [info] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero

17:49:22.248 [info] XLA service 0x7f09380093c0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:

17:49:22.248 [info]   StreamExecutor device (0): NVIDIA GeForce GTX 1060 6GB, Compute Capability 6.1

17:49:22.248 [info] Using BFC allocator.

17:49:22.248 [info] XLA backend allocating 5659833139 bytes on device 0 for BFCAllocator.

17:49:34.263 [warning] Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice. This may result in compilation or runtime failures, if the program we try to run uses routines from libdevice.
Searched for CUDA in the following directories:
  ./cuda_sdk_lib
  /usr/local/cuda-11.8
  /usr/local/cuda
  .
You can choose the search directory by setting xla_gpu_cuda_data_dir in HloModule's DebugOptions.  For most apps, setting the environment variable XLA_FLAGS=--xla_gpu_cuda_data_dir=/path/to/cuda will work.

17:49:34.451 [info] Start cannot spawn child process: No such file or directory

17:49:34.451 [info] Start cannot spawn child process: No such file or directory

17:49:34.451 [warning] Couldn't get ptxas version string: INTERNAL: Couldn't invoke ptxas --version
[FATAL] tensorflow/compiler/xla/service/gpu/nvptx_compiler.cc:454 ptxas returned an error during compilation of ptx to sass: 'INTERNAL: Failed to launch ptxas'  If the error message indicates that a file could not be written, please verify that sufficient filesystem space is provided.

18:00:34.297 [debug] Copying NIF from cache and extracting to /home/livebook/.cache/mix/installs/elixir-1.14.2-erts-12.3.2.2/ebbef9fe980d37896f70eb44794d54a7/_build/dev/lib/tokenizers/priv/native/libex_tokenizers-v0.2.0-nif-2.16-x86_64-unknown-linux-gnu.so

2022-12-13 18:01:08.594294: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-12-13 18:01:08.594336: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.

18:01:22.741 [info] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero

18:01:22.741 [info] XLA service 0x7fa250017de0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:

18:01:22.741 [info]   StreamExecutor device (0): NVIDIA GeForce GTX 1060 6GB, Compute Capability 6.1

18:01:22.741 [info] Using BFC allocator.

18:01:22.741 [info] XLA backend allocating 5659833139 bytes on device 0 for BFCAllocator.

18:01:23.571 [warning] Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice. This may result in compilation or runtime failures, if the program we try to run uses routines from libdevice.
Searched for CUDA in the following directories:
  ./cuda_sdk_lib
  /usr/local/cuda-11.8
  /usr/local/cuda
  .
You can choose the search directory by setting xla_gpu_cuda_data_dir in HloModule's DebugOptions.  For most apps, setting the environment variable XLA_FLAGS=--xla_gpu_cuda_data_dir=/path/to/cuda will work.

18:01:23.762 [info] Start cannot spawn child process: No such file or directory

18:01:23.762 [info] Start cannot spawn child process: No such file or directory

18:01:23.762 [warning] Couldn't get ptxas version string: INTERNAL: Couldn't invoke ptxas --version
[FATAL] tensorflow/compiler/xla/service/gpu/nvptx_compiler.cc:454 ptxas returned an error during compilation of ptx to sass: 'INTERNAL: Failed to launch ptxas'  If the error message indicates that a file could not be written, please verify that sufficient filesystem space is provided.

18:01:23.762 [info] Start cannot spawn child process: No such file or directory

Expected behavior

It should let me use the GPU.

NB: I do have a GPU:

$ nvidia-smi 
Tue Dec 13 19:21:50 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.56.06    Driver Version: 520.56.06    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
|  0%   36C    P2    22W / 120W |     15MiB /  6144MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     14189      G   ...xorg-server-1.20.14/bin/X        9MiB |
|    0   N/A  N/A     14217      G   ...hell-43.1/bin/gnome-shell        2MiB |
+-----------------------------------------------------------------------------+

Also, I can run other docker images against the GPU:

$ git remote -v
origin  https://github.com/fboulnois/stable-diffusion-docker.git (fetch)
origin  https://github.com/fboulnois/stable-diffusion-docker.git (push)
$ ./build.sh dev
$ python -c "import torch; print(torch.cuda.is_available())"
True
josevalim commented 1 year ago

The Docker image is not compiled with CUDA, but we just released a cuda tag. Can you please try that instead?

xrd commented 1 year ago

@josevalim Yes, thank you. I saw the recent PR land on main, so I assumed the change would be in the regular Docker image. I'll try the cuda tag and report back.

xrd commented 1 year ago

@josevalim Thank you. The cuda image works, but then it runs out of memory.

I tried the mixed-precision example from https://github.com/elixir-nx/bumblebee/issues/101#issuecomment-1344803404, but I'm not sure I converted the code correctly; it is not working yet. I'll keep experimenting. Thanks for your assistance!

repository_id = "CompVis/stable-diffusion-v1-4"
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "openai/clip-vit-large-patch14"})

{:ok, clip} =
  Bumblebee.load_model({:hf, repository_id, subdir: "text_encoder"},
    log_params_diff: false
  )

{:ok, unet} =
  Bumblebee.load_model({:hf, repository_id, subdir: "unet"},
    params_filename: "diffusion_pytorch_model.bin",
    log_params_diff: false
  )

##
## Added according to the comment in bumblebee issue 101
##
policy = Axon.MixedPrecision.create_policy(compute: :f16)

{:ok, vae} =
  Bumblebee.load_model({:hf, repository_id, subdir: "vae"},
    architecture: :decoder,
    params_filename: "diffusion_pytorch_model.bin",
    log_params_diff: false
  )

##
## Added according to the comment in bumblebee issue 101
##
vae = %{vae | model: Axon.MixedPrecision.apply_policy(vae.model, policy)}

{:ok, scheduler} = Bumblebee.load_scheduler({:hf, repository_id, subdir: "scheduler"})

{:ok, featurizer} =
  Bumblebee.load_featurizer({:hf, repository_id, subdir: "feature_extractor"})

{:ok, safety_checker} =
  Bumblebee.load_model({:hf, repository_id, subdir: "safety_checker"},
    log_params_diff: false
  )

serving =
  Bumblebee.Diffusion.StableDiffusion.text_to_image(clip, unet, vae, tokenizer, scheduler,
    num_steps: 3,
    num_images_per_prompt: 1,
    safety_checker: safety_checker,
    safety_checker_featurizer: featurizer,
    compile: [batch_size: 1, sequence_length: 5],
    defn_options: [compiler: EXLA]
  )
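
For reference, this is how the serving is then invoked (the prompt is just an example):

# Run the serving with a sample prompt; with num_steps: 3 the output is
# low quality, but it exercises the full GPU pipeline.
Nx.Serving.run(serving, "a photo of an astronaut riding a horse")
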
josevalim commented 1 year ago

Yeah, we have done zero optimizations, so we hope there is a bunch to gain as we explore it!