supranational / supra_seal

No CUDA devices available in container #42

Closed RobQuistNL closed 7 months ago

RobQuistNL commented 8 months ago

I'm trying to run supraseal within a docker container.

Compilation within the container goes great - the compiled binary even runs smoothly on the host machine. However, when I run the binary from within the container, the groth implementation throws an error: No CUDA devices available

Now I'd say this is some missing library or link somewhere, but the weird thing is, clinfo and nvidia-smi both work fine inside the container. Even the regular (non supraseal) CUDA implementation of filecoin-proofs works fine when compiled and run within the docker container.

I'm not sure what I'm missing, or how that ngpus function works...

dot-asm commented 8 months ago

nvidia-smi both work fine inside the machine.

Do you mean in the container? Could you provide output from nvidia-smi -q? Specifically from within the container. As for ngpus, there is a filter based on compute capability version; GPUs below compute capability 7 are skipped.
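
For illustration only (this is not the actual ngpus code, just a minimal sketch of the kind of filter described above, using the standard CUDA runtime API):

#include <cuda_runtime.h>
#include <iostream>

int main()
{
    int total = 0, usable = 0;
    if (cudaGetDeviceCount(&total) != cudaSuccess)
        total = 0;
    for (int i = 0; i < total; i++) {
        cudaDeviceProp prop;
        // skip devices below compute capability 7.x, as described above
        if (cudaGetDeviceProperties(&prop, i) == cudaSuccess && prop.major >= 7)
            usable++;
    }
    std::cout << usable << " of " << total << " device(s) usable" << std::endl;
}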

RobQuistNL commented 8 months ago

Yes, I mean inside the container - I've edited my post. It is based on nvidia/cuda:12.3.1-devel-ubuntu22.04

Here's the output of nvidia-smi -q from within the same running docker instance that built the binaries (and can run regular CUDA, but not the cuda-supraseal feature version):

I've trimmed the lower parts - but could it be that the 2080 Ti I'm testing this on doesn't have compute capability 7 or higher?

Timestamp                                 : Fri Feb  9 18:28:35 2024
Driver Version                            : 535.154.05
CUDA Version                              : 12.3

Attached GPUs                             : 1
GPU 00000000:0A:00.0
    Product Name                          : NVIDIA GeForce RTX 2080 Ti
    Product Brand                         : GeForce
    Product Architecture                  : Turing
    Display Mode                          : Enabled
    Display Active                        : Enabled
    Persistence Mode                      : Disabled
    Addressing Mode                       : None

If that's the case, this should run fine on an A40 or 4090 - I'll test the image and see how it runs on some newer cards.

FYI, here's the ldd output as well:

ldd /usr/bin/c2-sealer 
        linux-vdso.so.1 (0x00007ffcde9f8000)
        libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f52e3ecd000)
        libcuda.so.1 => /usr/lib/x86_64-linux-gnu/libcuda.so.1 (0x00007f52e2249000)
        libgcc_s.so.1 => /usr/lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f52e2229000)
        libm.so.6 => /usr/lib/x86_64-linux-gnu/libm.so.6 (0x00007f52e2142000)
        libc.so.6 => /usr/lib/x86_64-linux-gnu/libc.so.6 (0x00007f52e1f19000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f52e53cd000)
        libdl.so.2 => /usr/lib/x86_64-linux-gnu/libdl.so.2 (0x00007f52e1f12000)
        libpthread.so.0 => /usr/lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f52e1f0d000)
        librt.so.1 => /usr/lib/x86_64-linux-gnu/librt.so.1 (0x00007f52e1f08000)

dot-asm commented 8 months ago

2080 is fine, it's capability 7.5.

RobQuistNL commented 8 months ago

EDIT: Nope, can't be that...

Binary inside the container:

Starting C2
thread 'actix-rt|system:0|arbiter:0' panicked at src/c2.rs:71:87:
called `Result::unwrap()` on an `Err` value: No CUDA devices available
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Same binary (copied using docker cp), running on the host:

Starting C2

(I quit it because this host doesn't have >128GB of RAM :+) )

dot-asm commented 8 months ago

Could you save the following as a.cu (or whichever name you prefer), compile it with nvcc, and execute it in the container?

#include <cuda_runtime.h>  // nvcc includes this implicitly for .cu files; spelled out for clarity
#include <iostream>

int main()
{
   int n;  // note: left uninitialized, only meaningful if the call below succeeds
   std::cout << cudaGetDeviceCount(&n) << std::endl;  // raw cudaError_t (0 == cudaSuccess)
   std::cout << n << std::endl;                       // device count
}

RobQuistNL commented 8 months ago

In the container:

./testexec 
804
32638

Outside:

./testexec 
0
1

EDIT: since this might be some forward-compatibility issue (from what I gathered from Googling), there is a difference between host CUDA and container CUDA:

HOST:

NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2

Container:

NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.3

dot-asm commented 8 months ago

804 stands for cudaErrorCompatNotSupportedOnDevice. (And 32638 is an uninitialized variable.) The error description is:

    /**
     * This error indicates that the system was upgraded to run with forward compatibility
     * but the visible hardware detected by CUDA does not support this configuration.
     * Refer to the compatibility documentation for the supported hardware matrix or ensure
     * that only supported hardware is visible during initialization via the CUDA_VISIBLE_DEVICES
     * environment variable.
     */
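
In case it's useful, here's a small variation of the earlier snippet that prints the symbolic error name and description instead of the raw number (just a sketch; cudaGetErrorName/cudaGetErrorString are plain CUDA runtime calls, nothing supra_seal-specific):

#include <cuda_runtime.h>
#include <iostream>

int main()
{
    int n = 0;
    cudaError_t err = cudaGetDeviceCount(&n);
    // e.g. prints the symbolic name for 804 (cudaErrorCompatNotSupportedOnDevice)
    // together with a short human-readable description
    std::cout << cudaGetErrorName(err) << ": " << cudaGetErrorString(err) << std::endl;
    std::cout << n << std::endl;
}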

On my side it failed with 35, cudaErrorInsufficientDriver, in the referred container (which doesn't have nvidia-smi!?). strace-ing revealed that it failed to find libcuda.so.1, which turned out to live in an unusual place, /usr/local/cuda/compat. If executed with LD_LIBRARY_PATH=/usr/local/cuda/compat the snippet works, as do some simple test programs...

dot-asm commented 8 months ago

As for your HOST vs. container: the CUDA versions are different! Is it possible that the host version has to be not lower than the container's? (Just in case, for reference, I have 12.3 on the host side.)
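
One way to see the mismatch from inside the container is to compare the driver's supported CUDA version with the runtime the binary was built against (again just a sketch using standard runtime API calls, not something from supra_seal):

#include <cuda_runtime.h>
#include <iostream>

int main()
{
    int driver = 0, runtime = 0;
    cudaDriverGetVersion(&driver);   // highest CUDA version the installed driver supports
    cudaRuntimeGetVersion(&runtime); // CUDA runtime version this binary was built against
    // versions are encoded as 1000*major + 10*minor, e.g. 12020 for 12.2 and 12030 for 12.3;
    // runtime > driver would point at the forward-compatibility path mentioned above
    std::cout << "driver " << driver << ", runtime " << runtime << std::endl;
}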

RobQuistNL commented 8 months ago

That might be the case - I'm trying to upgrade my host to 12.3 now (I'd still find it weird, but who knows)

RobQuistNL commented 8 months ago

Is there a particular strace output you'd be interested in?

EDIT: In the container it looks like it finds it just fine;

mmap(NULL, 14883, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fdd040fe000
close(3)                                = 0
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libcuda.so.1", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0p-\16\0\0\0\0\0"..., 832) = 832
newfstatat(3, "", {st_mode=S_IFREG|0644, st_size=29453200, ...}, AT_EMPTY_PATH) = 0
mmap(NULL, 29898688, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fdd01f15000

dot-asm commented 8 months ago

In the container it looks like it finds it just fine;

Well, I don't have it there. For reference, I simply pulled the referred container and passed a few --device arguments. Anyway, resolve your 804...

RobQuistNL commented 8 months ago

Aah right, you might need to do an additional apt install nvidia-cuda-toolkit -y in the container to get that working. EDIT: My running container actually doesn't even have that :+) Now I'm stuck in dpkg hell with all the nvidia drivers, so I'm going to be stuck here for a while...

dot-asm commented 8 months ago

The only thing that is unclear to me is how come "the regular (non supraseal) CUDA implementation of filecoin-proofs works fine ... within the docker container." Is it possible that it falls back to CPU code? So it's not like it actually utilized the GPU...

RobQuistNL commented 8 months ago

Hmm, you might be right about that - the groth16 step just explodes because this machine only has 128GB - let me try with 512MiB sectors instead of 32GiB.