Closed RobQuistNL closed 7 months ago
nvidia-smi both work fine inside the machine.
Do you mean in the container? Could you provide output from `nvidia-smi -q`? Specifically from within the container. As for `ngpus`, there is a filter based on compute capability version: GPUs below compute capability 7 are skipped.
Yes, I mean inside the container - edited my post. It is based on `nvidia/cuda:12.3.1-devel-ubuntu22.04`.
Here's the output of `nvidia-smi -q` from within the same running Docker instance that built the binaries (and can run the regular CUDA build, but not the cuda-supraseal feature version). I've trimmed the lower parts - but could it be that the 2080 Ti I'm testing this on doesn't have compute capability 7 or higher?
Timestamp : Fri Feb 9 18:28:35 2024
Driver Version : 535.154.05
CUDA Version : 12.3
Attached GPUs : 1
GPU 00000000:0A:00.0
Product Name : NVIDIA GeForce RTX 2080 Ti
Product Brand : GeForce
Product Architecture : Turing
Display Mode : Enabled
Display Active : Enabled
Persistence Mode : Disabled
Addressing Mode : None
If that's the case, this should run fine on an A40 or 4090 - I'll test the image and see how it runs on some newer cards.
FYI, here's the `ldd` output as well:
ldd /usr/bin/c2-sealer
linux-vdso.so.1 (0x00007ffcde9f8000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f52e3ecd000)
libcuda.so.1 => /usr/lib/x86_64-linux-gnu/libcuda.so.1 (0x00007f52e2249000)
libgcc_s.so.1 => /usr/lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f52e2229000)
libm.so.6 => /usr/lib/x86_64-linux-gnu/libm.so.6 (0x00007f52e2142000)
libc.so.6 => /usr/lib/x86_64-linux-gnu/libc.so.6 (0x00007f52e1f19000)
/lib64/ld-linux-x86-64.so.2 (0x00007f52e53cd000)
libdl.so.2 => /usr/lib/x86_64-linux-gnu/libdl.so.2 (0x00007f52e1f12000)
libpthread.so.0 => /usr/lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f52e1f0d000)
librt.so.1 => /usr/lib/x86_64-linux-gnu/librt.so.1 (0x00007f52e1f08000)
The 2080 is fine, it's compute capability 7.5.
EDIT: Nope, can't be that...
Binary inside the container:
Starting C2
thread 'actix-rt|system:0|arbiter:0' panicked at src/c2.rs:71:87:
called `Result::unwrap()` on an `Err` value: No CUDA devices available
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Same binary (copied out using `docker cp`), running on the host:
Starting C2
(I quit it because this host doesn't have >128 GB of RAM :+) )
Could you save the following as a.cu (or whichever name you prefer), compile it with nvcc, and execute it in the container?
#include <iostream>

int main()
{
    int n;
    std::cout << cudaGetDeviceCount(&n) << std::endl;
    std::cout << n << std::endl;
}
In the container:
./testexec
804
32638
Outside:
./testexec
0
1
EDIT: since this might be some forward-compatibility issue (what I gathered from Googling), there is a difference between the host and container CUDA versions:
HOST:
NVIDIA-SMI 535.154.05 Driver Version: 535.154.05 CUDA Version: 12.2
Container:
NVIDIA-SMI 535.154.05 Driver Version: 535.154.05 CUDA Version: 12.3
804 stands for `cudaErrorCompatNotSupportedOnDevice`. (And 32638 is just the value of the uninitialized `n`.) The error description is:
/**
* This error indicates that the system was upgraded to run with forward compatibility
* but the visible hardware detected by CUDA does not support this configuration.
* Refer to the compatibility documentation for the supported hardware matrix or ensure
* that only supported hardware is visible during initialization via the CUDA_VISIBLE_DEVICES
* environment variable.
*/
On my side it failed with 35, `cudaErrorInsufficientDriver`, in the referred container (which doesn't have `nvidia-smi`!?). `strace`-ing revealed that it failed to find libcuda.so.1, which turned up in an unusual location: /usr/local/cuda/compat. If executed with `LD_LIBRARY_PATH=/usr/local/cuda/compat`, the snippet works, as do some simple test programs...
As for your HOST vs. container: the CUDA versions are different! Is it possible that the host version has to be no lower than the container's? (Just in case, for reference, I have 12.3 on the host side.)
That might be the case - I'm trying to upgrade my host to 12.3 now (I'd still find it weird, but who knows). Is there any particular `strace` output you'd be interested in?
EDIT: In the container it looks like it finds it just fine:
mmap(NULL, 14883, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fdd040fe000
close(3) = 0
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libcuda.so.1", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0p-\16\0\0\0\0\0"..., 832) = 832
newfstatat(3, "", {st_mode=S_IFREG|0644, st_size=29453200, ...}, AT_EMPTY_PATH) = 0
mmap(NULL, 29898688, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fdd01f15000
> In the container it looks like it finds it just fine;
Well, I don't have it there. For reference, I simply pulled the referred container and passed a few `--device` arguments. Anyway, resolve your 804...
Aah right, you might need to do an additional `apt install nvidia-cuda-toolkit -y` in the container to get that working.
EDIT: My running container actually doesn't even have that :+)
Now I'm stuck in dpkg hell with all the NVIDIA drivers, so I'm going to be stuck here for a while...
The only thing that is unclear to me is how come "the regular (non-supraseal) CUDA implementation of filecoin-proofs works fine ... within the docker container." Is it possible that it falls back to CPU code? So it's not like it actually utilized the GPU...
Hmm, you might be right about that - the groth16 step just explodes because this machine only has 128 GB - let me try with 512MiB sectors instead of 32GiB.
I'm trying to run supraseal within a Docker container.
Compilation within the container goes great - the compiled binary even runs smoothly on the host machine. However, when I run the binary from within the container, the groth implementation throws an error:
No CUDA devices available
Now I'd say this is some missing library or link somewhere, but the weird thing is, `clinfo` and `nvidia-smi` both work fine inside the container. Even the regular (non-supraseal) CUDA implementation of `filecoin-proofs` works fine when compiled and run within the Docker container. I'm not sure what I'm missing or how that `ngpus` function works..