nerfstudio-project / nerfstudio

A collaboration friendly studio for NeRFs
https://docs.nerf.studio
Apache License 2.0

Latest docker images don't run instant-ngp on 3090 series cards. - CUDA error: no kernel image is available for execution on the device #2878

Open · kurtjcu opened this issue 9 months ago

kurtjcu commented 9 months ago

Describe the bug
Current docker images with the tags "main", "1.0.1", and "1.0.0" crash when training.

RuntimeError: CUDA error: no kernel image is available for execution on the device

To Reproduce
Steps to reproduce the behavior: run the following on a machine with a 3090 (Ubuntu server, NVIDIA driver supporting up to CUDA 12.3):

wandb docker-run \
            --gpus "device=1" \
            -u $(id -u) \
            -v "datadir:/workspace" \
            -v '/home/user/.cache/:/home/user/.cache/' \
            -p 7017:7007 \
            --rm -it --shm-size=12gb \
            dromni/nerfstudio:1.0.1 ns-train instant-ngp \
            --vis wandb \
            --data "/workspace" 

Expected behavior
Running the above command with the container tag "0.3.4" works correctly.
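
A quick sanity check for this class of error is to compare the compute capability the GPU reports with what the PyTorch build inside the image was compiled for; the failing kernels come from the CUDA extensions rather than PyTorch itself, but a mismatch here is still a useful signal. This is only a diagnostic sketch: it assumes a reasonably recent nvidia-smi (the compute_cap query needs newer drivers) and that python/torch are directly runnable inside the image.

# On the host: compute capability reported by the driver
nvidia-smi --query-gpu=name,compute_cap --format=csv

# Inside the image: capability seen by PyTorch and the arch list it was built for
docker run --rm --gpus "device=1" dromni/nerfstudio:1.0.1 \
            python -c "import torch; print(torch.cuda.get_device_capability(0), torch.cuda.get_arch_list())"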

yuefengyf commented 9 months ago

+1, a similar issue is also seen with splatfacto in the docker images "1.0.0" and "1.0.1". Here is the error I got:

[Screenshot of the error message, 2024-02-14]

rowellz commented 9 months ago

I am getting similar issues as well for both 1.0.0 and 1.0.1 images :(

jkulhanek commented 9 months ago

Hi, can you please try the nerfstudio/nerfstudio:latest image?
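
For reference, that is the original reproduce command with only the image name swapped (a sketch; the paths, ports, and volumes are taken from the report above and will need adjusting to your setup):

wandb docker-run \
            --gpus "device=1" \
            -u $(id -u) \
            -v "datadir:/workspace" \
            -v '/home/user/.cache/:/home/user/.cache/' \
            -p 7017:7007 \
            --rm -it --shm-size=12gb \
            nerfstudio/nerfstudio:latest ns-train instant-ngp \
            --vis wandb \
            --data "/workspace"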

rowellz commented 9 months ago

That seems to do the trick for me, TYSM! I've been using gaussian splatting with the dromni/nerfstudio:main image for a while now. Is there any difference or improvement with the splatfacto method in the nerfstudio/nerfstudio:latest image?

Edit: Just wanted to say I am running on an RTX 3060 12GB

jkulhanek commented 9 months ago

Good to hear it works. I am still testing the image, so please let me know if you find any issues. Unfortunately, I don't have access to the Dockerfile used to build dromni/nerfstudio:main, but the nerfstudio/nerfstudio:latest image simply contains the latest commit on the main branch of nerfstudio.

kurtjcu commented 9 months ago

I was having the same issue with all of the latest images as well. I built an image from the main branch with Docker myself, and then it worked.
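
For anyone wanting to do the same, a minimal sketch of building from the main branch (nerfstudio-local is just a placeholder tag; this assumes the Dockerfile sits at the repository root):

git clone https://github.com/nerfstudio-project/nerfstudio.git
cd nerfstudio
docker build -t nerfstudio-local .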

eduardohenriquearnold commented 5 months ago

I was having the same issue on an NVIDIA T4, but the nerfstudio/nerfstudio:latest image seems to work like a charm - thanks @jkulhanek.

May I ask what you have changed? Looking online, it seems to be something related to the supported CUDA architectures, but the official nerfstudio docker image appears to contain the most common archs, including sm_75, which corresponds to the T4. So I'm curious what specifically addresses the issue here.
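
One way to check which architectures an image's binaries were actually compiled for (the usual source of the "no kernel image" error) is to print PyTorch's compiled arch list inside the container. Note that the instant-ngp and splatfacto kernels come from separately built extensions (tiny-cuda-nn, gsplat), so their arch lists can differ from PyTorch's; this is only a partial diagnostic sketch:

# Inside the container: arch list the PyTorch build supports
python -c "import torch; print(torch.cuda.get_arch_list())"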

jinwookpark commented 2 months ago

I solved it by modifying CUDA_ARCHITECTURES in the Dockerfile and then rebuilding the Docker image. (My GPU is an NVIDIA RTX 3090.)

Before:

ARG CUDA_ARCHITECTURES=90;89;86;80;75;70;61;52;37

After:

ARG CUDA_ARCHITECTURES=90;89;86;80;75;70
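
Since CUDA_ARCHITECTURES is a Dockerfile ARG, it should also be possible to override it at build time without editing the file. A sketch, run from the repository root (nerfstudio-custom is just a placeholder tag):

docker build \
            --build-arg CUDA_ARCHITECTURES="90;89;86;80;75;70" \
            -t nerfstudio-custom .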

jkulhanek commented 2 months ago

> I was having the same issue on an NVIDIA T4, but the nerfstudio/nerfstudio:latest image seems to work like a charm - thanks @jkulhanek.
>
> May I ask what you have changed? Looking online, it seems to be something related to the supported CUDA architectures, but the official nerfstudio docker image appears to contain the most common archs, including sm_75, which corresponds to the T4. So I'm curious what specifically addresses the issue here.

Sorry, but I don't have access to the Dockerfiles used by dromni to build the dromni/nerfstudio images, so I don't know which CUDA archs they were built with.

jkulhanek commented 2 months ago

> I solved it by modifying CUDA_ARCHITECTURES in the Dockerfile and then rebuilding the Docker image. (My GPU is NVIDIA RTX 3090)
>
> ARG CUDA_ARCHITECTURES=90;89;86;80;75;70;61;52;37
> ARG CUDA_ARCHITECTURES=90;89;86;80;75;70

I believe 3090 has cuda compute 75, so the default docker image should work just fine. Are you having issues?

jinwookpark commented 2 months ago

> > I solved it by modifying CUDA_ARCHITECTURES in the Dockerfile and then rebuilding the Docker image. (My GPU is NVIDIA RTX 3090)
> >
> > ARG CUDA_ARCHITECTURES=90;89;86;80;75;70;61;52;37
> > ARG CUDA_ARCHITECTURES=90;89;86;80;75;70
>
> I believe 3090 has cuda compute 75, so the default docker image should work just fine. Are you having issues?

The compute capability of the GeForce RTX 3090 is 8.6 (https://developer.nvidia.com/cuda-gpus). Although it's hard to pinpoint the exact reason, when older architectures such as 61;52;37 were included in CUDA_ARCHITECTURES, the same issues reported by other users occurred. Like everyone else, I discovered this solution through a lot of trial and error. :)
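
For the instant-ngp case specifically, the failing kernels typically come from tiny-cuda-nn. As far as I know, its build also honors a TCNN_CUDA_ARCHITECTURES environment variable, so reinstalling the bindings inside a running container pinned to 86 may be a workaround that avoids rebuilding the whole image. A hedged sketch, assuming the CUDA toolkit and build tools are present in the container:

# Assumption: tiny-cuda-nn's setup reads TCNN_CUDA_ARCHITECTURES to pick target archs
export TCNN_CUDA_ARCHITECTURES=86   # RTX 3090 = compute capability 8.6
pip install --force-reinstall "git+https://github.com/NVlabs/tiny-cuda-nn/#subdirectory=bindings/torch"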