CUDA error: all CUDA-capable devices are busy or unavailable

cr7Por commented 9 months ago

  File "./thirdparty/gaussian_splatting/scene/cameras.py", line 53, in __init__
    self.original_image = image.clamp(0.0, 1.0).to(self.data_device)
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

cr7Por commented 9 months ago

My CUDA version is 11.7. Is this a problem?

lizhan17 commented 9 months ago

Hi, could you tell me what this returns?

import torch
torch.version.cuda

And could you tell me what your GPU model is?
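The questions above (PyTorch build, CUDA version, GPU model) can be answered in one pass. The following is a minimal sketch, not part of the repository; the function name cuda_environment_report is hypothetical, and it assumes only the standard torch.version.cuda, torch.cuda.is_available, torch.cuda.device_count and torch.cuda.get_device_name calls.

```python
import importlib


def cuda_environment_report():
    """Collect the PyTorch/CUDA facts asked for in this thread (hypothetical helper)."""
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        # PyTorch is not installed in this interpreter.
        return {"torch_installed": False}

    report = {
        "torch_installed": True,
        "torch_version": torch.__version__,    # a string such as '1.12.1+cu116'
        "built_for_cuda": torch.version.cuda,  # None on CPU-only builds
        "cuda_available": torch.cuda.is_available(),
        "gpu_names": [],
    }
    if report["cuda_available"]:
        # One entry per device that this process is allowed to see.
        report["gpu_names"] = [
            torch.cuda.get_device_name(i)
            for i in range(torch.cuda.device_count())
        ]
    return report


if __name__ == "__main__":
    for key, value in cuda_environment_report().items():
        print(f"{key}: {value}")
```

Pasting the printed lines into a report gives the maintainer the same information as the individual checks requested in this thread.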

cr7Por commented 9 months ago

>>> import torch
>>> torch.version.cuda
'11.6'

I am using an RTX 3090.

lizhan17 commented 9 months ago

Could you add CUDA_LAUNCH_BLOCKING=1 before executing the script, like CUDA_LAUNCH_BLOCKING=1 python ...

cr7Por commented 9 months ago

CUDA_LAUNCH_BLOCKING=1 python train.py --quiet --eval --config configs/n3d_lite/cut_roasted_beef.json --model_path log/cut_beef --source_path cut_roasted_beef/colmap_0

It returns the RuntimeError immediately; there is no time for GPU memory usage to grow.

  File "train.py", line 69, in train
    scene = Scene(dataset, gaussians, duration=duration, loader=dataset.loader)
  File "/home/ubuntu/liudong/SpacetimeGaussians/thirdparty/gaussian_splatting/scene/__init__.py", line 99, in __init__
    self.train_cameras[resolution_scale] = cameraList_from_camInfosv2(scene_info.train_cameras, resolution_scale, args)
  File "./thirdparty/gaussian_splatting/utils/camera_utils.py", line 260, in cameraList_from_camInfosv2
    camera_list.append(loadCamv2(args, id, c, resolution_scale))
  File "./thirdparty/gaussian_splatting/utils/camera_utils.py", line 109, in loadCamv2
    image_name=cam_info.image_name, uid=id, data_device=args.data_device, near=cam_info.near, far=cam_info.far, timestamp=cam_info.timestamp, rayo=rays_o, rayd=rays_d,cxr=cam_info.cxr,cyr=cam_info.cyr)
  File "./thirdparty/gaussian_splatting/scene/cameras.py", line 53, in __init__
    self.original_image = image.clamp(0.0, 1.0).to(self.data_device)
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable

lizhan17 commented 9 months ago

Could you try the solution in this post? https://discuss.pytorch.org/t/distributeddataparallel-runtimeerror-cuda-error-all-cuda-capable-devices-are-busy-or-unavailable/102763/5

cr7Por commented 9 months ago

SpacetimeGaussians$ nvidia-smi -i 0 -c 0
Compute mode is already set to DEFAULT for GPU 00000000:01:00.0.
All done.

CUDA_LAUNCH_BLOCKING=1 python train.py --quiet --eval --config configs/n3d_lite/cut_roasted_beef.json --model_path log/cut_beef --source_path cut_roasted_beef/colmap_0

Still the same runtime error.
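The compute-mode check above can also be read back without changing anything. This is a small sketch that assumes nvidia-smi is on the PATH and supports the --query-gpu=compute_mode field; the helper name query_compute_mode is hypothetical. It returns None when the tool is missing or fails, so it is safe to run on any machine.

```python
import shutil
import subprocess


def query_compute_mode():
    """Return one compute-mode string per GPU, or None if nvidia-smi is unusable."""
    if shutil.which("nvidia-smi") is None:
        return None
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=compute_mode", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            check=True,
            timeout=30,
        )
    except (subprocess.SubprocessError, OSError):
        # Covers a non-zero exit status, a timeout, and a failed launch.
        return None
    # One line of output per GPU; drop empty lines.
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    print(query_compute_mode())
```

On the machine in this thread the list should contain a single entry for the default mode, matching the "already set to DEFAULT" message, which suggests an exclusive or prohibited compute mode is not what is blocking the device here.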

lizhan17 commented 9 months ago

can you get your torch.__version__?

cr7Por commented 9 months ago

Python 3.7.13 (default, Oct 18 2022, 18:57:03)
[GCC 11.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.__version__
'1.12.1+cu116'

lizhan17 commented 9 months ago

How about this “CUDA_LAUNCH_BLOCKING=1 CUDA_VISIBLE_DEVICES=0 python train.py ... ”

You can try different CUDA_VISIBLE_DEVICES values: 0, 1, and so on up to 5, ...
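Both variables are read when CUDA is initialized, so they have to be in the environment before the training process starts. As a rough sketch, the same thing can be done from Python by building the environment for a child process; the helper name cuda_debug_env is hypothetical, and the child command here only echoes the variables instead of running train.py.

```python
import os
import subprocess
import sys


def cuda_debug_env(device_index=0, base_env=None):
    """Copy an environment and add the two debugging variables from this thread."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_LAUNCH_BLOCKING"] = "1"                # make kernel launches synchronous
    env["CUDA_VISIBLE_DEVICES"] = str(device_index)  # expose only this GPU index
    return env


if __name__ == "__main__":
    # Stand-in for "python train.py ...": the child just prints what it received.
    child = (
        "import os; "
        "print(os.environ['CUDA_VISIBLE_DEVICES'], os.environ['CUDA_LAUNCH_BLOCKING'])"
    )
    subprocess.run([sys.executable, "-c", child], env=cuda_debug_env(0), check=True)
```

On a machine with a single GPU, 0 is the only index that refers to a real device, so trying other values is only useful on multi-GPU systems.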

cr7Por commented 9 months ago

CUDA_LAUNCH_BLOCKING=1 CUDA_VISIBLE_DEVICES=0 python train.py --quiet --eval --config configs/n3d_lite/cut_roasted_beef.json --model_path log/cut_beef --source_path cut_roasted_beef/colmap_0

Still the same runtime error. I only have one RTX 3090 in my system.

lizhan17 commented 9 months ago

What is the output of your nvidia-smi?

cr7Por commented 9 months ago

Every 2.0s: nvidia-smi                         ubuntu: Tue Jan 2 13:28:28 2024

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1176      G   /usr/lib/xorg/Xorg                 56MiB |
|    0   N/A  N/A      1385      G   /usr/bin/gnome-shell               11MiB |
+-----------------------------------------------------------------------------+

cr7Por commented 9 months ago

The runtime error is returned immediately, and GPU memory usage does not change.

lizhan17 commented 9 months ago

Based on the 3D Gaussian Splatting authors' suggestion for their code, and since your CUDA driver is 12.0, I think you can install Python 3.8, PyTorch 2.0.0, and CUDA 12:

If you can afford the disk space, we recommend using our environment files for setting up a training environment identical to ours. If you want to make modifications, please note that major version changes might affect the results of our method. However, our (limited) experiments suggest that the codebase works just fine inside a more up-to-date environment (Python 3.8, PyTorch 2.0.0, CUDA 12). Make sure to create an environment where PyTorch and its CUDA runtime version match and the installed CUDA SDK has no major version difference with PyTorch's CUDA version. https://github.com/graphdeco-inria/gaussian-splatting
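The "no major version difference" rule quoted above can be checked mechanically from the version strings already posted in this thread. The sketch below is illustrative only: both function names are hypothetical, and it assumes the PyTorch build tag has the form +cuXYZ with a single-digit CUDA minor version (for example cu116 for 11.6, cu121 for 12.1).

```python
import re


def cuda_major_from_torch_version(torch_version):
    """Extract the CUDA major version from a string like '1.12.1+cu116'."""
    match = re.search(r"\+cu(\d+)$", torch_version)
    if match is None:
        return None  # CPU-only build or an unrecognised tag
    digits = match.group(1)
    # Assumption: the last digit is the minor version, the rest is the major.
    return int(digits[:-1])


def same_cuda_major(torch_version, toolkit_version):
    """True if PyTorch's CUDA build and the installed toolkit share a major version."""
    torch_major = cuda_major_from_torch_version(torch_version)
    if torch_major is None:
        return False
    return torch_major == int(toolkit_version.split(".")[0])


if __name__ == "__main__":
    # Values reported earlier in this thread.
    print(same_cuda_major("1.12.1+cu116", "11.7"))
```

With the values reported at the start of the thread (a cu116 build and CUDA 11.7), the two share major version 11, so by this rule alone the original environment would not count as mismatched.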

cr7Por commented 9 months ago

OK, I will give it a try. Thank you very much.

cr7Por commented 9 months ago

Python 3.8.18 (default, Sep 11 2023, 13:40:15)
[GCC 11.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.version
<module 'torch.version' from '/home/ubuntu/anaconda3/envs/colmapenv/lib/python3.8/site-packages/torch/version.py'>
>>> torch.__version__
'2.1.2+cu121'

same runtime error here.

scene/cameras.py:53 in __init__

   50 │         # image is real image
   51 │         if not isinstance(image, tuple):
   52 │             if "camera_" not in image_name:
 ❱ 53 │                 self.original_image = image.clamp(0.0, 1.0).to(self.data_device)
   54 │             else:
   55 │                 self.original_image = image.clamp(0.0, 1.0).half().to(self.data_device)
   56 │             self.image_width = self.original_image.shape[2]

RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

lizhan17 commented 9 months ago

how about just

import torch

torch.ones((1, 1)).to('cuda')

cr7Por commented 9 months ago

conda activate feature_splatting
(feature_splatting) ubuntu@ubuntu:~$ python
Python 3.7.13 (default, Oct 18 2022, 18:57:03)
[GCC 11.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.ones((1, 1)).to('cuda')
tensor([[1.]], device='cuda:0')

that is ok.

lizhan17 commented 9 months ago

How about replacing self.data_device with 'cuda' at scene/cameras.py:53?

cr7Por commented 9 months ago

  File "./thirdparty/gaussian_splatting/utils/camera_utils.py", line 260, in cameraList_from_camInfosv2
    camera_list.append(loadCamv2(args, id, c, resolution_scale))
  File "./thirdparty/gaussian_splatting/utils/camera_utils.py", line 109, in loadCamv2
    image_name=cam_info.image_name, uid=id, data_device=args.data_device, near=cam_info.near, far=cam_info.far, timestamp=cam_info.timestamp, rayo=rays_o, rayd=rays_d,cxr=cam_info.cxr,cyr=cam_info.cyr)
  File "./thirdparty/gaussian_splatting/scene/cameras.py", line 53, in __init__
    self.original_image = image.clamp(0.0, 1.0).to('cuda')
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable

Same error.

lizhan17 commented 9 months ago

Could you print the dtype of image.clamp(0.0, 1.0) ?

cr7Por commented 9 months ago

torch.float32 [03/01 16:17:10]

yhyhdyb commented 8 months ago

same problem, solved by changing terminal @~@

FooAuto commented 6 months ago

I encountered exactly the same problem, and I fixed it by decreasing the "duration" value in the config file. I guess the problem arose from something like CUDA_OUT_OF_MEMORY. Hope this helps.
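To see why a smaller "duration" could matter, here is a back-of-the-envelope estimate of how much memory the training images take if they are all moved to the GPU up front. It is a sketch under stated assumptions: one float32 image (4 bytes per value, as printed earlier in the thread) per camera per frame, with the frame count per camera equal to "duration". The function name and every number in the example loop are hypothetical, not taken from the repository or its configs.

```python
def image_bytes_on_device(num_cameras, duration, height, width,
                          channels=3, bytes_per_value=4):
    """Bytes needed if every frame of every camera is kept on the GPU.

    Assumes one image per camera per frame; bytes_per_value=4 corresponds to
    torch.float32 and 2 to torch.float16.
    """
    frames = num_cameras * duration
    return frames * channels * height * width * bytes_per_value


if __name__ == "__main__":
    # Hypothetical dataset: 20 cameras at 1352x1014, for three durations.
    for duration in (50, 20, 10):
        gib = image_bytes_on_device(20, duration, 1014, 1352) / 2**30
        print(f"duration={duration}: {gib:.1f} GiB of image data")
```

Under these assumptions the estimate scales linearly with "duration", so comparing it against the card's memory gives a rough idea of whether the image data alone could exhaust it.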

oppo-us-research / SpacetimeGaussians

CUDA error: all CUDA-capable devices are busy or unavailable #5