graphdeco-inria / gaussian-splatting

Original reference implementation of "3D Gaussian Splatting for Real-Time Radiance Field Rendering"
https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/

RuntimeError: CUDA error: an illegal memory access was encountered #41

Open seohoiki3215 opened 1 year ago

seohoiki3215 commented 1 year ago

Hello, I was impressed by your work and tried to reproduce it with the code you provided. However, every run fails with the runtime error mentioned in the title.

Traceback (most recent call last):
  File "train.py", line 213, in <module>
    training(lp.extract(args), op.extract(args), pp.extract(args), args.test_iterations, args.save_iterations, args.checkpoint_iterations, args.start_checkpoint)
  File "train.py", line 87, in training
    loss = (1.0 - opt.lambda_dssim) * Ll1 + opt.lambda_dssim * (1.0 - ssim(image, gt_image))
  File "/home/seohoiki/Research/NeRF/gaussian-splatting/utils/loss_utils.py", line 38, in ssim
    window = window.cuda(img1.get_device())
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Training progress: 0%| | 0/30000 [00:00<?, ?it/s]

I tried all the fixes suggested in other issues, but none of them worked. My system and settings: RTX 4090, Ubuntu 22.04 LTS, and the exact environment from the provided .yml file.

Strangely, a colleague with an RTX 3090 on Ubuntu 20.04 runs the code without any problem. (Apart from the GPU and the OS version, all settings are exactly the same, including the CUDA SDK version.)

I hope we can find a solution to this problem!

Thank you.

=====================================
Results with cuda-memcheck

========= CUDA-MEMCHECK
========= This tool is deprecated and will be removed in a future release of the CUDA toolkit
========= Please use the compute-sanitizer tool as a drop-in replacement
Optimizing
Output folder: ./output/54877260-0 [17/07 19:21:51]
Tensorboard not available: not logging progress [17/07 19:21:51]
Found transforms_train.json file, assuming Blender data set! [17/07 19:21:51]
Reading Training Transforms [17/07 19:21:51]
Reading Test Transforms [17/07 19:21:53]
Loading Training Cameras [17/07 19:21:56]
Loading Test Cameras [17/07 19:21:57]
Number of points at initialisation : 100000 [17/07 19:21:57]
Training progress: 0%| | 0/30000 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "train.py", line 213, in <module>
    training(lp.extract(args), op.extract(args), pp.extract(args), args.test_iterations, args.save_iterations, args.checkpoint_iterations, args.start_checkpoint)
  File "train.py", line 87, in training
    loss = (1.0 - opt.lambda_dssim) * Ll1 + opt.lambda_dssim * (1.0 - ssim(image, gt_image))
  File "/home/seohoiki/Research/NeRF/gaussian-splatting/utils/loss_utils.py", line 38, in ssim
    window = window.cuda(img1.get_device())
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Training progress: 0%| | 0/30000 [00:00<?, ?it/s]
========= ERROR SUMMARY: 0 errors
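The error text itself warns that CUDA kernel errors can be reported asynchronously, so the Python line shown in the traceback (window.cuda(...) inside ssim) is not necessarily where the fault happened. The following is a minimal sketch of re-running training with CUDA_LAUNCH_BLOCKING=1, as the message suggests. The helper names are hypothetical, and it assumes training is started as `python train.py -s <dataset path>`.

```python
import os
import subprocess
import sys


def blocking_env(base_env=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base_env is None else base_env)
    # With this variable set, kernel launches are synchronous, so an error is
    # reported at the call that triggered it instead of at a later API call.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def run_blocking(train_args):
    """Run train.py (assumed to be in the current directory) with blocking launches."""
    cmd = [sys.executable, "train.py", *train_args]
    return subprocess.run(cmd, env=blocking_env()).returncode


# Hypothetical usage, with a placeholder dataset path:
#   run_blocking(["-s", "/path/to/nerf_synthetic/lego"])
```

Runs are slower in this mode, but the traceback should then point closer to the call that actually faulted.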

Snosixtyboo commented 1 year ago

Hi,

I have been trying to get to the bottom of this, but was unable to reproduce it so far. Would you by any chance be available for a Skype (or similar) session to run through it?

Snosixtyboo commented 1 year ago

Also, one question: I see the message "Found transforms_train.json file, assuming Blender data set!" in your log.

Are you in fact running it on the Blender data set?

seohoiki3215 commented 1 year ago

I'm running the code on the nerf_synthetic dataset. The colleague I mentioned successfully ran your code on the exact same dataset. Link: https://drive.google.com/drive/folders/128yBriW1IG_3NJ5Rp7APSTZsJqdJdfc1

As for the Skype session, could we use Zoom instead?

Snosixtyboo commented 1 year ago

Thanks for the suggestion, but I have now done a debug session with another user who has the same problem. It looks like I will need to add more diagnostics before I can find out what's going on. I'll let you know when I find out more :)

Snosixtyboo commented 1 year ago

Hi @seohoiki3215, I have finally finished a debug version of the rasterizer; I hope this will help. To use it, please do

git pull
git submodule update
pip uninstall diff-gaussian-rasterization (yes)
pip install submodules/diff-gaussian-rasterization

and then run what failed before with --debug. This is slow, so if it takes a while for the error to appear, you can also use --debug_from <iteration> to start debugging only at a certain point. If everything goes well, you should get an error message and a snapshot_fw or snapshot_bw file in the gaussian_splatting directory. If you could forward this file to us, we could take a look and see whether we find something wrong!

Best, Bernhard
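To make forwarding the snapshot easier, here is a small sketch that zips whatever snapshot files are present in the repository directory. The function name and archive name are hypothetical. The file name snapshot_fw.dump matches the rasterizer message reported later in this thread; snapshot_bw.dump is assumed by analogy for the backward pass.

```python
import zipfile
from pathlib import Path

# "snapshot_fw.dump" appears in the rasterizer's error message in this thread;
# "snapshot_bw.dump" is an assumption by analogy for the backward pass.
SNAPSHOT_NAMES = ("snapshot_fw.dump", "snapshot_bw.dump")


def zip_snapshots(repo_dir, archive_name="snapshots.zip"):
    """Zip any debug snapshots found in repo_dir; return the archive path or None."""
    repo = Path(repo_dir)
    found = [repo / name for name in SNAPSHOT_NAMES if (repo / name).is_file()]
    if not found:
        return None  # nothing was dumped, e.g. the run did not fail in debug mode
    archive = repo / archive_name
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in found:
            # Store only the file name, not the full local path.
            zf.write(path, arcname=path.name)
    return archive
```

Calling `zip_snapshots(".")` from the repository root after a failed `--debug` run would produce a single archive that can be attached to the issue.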

seohoiki3215 commented 1 year ago

Thank you for the update on this issue. I re-ran the code following your procedure, and here is the result!

snapshot_fw.zip

Optimizing
Output folder: ./output/9feda2d2-9 [24/07 11:03:34]
Tensorboard not available: not logging progress [24/07 11:03:34]
Found transforms_train.json file, assuming Blender data set! [24/07 11:03:34]
Reading Training Transforms [24/07 11:03:34]
Reading Test Transforms [24/07 11:03:36]
Loading Training Cameras [24/07 11:03:40]
Loading Test Cameras [24/07 11:03:42]
Number of points at initialisation : 100000 [24/07 11:03:42]
Training progress: 0%| | 0/30000 [00:00<?, ?it/s]
[CUDA ERROR] in cuda_rasterizer/rasterizer_impl.cu
Line 298: an illegal memory access was encountered
An error occured in forward. Please forward snapshot_fw.dump for debugging. [24/07 11:03:42]
Traceback (most recent call last):
  File "train.py", line 216, in <module>
    training(lp.extract(args), op.extract(args), pp.extract(args), args.test_iterations, args.save_iterations, args.checkpoint_iterations, args.start_checkpoint, args.debug_from)
  File "train.py", line 83, in training
    render_pkg = render(viewpoint_cam, gaussians, pipe, background)
  File "/home/seohoiki/Research/NeRF/gaussian-splatting/gaussian_renderer/__init__.py", line 93, in render
    cov3D_precomp = cov3D_precomp)
  File "/home/seohoiki/anaconda3/envs/gaussian_splatting/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/seohoiki/anaconda3/envs/gaussian_splatting/lib/python3.7/site-packages/diff_gaussian_rasterization/__init__.py", line 219, in forward
    raster_settings,
  File "/home/seohoiki/anaconda3/envs/gaussian_splatting/lib/python3.7/site-packages/diff_gaussian_rasterization/__init__.py", line 41, in rasterize_gaussians
    raster_settings,
  File "/home/seohoiki/anaconda3/envs/gaussian_splatting/lib/python3.7/site-packages/diff_gaussian_rasterization/__init__.py", line 90, in forward
    raise ex
  File "/home/seohoiki/anaconda3/envs/gaussian_splatting/lib/python3.7/site-packages/diff_gaussian_rasterization/__init__.py", line 86, in forward
    num_rendered, color, radii, geomBuffer, binningBuffer, imgBuffer = _C.rasterize_gaussians(*args)
RuntimeError: an illegal memory access was encountered
Training progress: 0%|

Snosixtyboo commented 1 year ago

Hi,

So I tried it; unfortunately, it just works for me, and the state you submitted is valid. I have to say I'm running out of ideas about what this could be ☹️. I have only seen the issue happen on Linux so far. Are there other GPUs in your machine? Are your GPU drivers up to date?

Best, Bernhard

seohoiki3215 commented 1 year ago

I am sorry to hear that the error is not reproducible. ;( I have a single RTX 4090 in my system, and the driver is up to date (535). The CUDA toolkit version is 11.7.
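Since the questions above come down to which GPU, driver, and CUDA build are in use, a short script can gather those details in one place for a bug report. This is a sketch under assumptions: the function name and report keys are hypothetical, and it only relies on standard PyTorch calls plus nvidia-smi if it is on the PATH.

```python
import shutil
import subprocess


def environment_report():
    """Collect GPU, driver, and CUDA details worth attaching to a bug report."""
    report = {}
    try:
        import torch
    except ImportError:
        torch = None
    if torch is None:
        report["torch"] = None
    else:
        report["torch"] = torch.__version__
        # CUDA version PyTorch was built against (may differ from the local toolkit).
        report["torch_cuda"] = torch.version.cuda
        report["cuda_available"] = torch.cuda.is_available()
        if report["cuda_available"]:
            report["gpus"] = [
                (torch.cuda.get_device_name(i), torch.cuda.get_device_capability(i))
                for i in range(torch.cuda.device_count())
            ]
    if shutil.which("nvidia-smi"):
        # Ask the driver directly for the GPU name and driver version.
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"],
            capture_output=True,
            text=True,
        )
        report["nvidia_smi"] = result.stdout.strip().splitlines()
    return report


for key, value in environment_report().items():
    print(f"{key}: {value}")
```

Comparing this output between a failing machine and a working one (such as the RTX 4090 and RTX 3090 systems mentioned earlier) may help narrow down what differs.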

stevenygd commented 1 year ago

I am also encountering this error. Is there any help or update? Here is the debug message I got:

[CUDA ERROR] in /home/gaussian-splatting/submodules/diff-gaussian-rasterization/cuda_rasterizer/rasterizer_impl.cu
Line 298: an illegal memory access was encountered
An error occured in forward. Please forward snapshot_fw.dump for debugging. [05/09 01:11:48]
Traceback (most recent call last):
  File "train.py", line 216, in <module>
    training(lp.extract(args), op.extract(args), pp.extract(args), args.test_iterations, args.save_iterations, args.checkpoint_iterations, args.start_checkpoint, args.debug_from)
  File "train.py", line 83, in training
    render_pkg = render(viewpoint_cam, gaussians, pipe, background)
  File "/home/gaussian-splatting/gaussian_renderer/__init__.py", line 93, in render
    cov3D_precomp = cov3D_precomp)
  File "/home//miniconda3/envs/gaussian_splatting/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/miniconda3/envs/gaussian_splatting/lib/python3.7/site-packages/diff_gaussian_rasterization/__init__.py", line 219, in forward
    raster_settings, 
  File "/home//miniconda3/envs/gaussian_splatting/lib/python3.7/site-packages/diff_gaussian_rasterization/__init__.py", line 41, in rasterize_gaussians
    raster_settings,
  File "/home//miniconda3/envs/gaussian_splatting/lib/python3.7/site-packages/diff_gaussian_rasterization/__init__.py", line 90, in forward
    raise ex
  File "/home//miniconda3/envs/gaussian_splatting/lib/python3.7/site-packages/diff_gaussian_rasterization/__init__.py", line 86, in forward
    num_rendered, color, radii, geomBuffer, binningBuffer, imgBuffer = _C.rasterize_gaussians(*args)
RuntimeError: an illegal memory access was encountered

fatbao55 commented 11 months ago

@seohoiki3215 did you manage to resolve this?

fatbao55 commented 11 months ago

@Snosixtyboo This is my dump file snapshot_fw.zip obtained with the debug version of the rasterizer.

This is my error:

    num_rendered, color, radii, geomBuffer, binningBuffer, imgBuffer = _C.rasterize_gaussians(*args)
RuntimeError: an illegal memory access was encountered
Training progress: 0%| | 0/30000 [00:00<?, ?it/s]

I'm running Ubuntu 20.04 with CUDA 11.8, an RTX 3090, and driver 520. Do you have any advice on how to resolve this?

junseo013 commented 11 months ago

@fatbao55 Please check this PR: https://github.com/graphdeco-inria/diff-gaussian-rasterization/pull/10 In my case, adding the "-Xcompiler -fno-gnu-unique" option at line 29 of submodules/diff-gaussian-rasterization/setup.py resolves the illegal memory access error in training.

...
29 extra_compile_args={"nvcc": ["-Xcompiler", "-fno-gnu-unique","-I" + os.path.join(os.path.dirname(os.path.abspath(__file__)), "third_party/glm/")]})
...

After changing the code, reinstall the module with:

pip uninstall diff-gaussian-rasterization -y && pip install submodules/diff-gaussian-rasterization
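For readers unfamiliar with the two flags, here is a standalone sketch of how the nvcc argument list from the fix above is assembled. The function name is hypothetical and this is not the repository's setup.py. `-Xcompiler` tells nvcc to forward the next option to the host compiler, and `-fno-gnu-unique` is a GCC option that disables the STB_GNU_UNIQUE symbol binding. The thread does not establish why this avoids the illegal memory access, only that several users report it does.

```python
import os


def nvcc_compile_args(setup_dir):
    """Build the nvcc argument list, including the host-compiler flag from the fix."""
    glm_include = "-I" + os.path.join(setup_dir, "third_party/glm/")
    return [
        # Forward the following option to the host compiler (e.g. g++).
        "-Xcompiler",
        # GCC option: do not use the STB_GNU_UNIQUE symbol binding.
        "-fno-gnu-unique",
        # Include path for the bundled GLM headers, as in the posted line.
        glm_include,
    ]


print(nvcc_compile_args(os.getcwd()))
```

In setup.py, a list like this would be passed as `extra_compile_args={"nvcc": [...]}`. Note that the posted line keeps `-Xcompiler` and `-fno-gnu-unique` as two separate list entries, since each list element becomes one command-line argument.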

fatbao55 commented 11 months ago

@jsl013 This worked for me, thanks so much!

FantasticOven2 commented 10 months ago

The -fno-gnu-unique fix from @junseo013 was a lifesaver for me. After two days of debugging across 4 different clusters, it finally helped me solve the problem on Ubuntu.

mushroonhead commented 10 months ago

I had the same issue with diff-gaussian-rasterization, and the -fno-gnu-unique fix from @junseo013 solved it for me. I am running a WSL2 Ubuntu 20.04 setup with the CUDA 11.8 toolkit.

ShuzhaoXie commented 10 months ago

ORZ, I have installed the debug version of the rasterizer. Could anyone tell me how to use the '--debug' argument? I added it to the render.py command but got the following error:

Input:

python render.py --debug ...

Output:

usage: render.py [-h] [--sh_degree SH_DEGREE] [--source_path SOURCE_PATH]
                    [--model_path MODEL_PATH] [--images IMAGES]
                    [--resolution RESOLUTION] [--white_background] [--eval]
                    [--convert_SHs_python] [--compute_cov3D_python]
                    [--iteration ITERATION]

jhq1234 commented 8 months ago

The -fno-gnu-unique fix from @junseo013 works for me! I appreciate the kind tip!