Open seohoiki3215 opened 1 year ago
Hi,
I have been trying to get to the bottom of this, but was unable to reproduce it so far. Would you by any chance be available for a Skype (or similar) session to run through it?
Also one question: I see the message " Found transforms_train.json file, assuming Blender data set! [17/07 19:21:51] "
Are you in fact running it on the Blender data set?
I'm running the code with the nerf_synthetic dataset. The colleague I mentioned succeeded in running your code on the exact same dataset. Link: https://drive.google.com/drive/folders/128yBriW1IG_3NJ5Rp7APSTZsJqdJdfc1
As for the Skype session, could we do it over Zoom instead?
Thanks for suggesting, but I did a debug session now with another user for the same problem. It looks like I will need to add more diagnostics before I can find out what's going on. I'll let you know when I find out more :)
Hi @seohoiki3215 I finally managed to do the debug version of the rasterizer, I hope this will help. To use it, please do
git pull
git submodule update
pip uninstall diff-gaussian-rasterization (yes)
pip install submodules/diff-gaussian-rasterization
and then run what failed before with --debug. This is slow, so if it takes a while for the error to appear, you can also use --debug_from <iteration> to start debugging only at a certain point. If everything goes well, you should get an error message and a snapshot_fw or snapshot_bw file in the gaussian_splatting directory. If you could forward this file to us, we could take a look to see if we find something wrong!
Best, Bernhard
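For anyone scripting the debug run described above, here is a minimal sketch of how the command line could be assembled. Only `--debug` and `--debug_from` come from this thread; the use of `--source_path` with train.py, the dataset path, and the iteration number are assumptions for illustration.

```python
import shlex


def build_debug_command(source_path, debug_from=None):
    """Return the argv list for a training run with the debug rasterizer enabled.

    Assumption: train.py accepts --source_path, as render.py's usage text suggests.
    """
    cmd = ["python", "train.py", "--source_path", source_path, "--debug"]
    if debug_from is not None:
        # Debug mode is slow, so it can be switched on only from a given iteration.
        cmd += ["--debug_from", str(debug_from)]
    return cmd


if __name__ == "__main__":
    # Hypothetical dataset path and iteration number, for illustration only.
    print(shlex.join(build_debug_command("data/nerf_synthetic/lego", debug_from=500)))
```

The helper only builds the command; running it (for example with `subprocess.run`) is left to the caller, so the snapshot_fw or snapshot_bw file can be collected afterwards.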
Thank you for giving me some updates for the issue. I've re-run the code with the procedure, and here is the result!
Optimizing
Output folder: ./output/9feda2d2-9 [24/07 11:03:34]
Tensorboard not available: not logging progress [24/07 11:03:34]
Found transforms_train.json file, assuming Blender data set! [24/07 11:03:34]
Reading Training Transforms [24/07 11:03:34]
Reading Test Transforms [24/07 11:03:36]
Loading Training Cameras [24/07 11:03:40]
Loading Test Cameras [24/07 11:03:42]
Number of points at initialisation : 100000 [24/07 11:03:42]
Training progress: 0%| | 0/30000 [00:00<?, ?it/s]
[CUDA ERROR] in cuda_rasterizer/rasterizer_impl.cu
Line 298: an illegal memory access was encountered
An error occured in forward. Please forward snapshot_fw.dump for debugging. [24/07 11:03:42]
Traceback (most recent call last):
File "train.py", line 216, in
Hi,
so I tried it, unfortunately it just works for me, the state you submitted is valid. I have to say I'm running out of ideas what this could be ☹️. I have only seen the issue happen on Linux so far. Are there other GPUs in your machine? Are your GPU drivers up to date?
Best, Bernhard
I am sorry to hear that the error is not reproducible. ;( I have a single RTX 4090 in my system, and the driver is up to date (535). The CUDA toolkit version is 11.7.
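Since the replies in this thread keep coming back to GPU model, driver, and CUDA version, a small script that gathers those details in one place may make reports easier to compare. This is a rough sketch, not part of the repository: `collect_env` is a hypothetical helper, and it reports the CUDA version PyTorch was built against, which can differ from the installed toolkit (`nvcc`) version.

```python
import platform
import sys


def collect_env():
    """Collect basic environment details for a bug report."""
    info = {"os": platform.platform(), "python": sys.version.split()[0]}
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this interpreter; report what we can.
        info["torch"] = None
        return info
    info["torch"] = torch.__version__
    info["torch_cuda"] = torch.version.cuda  # CUDA version PyTorch was built with
    if torch.cuda.is_available():
        info["gpu"] = torch.cuda.get_device_name(0)
        info["compute_capability"] = torch.cuda.get_device_capability(0)
    return info


if __name__ == "__main__":
    for key, value in collect_env().items():
        print(f"{key}: {value}")
```

The driver version is not covered here; it would still have to be read from `nvidia-smi` and added by hand.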
I also encounter this error. Any help/update? Here is the debug message I got:
[CUDA ERROR] in /home/gaussian-splatting/submodules/diff-gaussian-rasterization/cuda_rasterizer/rasterizer_impl.cu
Line 298: an illegal memory access was encountered
An error occured in forward. Please forward snapshot_fw.dump for debugging. [05/09 01:11:48]
Traceback (most recent call last):
File "train.py", line 216, in <module>
training(lp.extract(args), op.extract(args), pp.extract(args), args.test_iterations, args.save_iterations, args.checkpoint_iterations, args.start_checkpoint, args.debug_from)
File "train.py", line 83, in training
render_pkg = render(viewpoint_cam, gaussians, pipe, background)
File "/home/gaussian-splatting/gaussian_renderer/__init__.py", line 93, in render
cov3D_precomp = cov3D_precomp)
File "/home//miniconda3/envs/gaussian_splatting/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/miniconda3/envs/gaussian_splatting/lib/python3.7/site-packages/diff_gaussian_rasterization/__init__.py", line 219, in forward
raster_settings,
File "/home//miniconda3/envs/gaussian_splatting/lib/python3.7/site-packages/diff_gaussian_rasterization/__init__.py", line 41, in rasterize_gaussians
raster_settings,
File "/home//miniconda3/envs/gaussian_splatting/lib/python3.7/site-packages/diff_gaussian_rasterization/__init__.py", line 90, in forward
raise ex
File "/home//miniconda3/envs/gaussian_splatting/lib/python3.7/site-packages/diff_gaussian_rasterization/__init__.py", line 86, in forward
num_rendered, color, radii, geomBuffer, binningBuffer, imgBuffer = _C.rasterize_gaussians(*args)
RuntimeError: an illegal memory access was encountered
@seohoiki3215 did you manage to resolve this?
@Snosixtyboo This is my dump file snapshot_fw.zip obtained with the debug version of the rasterizer.
This is my error:
num_rendered, color, radii, geomBuffer, binningBuffer, imgBuffer = _C.rasterize_gaussians(*args)
RuntimeError: an illegal memory access was encountered
Training progress: 0%| | 0/30000 [00:00<?, ?it/s]
I'm running Ubuntu 20.04 with CUDA 11.8, an RTX 3090, and driver 520. Do you have any advice on how to resolve this?
@fatbao55 Please check this PR: https://github.com/graphdeco-inria/diff-gaussian-rasterization/pull/10 In my case, adding the "-Xcompiler -fno-gnu-unique" option at line 29 of submodules/diff-gaussian-rasterization/setup.py resolved the illegal memory access error in training.
...
29 extra_compile_args={"nvcc": ["-Xcompiler", "-fno-gnu-unique","-I" + os.path.join(os.path.dirname(os.path.abspath(__file__)), "third_party/glm/")]})
...
After changing the code, reinstall the module with
pip uninstall diff-gaussian-rasterization -y && pip install submodules/diff-gaussian-rasterization
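To make the change above easier to see in isolation, here is a small sketch of the same edit expressed as a list operation on the nvcc arguments. `add_no_gnu_unique` is a hypothetical helper, not something in setup.py. For context, `-Xcompiler` tells nvcc to pass the following option to the host compiler, and `-fno-gnu-unique` is a GCC option; why it avoids the illegal memory access is not explained in this thread.

```python
def add_no_gnu_unique(nvcc_args):
    """Return a copy of nvcc_args with '-Xcompiler -fno-gnu-unique' prepended.

    The flag pair is added only once, so calling this repeatedly is harmless.
    """
    if "-fno-gnu-unique" in nvcc_args:
        return list(nvcc_args)
    return ["-Xcompiler", "-fno-gnu-unique"] + list(nvcc_args)


if __name__ == "__main__":
    # Stand-in for the "-I<...>/third_party/glm/" entry built in setup.py.
    original = ["-Ithird_party/glm/"]
    print(add_no_gnu_unique(original))
```

In setup.py the resulting list would be what goes under the "nvcc" key of extra_compile_args, matching the line quoted above.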
@jsl013 This worked for me, thanks so much!
This is a lifesaver for me. After two days of debugging and trying 4 different clusters, this finally helped me solve the problem on Ubuntu.
I had the same issue with diff-gaussian-rasterization as well, and this solved it for me. I am running a WSL2 Ubuntu-20.04 setup with the CUDA 11.8 toolkit.
ORZ, I have installed the debug version. Could anyone tell me how to use the '--debug' arg? I added it to the render.py command but got the following error...
Input:
python render.py --debug ...
Output:
usage: render.py [-h] [--sh_degree SH_DEGREE] [--source_path SOURCE_PATH]
[--model_path MODEL_PATH] [--images IMAGES]
[--resolution RESOLUTION] [--white_background] [--eval]
[--convert_SHs_python] [--compute_cov3D_python]
[--iteration ITERATION]
This works for me! I appreciate your kind tip!
Hello, I was impressed by your work and tried to reproduce it with the code you've provided. However, every time I run the code, it fails with the runtime error I mentioned in the title.
Traceback (most recent call last):
File "train.py", line 213, in
training(lp.extract(args), op.extract(args), pp.extract(args), args.test_iterations, args.save_iterations, args.checkpoint_iterations, args.start_checkpoint)
File "train.py", line 87, in training
loss = (1.0 - opt.lambda_dssim) * Ll1 + opt.lambda_dssim * (1.0 - ssim(image, gt_image))
File "/home/seohoiki/Research/NeRF/gaussian-splatting/utils/loss_utils.py", line 38, in ssim
window = window.cuda(img1.get_device())
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Training progress: 0%| | 0/30000 [00:00<?, ?it/s]
I tried all the methods you've suggested in other issues, but none of them worked. My system & settings: RTX 4090, Ubuntu 22.04 LTS, exact environment from the given .yml file.
Strangely, my colleague, whose system has an RTX 3090 and Ubuntu 20.04, runs the code without any problem. (Apart from the GPU and OS, all the settings are exactly the same, including the CUDA SDK version.)
I hope I can get some solution for this problem!
Thank you.
===================================== Results with cuda-memcheck
========= CUDA-MEMCHECK
========= This tool is deprecated and will be removed in a future release of the CUDA toolkit
========= Please use the compute-sanitizer tool as a drop-in replacement
Optimizing
Output folder: ./output/54877260-0 [17/07 19:21:51]
Tensorboard not available: not logging progress [17/07 19:21:51]
Found transforms_train.json file, assuming Blender data set! [17/07 19:21:51]
Reading Training Transforms [17/07 19:21:51]
Reading Test Transforms [17/07 19:21:53]
Loading Training Cameras [17/07 19:21:56]
Loading Test Cameras [17/07 19:21:57]
Number of points at initialisation : 100000 [17/07 19:21:57]
Training progress: 0%| | 0/30000 [00:00<?, ?it/s]
Traceback (most recent call last):
File "train.py", line 213, in
training(lp.extract(args), op.extract(args), pp.extract(args), args.test_iterations, args.save_iterations, args.checkpoint_iterations, args.start_checkpoint)
File "train.py", line 87, in training
loss = (1.0 - opt.lambda_dssim) * Ll1 + opt.lambda_dssim * (1.0 - ssim(image, gt_image))
File "/home/seohoiki/Research/NeRF/gaussian-splatting/utils/loss_utils.py", line 38, in ssim
window = window.cuda(img1.get_device())
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Training progress: 0%| | 0/30000 [00:00<?, ?it/s]
========= ERROR SUMMARY: 0 errors
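The log above hints at two follow-ups: passing CUDA_LAUNCH_BLOCKING=1 so that the error is reported at the kernel launch that caused it, and using compute-sanitizer in place of the deprecated cuda-memcheck. A minimal sketch of setting both up from Python follows; the helper names are hypothetical, and the training arguments are left to the caller.

```python
import os


def blocking_env(base_env=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def sanitizer_command(train_args):
    """Prefix the training command with compute-sanitizer.

    compute-sanitizer is invoked as: compute-sanitizer <application> <application args>.
    """
    return ["compute-sanitizer", "python", "train.py", *train_args]


if __name__ == "__main__":
    env = blocking_env()
    print("CUDA_LAUNCH_BLOCKING =", env["CUDA_LAUNCH_BLOCKING"])
    # Pass these to subprocess.run(cmd, env=env) to launch the actual run.
    print(sanitizer_command(["--debug"]))
```

Both options slow training down considerably, so they are best used only to reproduce the failure.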