RuntimeError: CUDA error: an illegal memory access was encountered

mengxuyiGit commented 1 month ago

I encounter illegal memory access errors on all datasets, and at different places:

For MipNeRF360, it trained for 10 iterations and then reported the error at "visibility_filter" : radii > 0, see below:

Output folder: output/m360/garden [11/08 19:43:50]
Tensorboard not available: not logging progress [11/08 19:43:50]
Reading camera 185/185 [11/08 19:43:57]
Loading Training Cameras [11/08 19:43:57]
[ INFO ] Encountered quite large input images (>1.6K pixels width), rescaling to 1.6K.
 If this is not desired, please explicitly specify '--resolution/-r' as 1 [11/08 19:43:57]
Loading Test Cameras [11/08 19:46:29]
Number of points at initialisation :  138766 [11/08 19:46:29]
Training progress:   0%| | 10/30000 [00:00<28:32, 17.52it/s, Loss=0.52806, distort=0.0Traceback (most recent call last):
  File "train.py", line 277, in <module>
    training(lp.extract(args), op.extract(args), pp.extract(args), args.test_iterations, args.save_iterations, args.checkpoint_iterations, args.start_checkpoint)
  File "train.py", line 69, in training
    render_pkg = render(viewpoint_cam, gaussians, pipe, background)
  File "/home/xuyimeng/Repo/2d-gaussian-splatting/gaussian_renderer/__init__.py", line 112, in render
    "visibility_filter" : radii > 0,
RuntimeError: CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Training progress:   0%| | 10/30000 [00:01<1:00:41,  8.24it/s, Loss=0.52806, distort=0

And for nerf_synthetic/lego, it cannot even render at iteration 0:

Output folder: output/nerf_synthetic/lego [11/08 19:41:39]
Tensorboard not available: not logging progress [11/08 19:41:39]
Found transforms_train.json file, assuming Blender data set! [11/08 19:41:39]
Reading Training Transforms [11/08 19:41:39]
Reading Test Transforms [11/08 19:41:44]
Loading Training Cameras [11/08 19:41:54]
Loading Test Cameras [11/08 19:42:07]
Number of points at initialisation :  100000 [11/08 19:42:07]
Training progress:   0%|                                    | 0/30000 [00:00<?, ?it/s]Traceback (most recent call last):
  File "train.py", line 277, in <module>
    training(lp.extract(args), op.extract(args), pp.extract(args), args.test_iterations, args.save_iterations, args.checkpoint_iterations, args.start_checkpoint)
  File "train.py", line 69, in training
    render_pkg = render(viewpoint_cam, gaussians, pipe, background)
  File "/home/xuyimeng/Repo/2d-gaussian-splatting/gaussian_renderer/__init__.py", line 144, in render
    surf_normal = depth_to_normal(viewpoint_camera, surf_depth)
  File "/home/xuyimeng/Repo/2d-gaussian-splatting/utils/point_utils.py", line 31, in depth_to_normal
    points = depths_to_points(view, depth).reshape(*depth.shape[1:], 3)
  File "/home/xuyimeng/Repo/2d-gaussian-splatting/utils/point_utils.py", line 10, in depths_to_points
    c2w = (view.world_view_transform.T).inverse()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
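
The two tracebacks above stop at unrelated Python lines (radii > 0 and .inverse()), which fits the note in the log that CUDA kernel errors can be reported asynchronously at a later API call. A minimal sketch of following the log's own suggestion, CUDA_LAUNCH_BLOCKING=1, so the traceback points closer to the kernel that actually fails; the helper name run_blocking is hypothetical, and the train.py arguments are whatever you normally pass:

```shell
# Run any command with synchronous CUDA kernel launches.
# run_blocking is a hypothetical helper name, not part of the repository.
run_blocking() {
    # The assignment applies only to the command being launched.
    CUDA_LAUNCH_BLOCKING=1 "$@"
}

# Example (assumes you start training from the repository root):
# run_blocking python train.py <your usual arguments>
```

Training is slower in this mode, so it is only useful for locating the failing call.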

mengxuyiGit commented 1 month ago

The tricks that worked for 3DGS (such as setting TORCH_CUDA_ARCH_LIST and extra_compile_args={"nvcc": ["-Xcompiler", "-fno-gnu-unique", "-I" + os.path.join(os.path.dirname(os.path.abspath(__file__)), "third_party/glm/")]}) no longer work here.

hbb1 commented 1 month ago

Hi, which commit did you use? In my latest commit, I cannot reproduce the error.

wangyuanbiubiubiu commented 4 weeks ago

I ran into the same problem as well.

hbb1 commented 4 weeks ago

Did you resolve it? @mengxuyiGit

FYI: Someone asked me about the same issue recently, and we eventually found the bug: the glm version was mismatched (it may have been updated accidentally).

Mohith1012 commented 3 weeks ago

I'm facing the same issue too. Can anyone please help me with this? Thanks in advance!

hbb1 commented 3 weeks ago

@Mohith1012 Have you checked that your diff-surfel-rasterizer (61cb85a) and glm (5c46b9c) versions match? After that, you can rebuild the engine:

cd submodules/diff-surfel-rasterization
rm -rf build
pip install .

Mohith1012 commented 3 weeks ago

I see that the version of diff_surfel_rasterizer is 0.0.1, but I don't know how to check the glm version. (Also, I don't have any clue what glm means.)

hbb1 commented 3 weeks ago

You can use git log to see the latest commit id. Make sure the ids match the versions I provided above, then rebuild the engine.
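
A minimal sketch of that check, comparing the checked-out commit against the ids quoted above (61cb85a and 5c46b9c). The function name check_commit is hypothetical, and the third_party/glm location inside the rasterizer submodule is an assumption; adjust the path if glm lives elsewhere in your checkout:

```shell
# Succeeds when the commit checked out in directory $1 starts with the
# expected id $2. Assumes $1 is its own git checkout or submodule.
check_commit() {
    actual="$(git -C "$1" rev-parse HEAD)" || return 2
    case "$actual" in
        "$2"*) echo "OK: $1 is at $2" ;;
        *)     echo "MISMATCH: $1 is at $actual, expected $2"; return 1 ;;
    esac
}

# Example, run from the repository root (the glm path is an assumption):
# check_commit submodules/diff-surfel-rasterization 61cb85a
# check_commit submodules/diff-surfel-rasterization/third_party/glm 5c46b9c
```

If either check reports a mismatch, check out the expected commit in that directory and rebuild with the commands given earlier in the thread.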

Mohith1012 commented 3 weeks ago

Oh, I see the mistake: the glm version does not match. Let me update it. Thanks a lot, man!!

mengxuyiGit commented 2 weeks ago

I switched to a different PyTorch and CUDA version and it finally works: torch==2.1.3 and cuda=12.1

mengxuyiGit commented 2 weeks ago

Thanks for replying!

hbb1 / 2d-gaussian-splatting

RuntimeError: CUDA error: an illegal memory access was encountered #139