RuntimeError: CUDA error: an illegal memory access was encountered when training

wusar commented 6 months ago

Hi! Thank you for your nice work! I'm running into errors while training on the D-NeRF lego dataset.

On my first attempt, I received a "RuntimeError: numel: integer multiplication overflow" error. When I retried the training, a different error occurred: "RuntimeError: CUDA error: an illegal memory access was encountered."

Environment details:
Operating system: WSL
CUDA version: 11.6
Python version: 3.7
PyTorch version: 1.13.1+cu116
torchvision: 0.14.1+cu116

(deformable_gaussian_env) wusar@DESKTOP-PA3GPBB:~/research/3D_gen/Deformable-3D-Gaussians$ python train.py -s ~/research/3D_gen/data/D-NeRF/lego/ -m output/test_output --eval --is_blender
Optimizing output/test_output
Output folder: output/test_output [11/03 16:11:27]
Tensorboard not available: not logging progress [11/03 16:11:27]
Found transforms_train.json file, assuming Blender data set! [11/03 16:11:29]
Reading Training Transforms [11/03 16:11:29]
Reading Test Transforms [11/03 16:11:31]
Loading Training Cameras [11/03 16:11:32]
Loading Test Cameras [11/03 16:11:33]
Number of points at initialisation :  100000 [11/03 16:11:33]
Training progress:   2%|██                                                                              | 1000/40000 [00:15<09:51, 65.96it/s, Loss=0.2878043]Traceback (most recent call last):
  File "train.py", line 274, in <module>
    training(lp.extract(args), op.extract(args), pp.extract(args), args.test_iterations, args.save_iterations)
  File "train.py", line 126, in training
    gaussians.max_radii2D[visibility_filter] = torch.max(gaussians.max_radii2D[visibility_filter],
RuntimeError: numel: integer multiplication overflow
Training progress:   2%|██                                                                              | 1000/40000 [00:15<09:55, 65.53it/s, Loss=0.2878043]
(deformable_gaussian_env) wusar@DESKTOP-PA3GPBB:~/research/3D_gen/Deformable-3D-Gaussians$ python train.py -s ~/research/3D_gen/data/D-NeRF/lego/ -m output/test_output --eval --is_blender
Optimizing output/test_output
Output folder: output/test_output [11/03 16:12:02]
Tensorboard not available: not logging progress [11/03 16:12:02]
Found transforms_train.json file, assuming Blender data set! [11/03 16:12:04]
Reading Training Transforms [11/03 16:12:04]
Reading Test Transforms [11/03 16:12:06]
Loading Training Cameras [11/03 16:12:07]
Loading Test Cameras [11/03 16:12:09]
Number of points at initialisation :  100000 [11/03 16:12:09]
Training progress:   2%|██                                                                              | 1000/40000 [00:14<10:10, 63.84it/s, Loss=0.2879999]Traceback (most recent call last):
  File "train.py", line 274, in <module>
    training(lp.extract(args), op.extract(args), pp.extract(args), args.test_iterations, args.save_iterations)
  File "train.py", line 126, in training
    gaussians.max_radii2D[visibility_filter] = torch.max(gaussians.max_radii2D[visibility_filter],
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Training progress:   2%|██                                                                              | 1000/40000 [00:14<09:31, 68.28it/s, Loss=0.2879999]

The same error occurs when I train on another dataset, hook:

(deformable_gaussian_env) wusar@DESKTOP-PA3GPBB:~/research/3D_gen/Deformable-3D-Gaussians$ python train.py -s ~/research/3D_gen/data/D-NeRF/hook/ -m output/test_output --eval --is_blender
Optimizing output/test_output
Output folder: output/test_output [11/03 16:26:35]
Tensorboard not available: not logging progress [11/03 16:26:35]
Found transforms_train.json file, assuming Blender data set! [11/03 16:26:36]
Reading Training Transforms [11/03 16:26:36]
Reading Test Transforms [11/03 16:26:40]
Generating random point cloud (100000)... [11/03 16:26:40]
Loading Training Cameras [11/03 16:26:40]
Loading Test Cameras [11/03 16:26:43]
Number of points at initialisation :  100000 [11/03 16:26:43]
Training progress:   3%|██                                                                              | 1010/40000 [00:15<09:58, 65.19it/s, Loss=0.0619599]Traceback (most recent call last):
  File "train.py", line 274, in <module>
    training(lp.extract(args), op.extract(args), pp.extract(args), args.test_iterations, args.save_iterations)
  File "train.py", line 126, in training
    gaussians.max_radii2D[visibility_filter] = torch.max(gaussians.max_radii2D[visibility_filter],
RuntimeError: numel: integer multiplication overflow
Training progress:   3%|██                                                                              | 1010/40000 [00:15<10:05, 64.44it/s, Loss=0.0619599]
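
The second traceback warns that CUDA kernel errors can be reported asynchronously, so line 126 of train.py may not be where the fault actually occurs. Below is a minimal sketch of forcing synchronous kernel launches from Python. Placing this at the very top of the training script is an assumption for illustration, not something the repository does. The variable has to be set before CUDA is initialised, so it goes before `import torch`.

```python
import os

# Must be set before CUDA is initialised, so do it before `import torch`.
# Assumption: this sits at the very top of the training script.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # import torch only after the variable is set

print("CUDA_LAUNCH_BLOCKING =", os.environ["CUDA_LAUNCH_BLOCKING"])
```

With synchronous launches the traceback should point at the kernel that actually faults, at the cost of slower training.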

ingra14m commented 6 months ago

Hi, thanks for your interest. It seems to be a problem with the VRAM. How much VRAM does your GPU have? I recommend having more than 12GB.

wusar commented 6 months ago

I was using an RTX 4060 Ti GPU with 16 GB of VRAM, and I did not hit any VRAM limit while training on the D-NeRF scenes. I'm wondering if there might be an issue with how I set up the D-NeRF dataset. This is how my dataset files are organized:

(deformable_gaussian_env) wusar@DESKTOP-PA3GPBB:~/research/3D_gen/data$ ls
D-NeRF  NeRF-DS  aleks-teapot
(deformable_gaussian_env) wusar@DESKTOP-PA3GPBB:~/research/3D_gen/data$ ls D-NeRF/
bouncingballs  hellwarrior  hook  jumpingjacks  lego  mutant  standup  trex
(deformable_gaussian_env) wusar@DESKTOP-PA3GPBB:~/research/3D_gen/data$ ls D-NeRF/lego/
points3d.ply  test  train  transforms_test.json  transforms_train.json  transforms_val.json  val

However, I did run into VRAM limitations with the Hyper-NeRF/Aleksis teapot dataset:

(deformable_gaussian_env) wusar@DESKTOP-PA3GPBB:~/research/3D_gen/Deformable-3D-Gaussians$ python train.py -s ../data/aleks-teapot/ -m output/test_dnerf --eval --is_blender
Optimizing output/test_dnerf
Output folder: output/test_dnerf [12/03 10:53:57]
Tensorboard not available: not logging progress [12/03 10:53:57]
Found dataset.json file, assuming Nerfies data set! [12/03 10:53:59]
Reading Nerfies Info [12/03 10:53:59]
 [12/03 10:54:02]
Generating point cloud from nerfies... [12/03 10:54:02]
Loading Training Cameras [12/03 10:54:02]
Loading Test Cameras [12/03 10:54:04]
Number of points at initialisation :  11835 [12/03 10:54:04]
Training progress:   0%|                                                                                                                 | 0/40000 [00:00<?, ?it/s]Traceback (most recent call last):
  File "train.py", line 274, in <module>
    training(lp.extract(args), op.extract(args), pp.extract(args), args.test_iterations, args.save_iterations)
  File "train.py", line 100, in training
    render_pkg_re = render(viewpoint_cam, gaussians, pipe, background, d_xyz, d_rotation, d_scaling, dataset.is_6dof)
  File "/home/wusar/research/3D_gen/Deformable-3D-Gaussians/gaussian_renderer/__init__.py", line 115, in render
    cov3D_precomp=cov3D_precomp)
  File "/home/wusar/miniconda3/envs/deformable_gaussian_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wusar/miniconda3/envs/deformable_gaussian_env/lib/python3.7/site-packages/diff_gaussian_rasterization/__init__.py", line 219, in forward
    raster_settings, 
  File "/home/wusar/miniconda3/envs/deformable_gaussian_env/lib/python3.7/site-packages/diff_gaussian_rasterization/__init__.py", line 41, in rasterize_gaussians
    raster_settings,
  File "/home/wusar/miniconda3/envs/deformable_gaussian_env/lib/python3.7/site-packages/diff_gaussian_rasterization/__init__.py", line 92, in forward
    num_rendered, color, depth, radii, geomBuffer, binningBuffer, imgBuffer = _C.rasterize_gaussians(*args)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 35.50 GiB (GPU 0; 16.00 GiB total capacity; 1.44 GiB already allocated; 12.66 GiB free; 1.63 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Training progress:   0%|
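
The figures in the out-of-memory message are worth comparing directly: the single allocation request is larger than the whole card, so allocator settings such as max_split_size_mb are unlikely to help here. Below is a minimal sketch that pulls the numbers out of the message text. `parse_oom` is a hypothetical helper, and the patterns assume the message format shown in this traceback.

```python
import re

# Error text copied from the traceback above.
msg = ("CUDA out of memory. Tried to allocate 35.50 GiB "
       "(GPU 0; 16.00 GiB total capacity; 1.44 GiB already allocated; "
       "12.66 GiB free; 1.63 GiB reserved in total by PyTorch)")


def parse_oom(message):
    """Extract the GiB figures from an OOM message in the format shown above."""
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) GiB",
        "total": r"([\d.]+) GiB total capacity",
        "free": r"([\d.]+) GiB free",
    }
    return {key: float(re.search(pat, message).group(1))
            for key, pat in patterns.items()}


info = parse_oom(msg)
ratio = info["requested"] / info["total"]
print(f"requested {info['requested']} GiB, {ratio:.2f}x the card's total capacity")
```

A request more than twice the size of the card suggests the rasterizer is sizing its buffers from bad inputs rather than the scene simply being too large.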

wusar commented 6 months ago

I observed that the radii values in the model become tensor([1197030752, 32758, 92010192, ..., -1957826993, 0, 92024608]) around iteration 1000. Values this large (and negative) are not plausible radii, and I believe they are what triggers the integer overflow.
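
One way to catch this earlier is to validate the radii before they are used for indexing. Below is a minimal sketch in plain Python using the values quoted above. `implausible_radii` and `MAX_RADIUS_PX` are hypothetical names, and the pixel bound is an assumption chosen only for illustration. It runs two separate checks: whether each value fits in int32, and whether it is a plausible pixel radius.

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

# Values quoted in the comment above (the elided middle is omitted).
radii = [1197030752, 32758, 92010192, -1957826993, 0, 92024608]


def implausible_radii(values, max_radius):
    """Return indices whose value is negative or larger than max_radius pixels."""
    return [i for i, r in enumerate(values) if r < 0 or r > max_radius]


# Hypothetical bound: assume no splat should span more than a few thousand pixels.
MAX_RADIUS_PX = 4096

fits_int32 = all(INT32_MIN <= r <= INT32_MAX for r in radii)
bad = implausible_radii(radii, MAX_RADIUS_PX)
print(f"all fit in int32: {fits_int32}; implausible indices: {bad}")
```

In the training loop, the same kind of test could be applied to the radii returned by the renderer before line 126 of train.py runs, so that training stops with a clear message instead of an illegal memory access.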

ingra14m commented 6 months ago

I think it's normal to encounter OOM issues with HyperNeRF because the camera poses in HyperNeRF are quite inaccurate. This is also why we use NeRF-DS instead of all HyperNeRF scenes. You can refer to this issue.

ingra14m / Deformable-3D-Gaussians

RuntimeError: CUDA error: an illegal memory access was encountered when training #41