[Open] wusar opened this issue 6 months ago
Hi, thanks for your interest. It seems to be a problem with the VRAM. How much VRAM does your GPU have? I recommend having more than 12GB.
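If you want to confirm how much usable VRAM PyTorch actually sees (the figure can differ from the card's spec, especially under WSL), a quick check with standard torch.cuda calls looks like this (just a sketch, not code from this repo):

    import torch

    props = torch.cuda.get_device_properties(0)  # first visible GPU
    print(props.name, f"{props.total_memory / 1024**3:.1f} GiB total")
    print(f"{torch.cuda.memory_allocated(0) / 1024**3:.2f} GiB currently allocated by PyTorch")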
I was using an RTX 4060 Ti GPU with 16GB of VRAM, and I didn't encounter any VRAM limitations during D-NeRF model training. I'm wondering if there might be an issue with how I handled the D-NeRF dataset. This is my dataset file organization:
(deformable_gaussian_env) wusar@DESKTOP-PA3GPBB:~/research/3D_gen/data$ ls
D-NeRF NeRF-DS aleks-teapot
(deformable_gaussian_env) wusar@DESKTOP-PA3GPBB:~/research/3D_gen/data$ ls D-NeRF/
bouncingballs hellwarrior hook jumpingjacks lego mutant standup trex
(deformable_gaussian_env) wusar@DESKTOP-PA3GPBB:~/research/3D_gen/data$ ls D-NeRF/lego/
points3d.ply test train transforms_test.json transforms_train.json transforms_val.json val
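For reference, a quick way to verify that all the entries the loader expects are present (names taken from the listing above; hypothetical helper, not part of the repo):

    from pathlib import Path

    scene = Path("D-NeRF/lego")
    expected = ["points3d.ply", "transforms_train.json", "transforms_test.json",
                "transforms_val.json", "train", "test", "val"]
    missing = [name for name in expected if not (scene / name).exists()]
    print("missing entries:", missing or "none")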
But I encountered VRAM limitations when using the HyperNeRF aleks-teapot dataset.
(deformable_gaussian_env) wusar@DESKTOP-PA3GPBB:~/research/3D_gen/Deformable-3D-Gaussians$ python train.py -s ../data/aleks-teapot/ -m output/test_dnerf --eval --is_blender
Optimizing output/test_dnerf
Output folder: output/test_dnerf [12/03 10:53:57]
Tensorboard not available: not logging progress [12/03 10:53:57]
Found dataset.json file, assuming Nerfies data set! [12/03 10:53:59]
Reading Nerfies Info [12/03 10:53:59]
[12/03 10:54:02]
Generating point cloud from nerfies... [12/03 10:54:02]
Loading Training Cameras [12/03 10:54:02]
Loading Test Cameras [12/03 10:54:04]
Number of points at initialisation : 11835 [12/03 10:54:04]
Training progress: 0%| | 0/40000 [00:00<?, ?it/s]Traceback (most recent call last):
File "train.py", line 274, in <module>
training(lp.extract(args), op.extract(args), pp.extract(args), args.test_iterations, args.save_iterations)
File "train.py", line 100, in training
render_pkg_re = render(viewpoint_cam, gaussians, pipe, background, d_xyz, d_rotation, d_scaling, dataset.is_6dof)
File "/home/wusar/research/3D_gen/Deformable-3D-Gaussians/gaussian_renderer/__init__.py", line 115, in render
cov3D_precomp=cov3D_precomp)
File "/home/wusar/miniconda3/envs/deformable_gaussian_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/wusar/miniconda3/envs/deformable_gaussian_env/lib/python3.7/site-packages/diff_gaussian_rasterization/__init__.py", line 219, in forward
raster_settings,
File "/home/wusar/miniconda3/envs/deformable_gaussian_env/lib/python3.7/site-packages/diff_gaussian_rasterization/__init__.py", line 41, in rasterize_gaussians
raster_settings,
File "/home/wusar/miniconda3/envs/deformable_gaussian_env/lib/python3.7/site-packages/diff_gaussian_rasterization/__init__.py", line 92, in forward
num_rendered, color, depth, radii, geomBuffer, binningBuffer, imgBuffer = _C.rasterize_gaussians(*args)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 35.50 GiB (GPU 0; 16.00 GiB total capacity; 1.44 GiB already allocated; 12.66 GiB free; 1.63 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Training progress: 0%|
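(Side note: the allocator hint at the end of the OOM message can be applied by setting PYTORCH_CUDA_ALLOC_CONF before launching; a sketch with an arbitrary 128 MiB split size is below. It only mitigates fragmentation, though, so it can't rescue a single 35.50 GiB allocation like the one above.)

    PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 python train.py -s ../data/aleks-teapot/ -m output/test_dnerf --eval --is_blender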
I observed that the radii values in the model reach tensor([1197030752, 32758, 92010192, ..., -1957826993, 0, 92024608]) around the 1000th epoch. These values are far outside any plausible radius range, and some are negative, which points to an int32 integer overflow: the computed values exceed what the int32 data type can represent.
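A minimal guard that would catch this early, assuming render() returns the radii tensor under the "radii" key as in vanilla 3D Gaussian Splatting (hypothetical check, not code from this repo):

    # Hypothetical sanity check after the render call in train.py.
    radii = render_pkg_re["radii"]
    if (radii < 0).any() or radii.max().item() > 100000:
        # Real screen-space radii are small non-negative integers; values like
        # 1197030752 or -1957826993 mean the buffer has been corrupted.
        raise RuntimeError(
            f"corrupted radii: min={radii.min().item()}, max={radii.max().item()}")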
I think it's normal to encounter OOM issues with HyperNeRF, because the camera poses in HyperNeRF are quite inaccurate. This is also why we use NeRF-DS rather than all of the HyperNeRF scenes. You can refer to this issue.
Hi! Thank you for your nice work! I'm encountering some bugs while training on the D-NeRF lego datasets.
On my first attempt, I received a "RuntimeError: numel: integer multiplication overflow" error. When I retried the training, a different error occurred: "RuntimeError: CUDA error: an illegal memory access was encountered."
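For anyone debugging the same thing: CUDA errors are reported asynchronously, so the traceback can point at the wrong call. Running with synchronous kernel launches (a generic PyTorch/CUDA debugging switch, not specific to this repo) usually pins down the failing kernel:

    CUDA_LAUNCH_BLOCKING=1 python train.py -s ../data/D-NeRF/lego/ -m output/test_dnerf --eval --is_blender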
Environment details:
Operating system: WSL
CUDA version: 11.6
Python version: 3.7
PyTorch version: 1.13.1+cu116
torchvision version: 0.14.1+cu116
The same error occurred when I trained on another dataset, hook: