nerfstudio-project / gsplat

CUDA accelerated rasterization of gaussian splatting
https://docs.gsplat.studio/
Apache License 2.0

OOM issue and blank validation results using mcmc strategy #487

Open · LaFeuilleMorte opened this issue 1 week ago

LaFeuilleMorte commented 1 week ago

Hi, I've tried the MCMC strategy on a smaller subset (about 542 images, 150,000 initial points), and it works quite well. But when I use the whole dataset (about 973 images, 360,000 initial points), it raises a CUDA OOM error after about 6,400 steps. When I tried lowering cap_max to 500_000 per GPU, the validation results were blank images. I tested the default strategy and it works fine.
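(Side note, as a minimal sketch assuming gsplat's Python strategy API rather than the exact simple_trainer.py CLI flags: the cap referred to here is the cap_max field of MCMCStrategy, and the log further down topping out at 1,000,000 GSs is consistent with that strategy's default cap. The value below is illustrative.)

```python
from gsplat import MCMCStrategy

# Illustrative sketch: the MCMC strategy relocates and adds Gaussians up to a
# hard cap. 500_000 mirrors the per-GPU value tried above; it is not a
# recommendation, and the example trainer builds its strategy from CLI config.
strategy = MCMCStrategy(cap_max=500_000)
```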

My command:

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 python examples/simple_trainer.py mcmc \
    --data_dir {My_DATASET_DIR} \
    --data_factor 1 \
    --result_dir ./results/{MY_OUTPUT_DIR} \
    --max_steps 50_000 \
    --eval_steps 7_000 30_000 40_000 50_000 \
    --save_steps 7_000 30_000 40_000 50_000 \
    --use_bilateral_grid
```

My log:

```
2024-11-12 14:31:14.833 Step 6200: Relocated 934401 GSs.
2024-11-12 14:31:14.833 Step 6200: Added 46996 GSs. Now having 986928 GSs.
2024-11-12 14:31:14.833 Step 6300: Relocated 984383 GSs.
2024-11-12 14:31:14.833 Step 6300: Added 13072 GSs. Now having 1000000 GSs.
2024-11-12 14:31:14.833 Step 6400: Relocated 995335 GSs.
2024-11-12 14:31:14.833 Step 6400: Added 0 GSs. Now having 1000000 GSs.
2024-11-12 14:31:18.561 Traceback (most recent call last):
2024-11-12 14:31:18.561   File "/aistudio/workspace/aigc/wangqihang013/aigc3d/repos/neural_rendering/high_quality/gsplat/examples/simple_trainer.py", line 1076, in <module>
2024-11-12 14:31:18.575     cli(main, cfg, verbose=True)
2024-11-12 14:31:18.575   File "/aistudio/workspace/system-default/envs/gsplat/lib/python3.10/site-packages/gsplat/distributed.py", line 344, in cli
2024-11-12 14:31:18.579     process_context.join()
2024-11-12 14:31:18.579   File "/aistudio/workspace/system-default/envs/gsplat/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 163, in join
2024-11-12 14:31:18.580     raise ProcessRaisedException(msg, error_index, failed_process.pid)
2024-11-12 14:31:18.580 torch.multiprocessing.spawn.ProcessRaisedException:
2024-11-12 14:31:18.580
2024-11-12 14:31:18.580 -- Process 0 terminated with the following error:
2024-11-12 14:31:18.580 Traceback (most recent call last):
2024-11-12 14:31:18.580   File "/aistudio/workspace/system-default/envs/gsplat/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 74, in _wrap
2024-11-12 14:31:18.580     fn(i, *args)
2024-11-12 14:31:18.580   File "/aistudio/workspace/system-default/envs/gsplat/lib/python3.10/site-packages/gsplat/distributed.py", line 295, in _distributed_worker
2024-11-12 14:31:18.580     fn(local_rank, world_rank, world_size, args)
2024-11-12 14:31:18.580   File "/aistudio/workspace/aigc/wangqihang013/aigc3d/repos/neural_rendering/high_quality/gsplat/examples/simple_trainer.py", line 1021, in main
2024-11-12 14:31:18.580     runner.train()
2024-11-12 14:31:18.580   File "/aistudio/workspace/aigc/wangqihang013/aigc3d/repos/neural_rendering/high_quality/gsplat/examples/simple_trainer.py", line 589, in train
2024-11-12 14:31:18.580     renders, alphas, info = self.rasterize_splats(
2024-11-12 14:31:18.580   File "/aistudio/workspace/aigc/wangqihang013/aigc3d/repos/neural_rendering/high_quality/gsplat/examples/simple_trainer.py", line 469, in rasterize_splats
2024-11-12 14:31:18.580     render_colors, render_alphas, info = rasterization(
2024-11-12 14:31:18.580   File "/aistudio/workspace/system-default/envs/gsplat/lib/python3.10/site-packages/gsplat/rendering.py", line 497, in rasterization
2024-11-12 14:31:18.580     tiles_per_gauss, isect_ids, flatten_ids = isect_tiles(
2024-11-12 14:31:18.580   File "/aistudio/workspace/system-default/envs/gsplat/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
2024-11-12 14:31:18.580     return func(*args, **kwargs)
2024-11-12 14:31:18.580   File "/aistudio/workspace/system-default/envs/gsplat/lib/python3.10/site-packages/gsplat/cuda/_wrapper.py", line 382, in isect_tiles
2024-11-12 14:31:18.580     tiles_per_gauss, isect_ids, flatten_ids = _make_lazy_cuda_func("isect_tiles")(
2024-11-12 14:31:18.580   File "/aistudio/workspace/system-default/envs/gsplat/lib/python3.10/site-packages/gsplat/cuda/_wrapper.py", line 14, in call_cuda
2024-11-12 14:31:18.580     return getattr(_C, name)(*args, **kwargs)
2024-11-12 14:31:18.580 torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.24 GiB. GPU 0 has a total capacty of 39.42 GiB of which 3.14 GiB is free. Process 126936 has 36.29 GiB memory in use. Of the allocated memory 30.47 GiB is allocated by PyTorch, and 2.55 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
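(The last line of the OOM message points at allocator fragmentation. As a hedged side note, fragmentation can be mitigated by setting PYTORCH_CUDA_ALLOC_CONF before CUDA is initialized, as sketched below; this does not reduce total memory use, so it would not address the root cause identified in the next comment. The split size is illustrative.)

```python
import os

# Must run before the first CUDA allocation, i.e. before launching the
# trainer or touching torch.cuda. 128 MiB is an illustrative value.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
```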

LaFeuilleMorte commented 1 week ago

Alright, I've figured out why this happens. My dataset is about twice as large as the previous one, and the point cloud contains a lot of noise. As a result, the MCMC strategy under the current config produces Gaussians with very large scales, which prevents them from fitting the scene. And according to this issue:

https://github.com/nerfstudio-project/gsplat/issues/464#issue-2608844974

Gaussians with large scales cause the above error in isect_tiles: each oversized Gaussian overlaps many tiles, so the tile-intersection buffers (tiles_per_gauss, isect_ids, flatten_ids) grow until they no longer fit in GPU memory.

So I used a larger scale regularization weight (scale_reg=0.05), and the problem seems to have gone away. But I'm not sure whether I've chosen the optimal scale_reg coefficient.
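(For readers applying the same fix outside the example trainer, here is a minimal sketch of the kind of penalty that scale_reg weights, assuming the per-axis scales are stored in log space as in the gsplat example trainer; the exact expression in simple_trainer.py may differ.)

```python
import torch

def scale_reg_loss(log_scales: torch.Tensor, scale_reg: float = 0.05) -> torch.Tensor:
    """L1-style penalty on the exponentiated Gaussian scales.

    A larger scale_reg discourages the oversized Gaussians that blow up the
    tile-intersection buffers in isect_tiles. 0.05 is the value reported
    above, not a tuned recommendation.
    """
    return scale_reg * torch.exp(log_scales).abs().mean()

# Hypothetical usage inside a training step (names are illustrative):
# loss = photometric_loss + scale_reg_loss(splats["scales"], scale_reg=0.05)
```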