nerfstudio-project / gsplat

CUDA accelerated rasterization of gaussian splatting
https://docs.gsplat.studio/
Apache License 2.0

OOM issue and blank validation results using mcmc strategy #487

Open · LaFeuilleMorte opened this issue 1 week ago

LaFeuilleMorte commented 1 week ago

Hi, I've tried the MCMC strategy on a smaller subset (about 542 images, 150,000 initial points), and it works quite well. But when I use the whole dataset (about 973 images, 360,000 initial points), it raises a CUDA OOM error after about 6,400 steps. When I tried lowering cap_max to 500_000 per GPU, the validation results were blank images. I tested the default strategy and it works fine.
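(Side note, as a minimal sketch assuming gsplat's Python strategy API rather than the exact simple_trainer.py CLI flags: the cap referred to here is the cap_max field of MCMCStrategy, and the log further down topping out at 1,000,000 GSs is consistent with that strategy's default cap. The value below is illustrative.)

```python
from gsplat import MCMCStrategy

# Illustrative sketch: the MCMC strategy relocates and adds Gaussians up to a
# hard cap. 500_000 mirrors the per-GPU value tried above; it is not a
# recommendation, and the example trainer builds its strategy from CLI config.
strategy = MCMCStrategy(cap_max=500_000)
```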

My command:

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 python examples/simple_trainer.py mcmc \
    --data_dir {My_DATASET_DIR} \
    --data_factor 1 \
    --result_dir ./results/{MY_OUTPUT_DIR} \
    --max_steps 50_000 \
    --eval_steps 7_000 30_000 40_000 50_000 \
    --save_steps 7_000 30_000 40_000 50_000 \
    --use_bilateral_grid
```

My log:

```
2024-11-12 14:31:14.833 Step 6200: Relocated 934401 GSs.
2024-11-12 14:31:14.833 Step 6200: Added 46996 GSs. Now having 986928 GSs.
2024-11-12 14:31:14.833 Step 6300: Relocated 984383 GSs.
2024-11-12 14:31:14.833 Step 6300: Added 13072 GSs. Now having 1000000 GSs.
2024-11-12 14:31:14.833 Step 6400: Relocated 995335 GSs.
2024-11-12 14:31:14.833 Step 6400: Added 0 GSs. Now having 1000000 GSs.
2024-11-12 14:31:18.561 Traceback (most recent call last):
2024-11-12 14:31:18.561   File "/aistudio/workspace/aigc/wangqihang013/aigc3d/repos/neural_rendering/high_quality/gsplat/examples/simple_trainer.py", line 1076, in <module>
2024-11-12 14:31:18.575     cli(main, cfg, verbose=True)
2024-11-12 14:31:18.575   File "/aistudio/workspace/system-default/envs/gsplat/lib/python3.10/site-packages/gsplat/distributed.py", line 344, in cli
2024-11-12 14:31:18.579     process_context.join()
2024-11-12 14:31:18.579   File "/aistudio/workspace/system-default/envs/gsplat/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 163, in join
2024-11-12 14:31:18.580     raise ProcessRaisedException(msg, error_index, failed_process.pid)
2024-11-12 14:31:18.580 torch.multiprocessing.spawn.ProcessRaisedException:
2024-11-12 14:31:18.580
2024-11-12 14:31:18.580 -- Process 0 terminated with the following error:
2024-11-12 14:31:18.580 Traceback (most recent call last):
2024-11-12 14:31:18.580   File "/aistudio/workspace/system-default/envs/gsplat/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 74, in _wrap
2024-11-12 14:31:18.580     fn(i, *args)
2024-11-12 14:31:18.580   File "/aistudio/workspace/system-default/envs/gsplat/lib/python3.10/site-packages/gsplat/distributed.py", line 295, in _distributed_worker
2024-11-12 14:31:18.580     fn(local_rank, world_rank, world_size, args)
2024-11-12 14:31:18.580   File "/aistudio/workspace/aigc/wangqihang013/aigc3d/repos/neural_rendering/high_quality/gsplat/examples/simple_trainer.py", line 1021, in main
2024-11-12 14:31:18.580     runner.train()
2024-11-12 14:31:18.580   File "/aistudio/workspace/aigc/wangqihang013/aigc3d/repos/neural_rendering/high_quality/gsplat/examples/simple_trainer.py", line 589, in train
2024-11-12 14:31:18.580     renders, alphas, info = self.rasterize_splats(
2024-11-12 14:31:18.580   File "/aistudio/workspace/aigc/wangqihang013/aigc3d/repos/neural_rendering/high_quality/gsplat/examples/simple_trainer.py", line 469, in rasterize_splats
2024-11-12 14:31:18.580     render_colors, render_alphas, info = rasterization(
2024-11-12 14:31:18.580   File "/aistudio/workspace/system-default/envs/gsplat/lib/python3.10/site-packages/gsplat/rendering.py", line 497, in rasterization
2024-11-12 14:31:18.580     tiles_per_gauss, isect_ids, flatten_ids = isect_tiles(
2024-11-12 14:31:18.580   File "/aistudio/workspace/system-default/envs/gsplat/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
2024-11-12 14:31:18.580     return func(*args, **kwargs)
2024-11-12 14:31:18.580   File "/aistudio/workspace/system-default/envs/gsplat/lib/python3.10/site-packages/gsplat/cuda/_wrapper.py", line 382, in isect_tiles
2024-11-12 14:31:18.580     tiles_per_gauss, isect_ids, flatten_ids = _make_lazy_cuda_func("isect_tiles")(
2024-11-12 14:31:18.580   File "/aistudio/workspace/system-default/envs/gsplat/lib/python3.10/site-packages/gsplat/cuda/_wrapper.py", line 14, in call_cuda
2024-11-12 14:31:18.580     return getattr(_C, name)(*args, **kwargs)
2024-11-12 14:31:18.580 torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.24 GiB. GPU 0 has a total capacty of 39.42 GiB of which 3.14 GiB is free. Process 126936 has 36.29 GiB memory in use. Of the allocated memory 30.47 GiB is allocated by PyTorch, and 2.55 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
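(The last line of the OOM message points at allocator fragmentation. As a hedged side note, fragmentation can be mitigated by setting PYTORCH_CUDA_ALLOC_CONF before CUDA is initialized, as sketched below; this does not reduce total memory use, so it would not address the root cause identified in the next comment. The split size is illustrative.)

```python
import os

# Must run before the first CUDA allocation, i.e. before launching the
# trainer or touching torch.cuda. 128 MiB is an illustrative value.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
```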

LaFeuilleMorte commented 1 week ago

Alright, I've figured out why this happens. My dataset is about twice as large as the previous one, and the point cloud contains a lot of noise. As a result, the MCMC strategy under the current config produces Gaussians with very large scales, which prevents them from fitting the scene. And according to this issue:

https://github.com/nerfstudio-project/gsplat/issues/464#issue-2608844974

Gaussians with large scales cause the above error in isect_tiles: each oversized Gaussian overlaps many tiles, so the tile-intersection buffers (tiles_per_gauss, isect_ids, flatten_ids) grow until they no longer fit in GPU memory.

So I used a larger scale regularization weight (scale_reg=0.05), and the problem seems to have gone away. But I'm not sure whether I've chosen the optimal scale_reg coefficient.
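(For readers applying the same fix outside the example trainer, here is a minimal sketch of the kind of penalty that scale_reg weights, assuming the per-axis scales are stored in log space as in the gsplat example trainer; the exact expression in simple_trainer.py may differ.)

```python
import torch

def scale_reg_loss(log_scales: torch.Tensor, scale_reg: float = 0.05) -> torch.Tensor:
    """L1-style penalty on the exponentiated Gaussian scales.

    A larger scale_reg discourages the oversized Gaussians that blow up the
    tile-intersection buffers in isect_tiles. 0.05 is the value reported
    above, not a tuned recommendation.
    """
    return scale_reg * torch.exp(log_scales).abs().mean()

# Hypothetical usage inside a training step (names are illustrative):
# loss = photometric_loss + scale_reg_loss(splats["scales"], scale_reg=0.05)
```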