a11enL opened 2 weeks ago
I've encountered the same problem. There seems to be something in the gsplat 1.0 code that assumes images have been downscaled to at most 1920x1080. When using resolutions above that, it starts allocating very large amounts of shared GPU memory for some reason (dedicated GPU memory usage stays roughly the same).
The latest change in gsplat 1.0 changed the default value of cache_images to "gpu"; you can try adding:
--pipeline.datamanager.cache_images cpu
@liruilong940607 Could you elaborate why you want to change this default value?
That's a change made by @kerrj when he was investigating dataloader efficiency. @kerrj, do you know how much this speeds things up?
Thanks for all the comments. I tried this, but it's still not working.
ns-train splatfacto --machine.num-devices 1 --pipeline.datamanager.masks-on-gpu False --pipeline.datamanager.cache_images cpu --max-num-iterations 30001 --data data/bolzc/ nerfstudio-data --downscale-factor 1
...... pipeline=VanillaPipelineConfig(
    _target=<class 'nerfstudio.pipelines.base_pipeline.VanillaPipeline'>,
    datamanager=FullImageDatamanagerConfig(
        _target=<class 'nerfstudio.data.datamanagers.full_images_datamanager.FullImageDatamanager'>,
        data=PosixPath('data/bolzc'),
        masks_on_gpu=False,
        images_on_gpu=False,
        dataparser=NerfstudioDataParserConfig(
            _target=<class 'nerfstudio.data.dataparsers.nerfstudio_dataparser.Nerfstudio'>,
            data=PosixPath('.'),
            scale_factor=1.0,
            downscale_factor=1,
            scene_scale=1.0,
            orientation_method='up',
            center_method='poses',
            auto_scale_poses=True,
            eval_mode='fraction',
            train_split_fraction=0.9,
            eval_interval=8,
            depth_unit_scale_factor=0.001,
            mask_color=None,
            load_3D_points=True
        ),
        camera_res_scale_factor=1.0,
        eval_num_images_to_sample_from=-1,
        eval_num_times_to_repeat_images=-1,
        eval_image_indices=(0,),
        cache_images='cpu',
        cache_images_type='uint8',
        max_thread_workers=None,
        train_cameras_sampling_strategy='random',
        train_cameras_sampling_seed=42,
        fps_reset_every=100
    ),
......
torch.cuda.OutOfMemoryError: CUDA out of memory. ...... .......
How large is your CPU RAM and GPU VRAM? @a11enL
256G RAM and 22G VRAM for GPU 0
What concerns me is that nerfstudio 1.1.0 works but nerfstudio 1.1.2 doesn't with this dataset on the same GPU.
I have a guess -- rasterization
in gsplat 1.0 returns a dict of meta info that stores ALL intermediate results from the rasterization process. Because they are returned, the memory they allocate won't be freed until the forward()
call is done. This is the only thing I can think of that could lead to higher memory usage than before (on the gsplat side).
I think I should have some cycles to look into this next week.
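To illustrate the effect being described (a hypothetical sketch using numpy, not the real gsplat API): any function that returns its intermediate buffers in a meta dict keeps those buffers referenced, so the allocator cannot reclaim them until the caller drops the references.

```python
import numpy as np

def rasterize_with_meta(img):
    # Hypothetical rasterize-like function: it stores its intermediate
    # buffers in the returned meta dict, so as long as the caller holds
    # `meta`, every intermediate array stays allocated.
    depth = img * 2.0            # intermediate buffer 1
    alpha = depth + 1.0          # intermediate buffer 2
    out = alpha.sum()
    meta = {"depth": depth, "alpha": alpha}
    return out, meta

out, meta = rasterize_with_meta(np.ones((4, 4)))
# The intermediates can only be freed once their references are dropped,
# e.g. when the forward() call finishes or via an explicit del:
del meta["depth"]
```

With large images those intermediates are full-resolution buffers, which would explain a peak-memory increase over a version that freed them inside the call.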
nerfstudio 1.1.2, gsplat 1.0.0
ns-train splatfacto ... --downscale-factor 1...
the dataset with 1116 4K(3840x2160) images
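For context, a rough back-of-envelope estimate (assuming the images are cached on the GPU as uint8 RGB tensors, matching the cache_images_type='uint8' setting, and ignoring any padding or allocator overhead) shows why this particular dataset cannot fit in 22 GiB of VRAM under the new cache_images="gpu" default:

```python
# Rough estimate only: 1116 uncompressed uint8 RGB 4K frames on the GPU.
num_images = 1116
h, w, c = 2160, 3840, 3
total_bytes = num_images * h * w * c      # 1 byte per value for uint8
print(f"{total_bytes / 2**30:.1f} GiB")   # ~25.9 GiB, more than 22 GiB VRAM
```

The 690-image dataset at the same resolution comes out to roughly 16 GiB, which would fit, consistent with it training successfully.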
Traceback (most recent call last):
File "/home/ubuntu/.local/bin/ns-train", line 8, in <module>
sys.exit(entrypoint())
File "/home/ubuntu/nerfstudio-1.1.2/nerfstudio/scripts/train.py", line 262, in entrypoint
main(
File "/home/ubuntu/nerfstudio-1.1.2/nerfstudio/scripts/train.py", line 247, in main
launch(
File "/home/ubuntu/nerfstudio-1.1.2/nerfstudio/scripts/train.py", line 189, in launch
main_func(local_rank=0, world_size=world_size, config=config)
File "/home/ubuntu/nerfstudio-1.1.2/nerfstudio/scripts/train.py", line 100, in train_loop
trainer.train()
File "/home/ubuntu/nerfstudio-1.1.2/nerfstudio/engine/trainer.py", line 261, in train
loss, loss_dict, metrics_dict = self.train_iteration(step)
File "/home/ubuntu/nerfstudio-1.1.2/nerfstudio/utils/profiler.py", line 112, in inner
out = func(*args, **kwargs)
File "/home/ubuntu/nerfstudio-1.1.2/nerfstudio/engine/trainer.py", line 496, in train_iteration
_, loss_dict, metrics_dict = self.pipeline.get_train_loss_dict(step=step)
File "/home/ubuntu/nerfstudio-1.1.2/nerfstudio/utils/profiler.py", line 112, in inner
out = func(*args, **kwargs)
File "/home/ubuntu/nerfstudio-1.1.2/nerfstudio/pipelines/base_pipeline.py", line 302, in get_train_loss_dict
metrics_dict = self.model.get_metrics_dict(model_outputs, batch)
File "/home/ubuntu/nerfstudio-1.1.2/nerfstudio/models/splatfacto.py", line 801, in get_metrics_dict
gt_rgb = self.composite_with_background(self.get_gt_img(batch["image"]), outputs["background"])
File "/home/ubuntu/nerfstudio-1.1.2/nerfstudio/models/splatfacto.py", line 777, in get_gt_img
image = image.float() / 255.0
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 96.00 MiB. GPU 0 has a total capacity of 21.67 GiB of which 93.75 MiB is free. Process 1438657 has 21.51 GiB memory in use. Of the allocated memory 21.22 GiB is allocated by PyTorch, and 50.77 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
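The size of the failing allocation lines up with the line in the traceback: image.float() / 255.0 allocates a fresh float32 copy of the cached uint8 image on the GPU. A quick check (assuming one 3840x2160 RGB frame):

```python
# image.float() creates a new float32 tensor: 4 bytes per value instead of 1.
h, w, c = 2160, 3840, 3
float_copy_bytes = h * w * c * 4
print(f"{float_copy_bytes / 2**20:.1f} MiB")  # ~94.9 MiB; the CUDA caching
                                              # allocator rounds this up to 96 MiB
```

So the OOM happens on an otherwise ordinary per-frame allocation; the real question is what is already holding the other ~21 GiB at that point.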
But the same 1116-image 4K (3840x2160) dataset works with nerfstudio 1.1.0 / gsplat 0.1.11, and nerfstudio 1.1.2 / gsplat 1.0.0 works with a 690-image 4K (3840x2160) dataset.
I also tried the following with nerfstudio 1.1.2 / gsplat 1.0.0; neither worked:
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:32