nerfstudio-project / nerfstudio

A collaboration friendly studio for NeRFs
https://docs.nerf.studio
Apache License 2.0

CUDA out of memory [nerfstudio 1.1.2, gsplat 1.0.0] #3214

Open a11enL opened 2 weeks ago

a11enL commented 2 weeks ago

nerfstudio 1.1.2, gsplat 1.0.0

ns-train splatfacto ... --downscale-factor 1...

The dataset contains 1116 4K (3840x2160) images.

Traceback (most recent call last):
  File "/home/ubuntu/.local/bin/ns-train", line 8, in <module>
    sys.exit(entrypoint())
  File "/home/ubuntu/nerfstudio-1.1.2/nerfstudio/scripts/train.py", line 262, in entrypoint
    main(
  File "/home/ubuntu/nerfstudio-1.1.2/nerfstudio/scripts/train.py", line 247, in main
    launch(
  File "/home/ubuntu/nerfstudio-1.1.2/nerfstudio/scripts/train.py", line 189, in launch
    main_func(local_rank=0, world_size=world_size, config=config)
  File "/home/ubuntu/nerfstudio-1.1.2/nerfstudio/scripts/train.py", line 100, in train_loop
    trainer.train()
  File "/home/ubuntu/nerfstudio-1.1.2/nerfstudio/engine/trainer.py", line 261, in train
    loss, loss_dict, metrics_dict = self.train_iteration(step)
  File "/home/ubuntu/nerfstudio-1.1.2/nerfstudio/utils/profiler.py", line 112, in inner
    out = func(*args, **kwargs)
  File "/home/ubuntu/nerfstudio-1.1.2/nerfstudio/engine/trainer.py", line 496, in train_iteration
    _, loss_dict, metrics_dict = self.pipeline.get_train_loss_dict(step=step)
  File "/home/ubuntu/nerfstudio-1.1.2/nerfstudio/utils/profiler.py", line 112, in inner
    out = func(*args, **kwargs)
  File "/home/ubuntu/nerfstudio-1.1.2/nerfstudio/pipelines/base_pipeline.py", line 302, in get_train_loss_dict
    metrics_dict = self.model.get_metrics_dict(model_outputs, batch)
  File "/home/ubuntu/nerfstudio-1.1.2/nerfstudio/models/splatfacto.py", line 801, in get_metrics_dict
    gt_rgb = self.composite_with_background(self.get_gt_img(batch["image"]), outputs["background"])
  File "/home/ubuntu/nerfstudio-1.1.2/nerfstudio/models/splatfacto.py", line 777, in get_gt_img
    image = image.float() / 255.0
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 96.00 MiB. GPU 0 has a total capacity of 21.67 GiB of which 93.75 MiB is free. Process 1438657 has 21.51 GiB memory in use. Of the allocated memory 21.22 GiB is allocated by PyTorch, and 50.77 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

However, it works with nerfstudio 1.1.0/gsplat 0.1.11 on this same 1116-image 4K (3840x2160) dataset, and nerfstudio 1.1.2/gsplat 1.0.0 works on a 690-image 4K (3840x2160) dataset.

I also tried the following for nerfstudio 1.1.2/gsplat 1.0.0, but neither helped: PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True and PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:32.
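(For reference: these allocator options only change how PyTorch's caching allocator deals with fragmentation; they cannot reclaim memory that is genuinely allocated, so they are not expected to help when the real footprint exceeds VRAM. A minimal sketch of what the environment variable does, assuming it is set before the first CUDA allocation; the tensor below is only for illustration:)

```python
import os

# The caching allocator parses PYTORCH_CUDA_ALLOC_CONF when CUDA memory is
# first allocated, so it must be set before any tensor lands on the GPU
# (prefixing the ns-train command with the variable achieves the same thing).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch

x = torch.zeros((1024, 1024), device="cuda")  # first CUDA allocation
print(torch.cuda.memory_allocated() / 1024**2, "MiB allocated")
```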

MrDieAlot commented 2 weeks ago

I've encountered the same problem. There seems to be something in the gsplat 1.0 code that assumes images have been downscaled to at most 1920x1080. When using resolutions above that, it starts allocating very large amounts of shared GPU memory for some reason (dedicated GPU memory usage seems to stay roughly the same).

jb-ye commented 2 weeks ago

The latest change in gsplat 1.0 changes the default value of cache_images to "gpu"; you can try adding:

--pipeline.datamanager.cache_images cpu

@liruilong940607 Could you elaborate on why you want to change this default value?
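A back-of-the-envelope estimate shows why caching on the GPU matters at this scale (assuming each frame is cached as an uncompressed uint8 RGB tensor, using the numbers reported in this issue):

```python
# Rough VRAM footprint of caching every training image on the GPU as uint8 RGB.
bytes_per_image = 3840 * 2160 * 3      # one uncompressed 4K frame, ~23.7 MiB
gib = 1024 ** 3

print(1116 * bytes_per_image / gib)    # ~25.9 GiB -> exceeds the 21.67 GiB card
print(690 * bytes_per_image / gib)     # ~16.0 GiB -> fits, matching the 690-image run
```

That alone would explain why the 690-image dataset trains while the 1116-image one does not, independent of anything the rasterizer itself allocates.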

liruilong940607 commented 2 weeks ago

That's a change made by @kerrj when he was investigating dataloader efficiency. @kerrj, do you know how much this speeds things up?

a11enL commented 2 weeks ago

> The latest change in gsplat 1.0 changes the default value of cache_images to "gpu"; you can try adding:
>
> --pipeline.datamanager.cache_images cpu
>
> @liruilong940607 Could you elaborate on why you want to change this default value?

I appreciate all the comments. I tried this, and it is still not working:

ns-train splatfacto --machine.num-devices 1 --pipeline.datamanager.masks-on-gpu False --pipeline.datamanager.cache_images cpu --max-num-iterations 30001 --data data/bolzc/ nerfstudio-data --downscale-factor 1

......
pipeline=VanillaPipelineConfig(
    _target=<class 'nerfstudio.pipelines.base_pipeline.VanillaPipeline'>,
    datamanager=FullImageDatamanagerConfig(
        _target=<class 'nerfstudio.data.datamanagers.full_images_datamanager.FullImageDatamanager'>,
        data=PosixPath('data/bolzc'),
        masks_on_gpu=False,
        images_on_gpu=False,
        dataparser=NerfstudioDataParserConfig(
            _target=<class 'nerfstudio.data.dataparsers.nerfstudio_dataparser.Nerfstudio'>,
            data=PosixPath('.'),
            scale_factor=1.0,
            downscale_factor=1,
            scene_scale=1.0,
            orientation_method='up',
            center_method='poses',
            auto_scale_poses=True,
            eval_mode='fraction',
            train_split_fraction=0.9,
            eval_interval=8,
            depth_unit_scale_factor=0.001,
            mask_color=None,
            load_3D_points=True
        ),
        camera_res_scale_factor=1.0,
        eval_num_images_to_sample_from=-1,
        eval_num_times_to_repeat_images=-1,
        eval_image_indices=(0,),
        cache_images='cpu',
        cache_images_type='uint8',
        max_thread_workers=None,
        train_cameras_sampling_strategy='random',
        train_cameras_sampling_seed=42,
        fps_reset_every=100
    ),
......

torch.cuda.OutOfMemoryError: CUDA out of memory. ...... .......

jb-ye commented 2 weeks ago

How large is your CPU RAM and GPU VRAM? @a11enL

a11enL commented 2 weeks ago

> How large is your CPU RAM and GPU VRAM? @a11enL

256G RAM and 22G VRAM for GPU 0

What I care about is that nerfstudio 1.1.0 works but nerfstudio 1.1.2 does not for this dataset on the same GPU.

liruilong940607 commented 2 weeks ago

I have a guess: rasterization in gsplat 1.0 returns a dict of meta info that stores ALL intermediate results from the rasterization process. Because they are returned, the memory they occupy won't be freed until the forward() call is done. This is the only thing I can think of that could lead to higher memory usage than before (on the gsplat side).
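To illustrate this guess (a hypothetical stand-in, not gsplat's actual rasterization API): returning large intermediate buffers in a meta dict keeps them referenced, so the caching allocator cannot reuse that memory for the rest of the forward pass.

```python
import torch

def rasterize_like_sketch(num_intersections: int = 50_000_000):
    """Hypothetical stand-in for a rasterizer that also returns its intermediates."""
    # Large per-intersection scratch buffer produced during rasterization.
    scratch = torch.empty(num_intersections, 4, device="cuda")
    image = scratch[:8, :3].clone()  # small actual render output
    # Returning the scratch buffer in a meta dict keeps it referenced, so its
    # memory stays allocated for as long as the caller holds on to `meta`.
    meta = {"intersections": scratch}
    return image, meta

image, meta = rasterize_like_sketch()
print(torch.cuda.memory_allocated() / 1024**2, "MiB held")  # high while meta is alive
del meta  # dropping the reference lets PyTorch reuse the memory
```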

I think I should have some cycles to look into this next week.