nerfstudio-project / nerfstudio

A collaboration friendly studio for NeRFs
https://docs.nerf.studio
Apache License 2.0

Free memory available, but DefaultCPUAllocator: not enough memory #2970

Closed: vindia9 closed this issue 6 months ago

vindia9 commented 6 months ago

**Describe the bug**

I am trying to train a dataset of 1061 images on a Windows 10 machine with 128 GB of RAM (about 120 GB free) and an NVIDIA RTX 3090. This is my training command:

`ns-train nerfacto-huge --max-num-iterations 50000 --vis wandb --data D:\nerf\v-tree --output-dir D:\nerf\v-tree`

When training starts, memory usage climbs steadily during the data-loading phase to about 86 GB, at which point there is still free memory left. However, as soon as loading completes, training immediately throws an error and stops:

RuntimeError: [enforce fail at alloc_cpu.cpp:80] data. DefaultCPUAllocator: not enough memory: you tried to allocate 93361766400 bytes.

<details>
<summary>Full Log</summary>

``` (nerfstudio) D:\nerf\> ns-train nerfacto-huge --max-num-iterations 50000 --vis wandb --data D:\nerf\v-tree --output-dir D:\nerf\v-tree [23:25:48] Using --data alias for --data.pipeline.datamanager.data train.py:230 ──────── Config ──────── TrainerConfig( _target=, output_dir=WindowsPath('D:/nerf/v-tree'), method_name='nerfacto', experiment_name=None, project_name='nerfstudio-project', timestamp='2024-02-28_232548', machine=MachineConfig(seed=42, num_devices=1, num_machines=1, machine_rank=0, dist_url='auto', device_type='cuda'), logging=LoggingConfig( relative_log_dir=WindowsPath('.'), steps_per_log=10, max_buffer_size=20, local_writer=LocalWriterConfig( _target=, enable=True, stats_to_track=( , , , , , ), max_log_size=10 ), profiler='basic' ), viewer=ViewerConfig( relative_log_filename='viewer_log_filename.txt', websocket_port=None, websocket_port_default=7007, websocket_host='0.0.0.0', num_rays_per_chunk=32768, max_num_display_images=512, quit_on_train_completion=False, image_format='jpeg', jpeg_quality=75, make_share_url=False, camera_frustum_scale=0.1, default_composite_depth=True ), pipeline=VanillaPipelineConfig( _target=, datamanager=ParallelDataManagerConfig( _target=, data=WindowsPath('D:/nerf/v-tree'), masks_on_gpu=False, images_on_gpu=False, dataparser=NerfstudioDataParserConfig( _target=, data=WindowsPath('.'), scale_factor=1.0, downscale_factor=None, scene_scale=1.0, orientation_method='up', center_method='poses', auto_scale_poses=True, eval_mode='fraction', train_split_fraction=0.9, eval_interval=8, depth_unit_scale_factor=0.001, mask_color=None, load_3D_points=False ), train_num_rays_per_batch=16384, train_num_images_to_sample_from=-1, train_num_times_to_repeat_images=-1, eval_num_rays_per_batch=4096, eval_num_images_to_sample_from=-1, eval_num_times_to_repeat_images=-1, eval_image_indices=(0,), collate_fn=, camera_res_scale_factor=1.0, patch_size=1, camera_optimizer=None, pixel_sampler=PixelSamplerConfig( _target=, num_rays_per_batch=4096, keep_full_image=False, is_equirectangular=False, ignore_mask=False, fisheye_crop_radius=None, rejection_sample_mask=True, max_num_iterations=100 ), num_processes=1, queue_size=2, max_thread_workers=None ), model=NerfactoModelConfig( _target=, enable_collider=True, collider_params={'near_plane': 2.0, 'far_plane': 6.0}, loss_coefficients={'rgb_loss_coarse': 1.0, 'rgb_loss_fine': 1.0}, eval_num_rays_per_chunk=32768, prompt=None, near_plane=0.05, far_plane=1000.0, background_color='last_sample', hidden_dim=256, hidden_dim_color=256, hidden_dim_transient=64, num_levels=16, base_res=16, max_res=8192, log2_hashmap_size=21, features_per_level=2, num_proposal_samples_per_ray=(512, 512), num_nerf_samples_per_ray=64, proposal_update_every=5, proposal_warmup=5000, num_proposal_iterations=2, use_same_proposal_network=False, proposal_net_args_list=[ {'hidden_dim': 16, 'log2_hashmap_size': 17, 'num_levels': 5, 'max_res': 512, 'use_linear': False}, {'hidden_dim': 16, 'log2_hashmap_size': 17, 'num_levels': 7, 'max_res': 2048, 'use_linear': False} ], proposal_initial_sampler='piecewise', interlevel_loss_mult=1.0, distortion_loss_mult=0.002, orientation_loss_mult=0.0001, pred_normal_loss_mult=0.001, use_proposal_weight_anneal=True, use_appearance_embedding=True, use_average_appearance_embedding=True, proposal_weights_anneal_slope=10.0, proposal_weights_anneal_max_num_iters=5000, use_single_jitter=True, predict_normals=False, disable_scene_contraction=False, use_gradient_scaling=False, implementation='tcnn', appearance_embed_dim=32, 
average_init_density=0.01, camera_optimizer=CameraOptimizerConfig( _target=, mode='SO3xR3', trans_l2_penalty=0.01, rot_l2_penalty=0.001, optimizer=None, scheduler=None ) ) ), optimizers={ 'proposal_networks': { 'optimizer': RAdamOptimizerConfig( _target=, lr=0.01, eps=1e-15, max_norm=None, weight_decay=0 ), 'scheduler': None }, 'fields': { 'optimizer': RAdamOptimizerConfig( _target=, lr=0.01, eps=1e-15, max_norm=None, weight_decay=0 ), 'scheduler': ExponentialDecaySchedulerConfig( _target=, lr_pre_warmup=1e-08, lr_final=0.0001, warmup_steps=0, max_steps=50000, ramp='cosine' ) }, 'camera_opt': { 'optimizer': AdamOptimizerConfig( _target=, lr=0.001, eps=1e-15, max_norm=None, weight_decay=0 ), 'scheduler': ExponentialDecaySchedulerConfig( _target=, lr_pre_warmup=1e-08, lr_final=0.0001, warmup_steps=0, max_steps=5000, ramp='cosine' ) } }, vis='wandb', data=WindowsPath('D:/nerf/v-tree'), prompt=None, relative_model_dir=WindowsPath('nerfstudio_models'), load_scheduler=True, steps_per_save=2000, steps_per_eval_batch=500, steps_per_eval_image=500, steps_per_eval_all_images=25000, max_num_iterations=50000, mixed_precision=True, use_grad_scaler=False, save_only_latest_checkpoint=True, load_dir=None, load_step=None, load_config=None, load_checkpoint=None, log_gradients=False, gradient_accumulation_steps={} ) ─────────────────────────────────────── [23:25:49] Saving config to: D:\nerf\v-tree\v-tree\nerfacto\2024-02-28_232548\config.yml experiment_config.py:136 Saving checkpoints to: D:\nerf\v-tree\v-tree\nerfacto\2024-02-28_232548\nerfstudio_models trainer.py:136 Auto image downscale factor of 1 nerfstudio_dataparser.py:484 Loading data batch ━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:47 Traceback (most recent call last): File "C:\Users\Vindia\miniconda3\envs\nerfstudio\lib\runpy.py", line 194, in _run_module_as_main return _run_code(code, main_globals, None, File "C:\Users\Vindia\miniconda3\envs\nerfstudio\lib\runpy.py", line 87, in _run_code exec(code, run_globals) File "C:\Users\Vindia\miniconda3\envs\nerfstudio\Scripts\ns-train.exe\__main__.py", line 7, in File "C:\Users\Vindia\miniconda3\envs\nerfstudio\lib\site-packages\nerfstudio\scripts\train.py", line 262, in entrypoint main( File "C:\Users\Vindia\miniconda3\envs\nerfstudio\lib\site-packages\nerfstudio\scripts\train.py", line 247, in main launch( File "C:\Users\Vindia\miniconda3\envs\nerfstudio\lib\site-packages\nerfstudio\scripts\train.py", line 189, in launch main_func(local_rank=0, world_size=world_size, config=config) File "C:\Users\Vindia\miniconda3\envs\nerfstudio\lib\site-packages\nerfstudio\scripts\train.py", line 99, in train_loop trainer.setup() File "C:\Users\Vindia\miniconda3\envs\nerfstudio\lib\site-packages\nerfstudio\engine\trainer.py", line 149, in setup self.pipeline = self.config.pipeline.setup( File "C:\Users\Vindia\miniconda3\envs\nerfstudio\lib\site-packages\nerfstudio\configs\base_config.py", line 54, in setup return self._target(self, **kwargs) File "C:\Users\Vindia\miniconda3\envs\nerfstudio\lib\site-packages\nerfstudio\pipelines\base_pipeline.py", line 254, in __init__ self.datamanager: DataManager = config.datamanager.setup( File "C:\Users\Vindia\miniconda3\envs\nerfstudio\lib\site-packages\nerfstudio\configs\base_config.py", line 54, in setup return self._target(self, **kwargs) File "C:\Users\Vindia\miniconda3\envs\nerfstudio\lib\site-packages\nerfstudio\data\datamanagers\parallel_datamanager.py", line 178, in __init__ super().__init__() File 
"C:\Users\Vindia\miniconda3\envs\nerfstudio\lib\site-packages\nerfstudio\data\datamanagers\base_datamanager.py", line 181, in __init__ self.setup_train() File "C:\Users\Vindia\miniconda3\envs\nerfstudio\lib\site-packages\nerfstudio\data\datamanagers\parallel_datamanager.py", line 244, in setup_train self.data_procs = [ File "C:\Users\Vindia\miniconda3\envs\nerfstudio\lib\site-packages\nerfstudio\data\datamanagers\parallel_datamanager.py", line 245, in DataProcessor( File "C:\Users\Vindia\miniconda3\envs\nerfstudio\lib\site-packages\nerfstudio\data\datamanagers\parallel_datamanager.py", line 96, in __init__ self.cache_images() File "C:\Users\Vindia\miniconda3\envs\nerfstudio\lib\site-packages\nerfstudio\data\datamanagers\parallel_datamanager.py", line 128, in cache_images self.img_data = self.config.collate_fn(batch_list) File "C:\Users\Vindia\miniconda3\envs\nerfstudio\lib\site-packages\nerfstudio\data\utils\nerfstudio_collate.py", line 122, in nerfstudio_collate {key: nerfstudio_collate([d[key] for d in batch], extra_mappings=extra_mappings) for key in elem} File "C:\Users\Vindia\miniconda3\envs\nerfstudio\lib\site-packages\nerfstudio\data\utils\nerfstudio_collate.py", line 122, in {key: nerfstudio_collate([d[key] for d in batch], extra_mappings=extra_mappings) for key in elem} File "C:\Users\Vindia\miniconda3\envs\nerfstudio\lib\site-packages\nerfstudio\data\utils\nerfstudio_collate.py", line 103, in nerfstudio_collate return torch.stack(batch, 0, out=out) RuntimeError: [enforce fail at alloc_cpu.cpp:80] data. DefaultCPUAllocator: not enough memory: you tried to allocate 93361766400 bytes. ``` <\details>
vindia9 commented 6 months ago

Good news: after upgrading the machine to 384 GB of RAM, all the issues were resolved. Memory usage eventually stabilized at around 281 GB.
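
For readers who hit the same error and cannot simply add RAM: the config dump above already exposes fields that control how much gets cached, namely `train_num_images_to_sample_from` / `train_num_times_to_repeat_images` on `ParallelDataManagerConfig` and `downscale_factor` on `NerfstudioDataParserConfig`. Whether they fully avoid this code path is not verified here, but a rough size estimate (same resolution and dtype assumptions as the sketch above) suggests how much they can shrink the cached batch:

```python
# Rough effect of the caching knobs visible in the config dump above
# (same float32 RGB @ 3840x2160 assumption as the earlier sketch):
#   - train_num_images_to_sample_from / train_num_times_to_repeat_images
#     on ParallelDataManagerConfig: cache only a subset of images at a time
#   - downscale_factor on NerfstudioDataParserConfig: shrink each image
def cached_batch_gb(num_images, height=2160, width=3840, downscale=1):
    h, w = height // downscale, width // downscale
    return num_images * h * w * 3 * 4 / 1e9   # float32 RGB

print(cached_batch_gb(955))               # ~95 GB: all train images, full res
print(cached_batch_gb(200))               # ~20 GB: sample 200 images at a time
print(cached_batch_gb(200, downscale=2))  # ~5 GB:  200 images at half res
```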