hugoycj / Instant-angelo

Instant-angelo: Build high-fidelity Digital Twin within 20 Minutes!
MIT License

CUDA out of memory #18

Open kukumallou opened 8 months ago

kukumallou commented 8 months ago

First of all, thanks for the contribution; it is a very nice project. I ran into a CUDA out-of-memory error when running the dense reconstruction script (run_neuralangelo-colmap_dense.sh):

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.05 GiB (GPU 0; 23.64 GiB total capacity; 19.51 GiB already allocated; 911.19 MiB free; 20.04 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I have tried reducing the number of samples per ray from 1024 to 512 and then to 256, as suggested in the FAQ, but the error message stays the same. BTW, the sparse reconstruction script ran successfully and produced correct results. Any idea how to fix this? Thanks a lot.
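(Annotation: the samples-per-ray change can also be passed as a command-line override, in the same key=value form as the dataset.root_dir argument used by the launch command quoted below, instead of editing the YAML. A minimal sketch, assuming the key path model.num_samples_per_ray mentioned later in this thread:)

```bash
# Hedged sketch: key=value overrides after --train follow the same form as
# dataset.root_dir in the launch command used elsewhere in this thread;
# model.num_samples_per_ray is the key name cited in a later comment.
python launch.py --config configs/neuralangelo-colmap_dense.yaml --gpu 0 --train \
    dataset.root_dir=$INPUT_DIR \
    model.num_samples_per_ray=256
```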

hugoycj commented 8 months ago

Would you mind testing the latest version and replacing python launch.py --config configs/neuralangelo-colmap_dense-SH.yaml --gpu 0 --train dataset.root_dir=$INPUT_DIR with python launch.py --config configs/neuralangelo-colmap_dense.yaml --gpu 0 --train dataset.root_dir=$INPUT_DIR in the run_neuralangelo-colmap_dense.sh script?
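(For clarity, the suggested change is a one-line swap of the config file inside the script. A sketch of the edit, assuming the line appears in the script exactly as quoted above:)

```bash
# Before (the -SH variant of the config):
#   python launch.py --config configs/neuralangelo-colmap_dense-SH.yaml --gpu 0 --train dataset.root_dir=$INPUT_DIR
# After:
python launch.py --config configs/neuralangelo-colmap_dense.yaml --gpu 0 --train dataset.root_dir=$INPUT_DIR

# Or apply the swap in place with sed:
sed -i 's/neuralangelo-colmap_dense-SH.yaml/neuralangelo-colmap_dense.yaml/' run_neuralangelo-colmap_dense.sh
```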

kukumallou commented 8 months ago

I tried the latest version (Tue Nov 2), but the error is still there. The card I have is a 4090 with 24 GB of memory.

hugoycj commented 8 months ago

Sorry to bother you. Would you mind sharing at which step the out-of-memory error happens and what the resolution of your images is?

kukumallou commented 8 months ago

There are 140 images at a resolution of 1920x1440. Below is the output log of the script.

---sfm---
Sparse map datasets/cake exist. Aborting
---model_converter---
---colmap2mvsnet---
Image pair datasets/cake/dense/pair.txt exist. Aborting
Number of model parameters: 1162696
load third_party/Vis-MVSNet/pretrained_model/vis/-1
(1, 1, 528, 960): 100%|█████| 140/140 [02:39<00:00, 1.14s/it]
---mvsnet_fusion---
load data: 100%|███| 140/140 [00:01<00:00, 137.01it/s]
prob filter: 100%|███| 140/140 [00:00<00:00, 203.46it/s]
vis filter and med fusion: 100%|████| 140/140 [00:05<00:00, 27.54it/s]
vis filter and ave fusion: 100%|████| 140/140 [00:04<00:00, 31.20it/s]
vis filter: 100%|███| 140/140 [00:04<00:00, 30.62it/s]
back proj: 100%|████| 140/140 [00:00<00:00, 293.64it/s]
Construct combined PCD
Estimate normal
---angelo_recon---
Global seed set to 42
Using 16bit native Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Trainer(limit_train_batches=1.0) was configured so 100% of the batches per epoch will be used..
Global seed set to 42
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1

distributed_backend=nccl
All distributed processes registered. Starting with 1 processes

Loading dense prior from datasets/cake/dense/fused.ply
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name  | Type      | Params
--------------------------------------
0 | model | NeuSModel | 28.0 M
--------------------------------------
28.0 M    Trainable params
0         Non-trainable params
28.0 M    Total params
55.914    Total estimated model params size (MB)

Epoch 0: : 0it [00:00, ?it/s]
Update finite_difference_eps to 0.06801176275750971
Traceback (most recent call last):
  File "launch.py", line 125, in <module>
    main()
  File "launch.py", line 114, in main
    trainer.fit(system, datamodule=dm)
  File "/home/****/anaconda3/envs/objmodel/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 696, in fit
    self._call_and_handle_interrupt(
  File "/home/****/anaconda3/envs/objmodel/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 648, in _call_and_handle_interrupt
    return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
  File "/home/****/anaconda3/envs/objmodel/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
    return function(*args, **kwargs)
  File "/home/****/anaconda3/envs/objmodel/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 735, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/home/****/anaconda3/envs/objmodel/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1166, in _run
    results = self._run_stage()
  File "/home/****/anaconda3/envs/objmodel/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1252, in _run_stage
    return self._run_train()
  File "/home/****/anaconda3/envs/objmodel/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1283, in _run_train
    self.fit_loop.run()
  File "/home/****/anaconda3/envs/objmodel/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "/home/****/anaconda3/envs/objmodel/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 271, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/home/****/anaconda3/envs/objmodel/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "/home/****/anaconda3/envs/objmodel/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 194, in advance
    response = self.trainer._call_lightning_module_hook("on_train_batch_start", batch, batch_idx)
  File "/home/****/anaconda3/envs/objmodel/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1550, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/home/****/Dev/instant-angelo/systems/base.py", line 57, in on_train_batch_start
    update_module_step(self.model, self.current_epoch, self.global_step)
  File "/home/****/Dev/instant-angelo/systems/utils.py", line 351, in update_module_step
    m.update_step(epoch, global_step)
  File "/home/****/Dev/instant-angelo/models/neus.py", line 111, in update_step
    self.occupancy_grid_bg.every_n_step(step=global_step, occ_eval_fn=occ_eval_fn_bg, occ_thre=self.config.get('grid_prune_occ_thre_bg', 0.01))
  File "/home/****/anaconda3/envs/objmodel/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/****/anaconda3/envs/objmodel/lib/python3.8/site-packages/nerfacc/grid.py", line 271, in every_n_step
    self._update(
  File "/home/****/anaconda3/envs/objmodel/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/****/anaconda3/envs/objmodel/lib/python3.8/site-packages/nerfacc/grid.py", line 229, in _update
    occ = occ_eval_fn(x).squeeze(-1)
  File "/home/****/Dev/instant-angelo/models/neus.py", line 104, in occ_eval_fn_bg
    density, _ = self.geometry_bg(x)
  File "/home/****/anaconda3/envs/objmodel/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/****/Dev/instant-angelo/models/geometry.py", line 125, in forward
    out = self.encoding_with_network(points.view(-1, self.n_input_dims)).view(*points.shape[:-1], self.n_output_dims).float()
  File "/home/****/anaconda3/envs/objmodel/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/****/Dev/instant-angelo/models/network_utils.py", line 193, in forward
    return self.network(self.encoding(x))
  File "/home/****/anaconda3/envs/objmodel/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/****/Dev/instant-angelo/models/network_utils.py", line 76, in forward
    return self.encoding(x, *args) if not self.include_xyz else torch.cat([x * self.xyz_scale + self.xyz_offset, self.encoding(x, *args)], dim=-1)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.15 GiB (GPU 0; 23.64 GiB total capacity; 19.97 GiB already allocated; 970.44 MiB free; 20.47 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Epoch 0: : 0it [00:07, ?it/s]
start time: 2023-11-03 08:46:23
sfm time: 2023-11-03 08:46:23
model_converter finished: 2023-11-03 08:46:24
colmap2mvsnet finished: 2023-11-03 08:46:25
mvsnet_inference finished: 2023-11-03 08:49:06
mvsnet_fusion finished: 2023-11-03 08:49:33
angelo_recon finished: 2023-11-03 08:50:11
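(Annotation: in the traceback above the allocation fails inside occupancy_grid_bg.every_n_step, i.e. while nerfacc evaluates the background geometry on occupancy-grid sample points during on_train_batch_start, before any rays are rendered, which would explain why lowering the samples-per-ray count alone did not change the error. One low-cost thing to try is the allocator setting that the error message itself recommends; a sketch, where max_split_size_mb is a documented PYTORCH_CUDA_ALLOC_CONF option and the value 128 is only an assumed starting point:)

```bash
# Reduce allocator fragmentation as suggested by the OOM message; the exported
# variable is inherited by the python process started from the script.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
bash run_neuralangelo-colmap_dense.sh  # rerun with whatever arguments you normally pass
```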

lyupei commented 8 months ago

Hi, I have decreased model.num_samples_per_ray from 1024 to 128 but still ran into VRAM OOM issues. I'm using a 2070 with 8 GB of VRAM; can I run this project by adjusting other parameters?
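(Annotation: on an 8 GB card the usual levers besides samples per ray are the number of rays per training batch and the input image resolution. The sketch below reuses the thread's key=value launch syntax; the key names model.train_num_rays and dataset.img_downscale are assumptions borrowed from instant-nsr-pl-style configs, which this project resembles, and may not exist under those names in this repository.)

```bash
# Hedged sketch for a low-VRAM GPU; only model.num_samples_per_ray is confirmed
# in this thread, the other two key names are assumptions.
python launch.py --config configs/neuralangelo-colmap_dense.yaml --gpu 0 --train \
    dataset.root_dir=$INPUT_DIR \
    model.num_samples_per_ray=128 \
    model.train_num_rays=128 \
    dataset.img_downscale=2
```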