jkulhanek / tetra-nerf

Official implementation of the Tetra-NeRF paper: NeRF represented as a triangulation of the input point cloud.
https://jkulhanek.com/tetra-nerf
MIT License

RuntimeError: CUDA call (cudaMalloc ...... failed with error: 'out of memory #12

Closed · conby closed this issue 1 year ago

conby commented 1 year ago

Hello, we encountered this OOM error on an NVIDIA GPU with 6 GB of memory. Is there any way to run with low CUDA memory? Setting `PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:32` did not seem to help here.

Any comments will be appreciated.

Log:

```
PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:32 ns-train tetra-nerf-original --pipeline.model.tetrahedra-path data/blender/chair/pointnerf-0.5.th blender-data --data data/blender/chair
JAX not installed, skipping Mip-NeRF SSIM
──────────────────────────────────────────────────────── Config ────────────────────────────────────────────────────────
TrainerConfig(
    _target=<class 'nerfstudio.engine.trainer.Trainer'>,
    output_dir=PosixPath('outputs'),
    method_name='tetra-nerf-original',
    experiment_name=None,
    project_name='nerfstudio-project',
    timestamp='2023-06-03_022954',
    machine=MachineConfig(seed=42, num_gpus=1, num_machines=1, machine_rank=0, dist_url='auto'),
    logging=LoggingConfig(
        relative_log_dir=PosixPath('.'),
        steps_per_log=10,
        max_buffer_size=20,
        local_writer=LocalWriterConfig(
            _target=<class 'nerfstudio.utils.writer.LocalWriter'>,
            enable=True,
            stats_to_track=(
                <EventName.ITER_TRAIN_TIME: 'Train Iter (time)'>,
                <EventName.TRAIN_RAYS_PER_SEC: 'Train Rays / Sec'>,
                <EventName.CURR_TEST_PSNR: 'Test PSNR'>,
                <EventName.VIS_RAYS_PER_SEC: 'Vis Rays / Sec'>,
                <EventName.TEST_RAYS_PER_SEC: 'Test Rays / Sec'>,
                <EventName.ETA: 'ETA (time)'>
            ),
            max_log_size=10
        ),
        profiler='basic'
    ),
    viewer=ViewerConfig(
        relative_log_filename='viewer_log_filename.txt',
        websocket_port=None,
        websocket_port_default=7007,
        websocket_host='0.0.0.0',
        num_rays_per_chunk=32768,
        max_num_display_images=512,
        quit_on_train_completion=False,
        image_format='jpeg',
        jpeg_quality=90
    ),
    pipeline=VanillaPipelineConfig(
        _target=<class 'tetranerf.nerfstudio.pipeline.TetrahedraNerfPipeline'>,
        datamanager=VanillaDataManagerConfig(
            _target=<class 'nerfstudio.data.datamanagers.base_datamanager.VanillaDataManager'>,
            data=None,
            camera_optimizer=CameraOptimizerConfig(
                _target=<class 'nerfstudio.cameras.camera_optimizers.CameraOptimizer'>,
                mode='off',
                position_noise_std=0.0,
                orientation_noise_std=0.0,
                optimizer=AdamOptimizerConfig(_target=<class 'torch.optim.adam.Adam'>, lr=0.0006, eps=1e-15, max_norm=None, weight_decay=0),
                scheduler=ExponentialDecaySchedulerConfig(_target=<class 'nerfstudio.engine.schedulers.ExponentialDecayScheduler'>, lr_pre_warmup=1e-08, lr_final=None, warmup_steps=0, max_steps=10000, ramp='cosine'),
                param_group='camera_opt'
            ),
            dataparser=BlenderDataParserConfig(_target=<class 'nerfstudio.data.dataparsers.blender_dataparser.Blender'>, data=PosixPath('data/blender/chair'), scale_factor=1.0, alpha_color='white'),
            train_num_rays_per_batch=4096,
            train_num_images_to_sample_from=-1,
            train_num_times_to_repeat_images=-1,
            eval_num_rays_per_batch=4096,
            eval_num_images_to_sample_from=-1,
            eval_num_times_to_repeat_images=-1,
            eval_image_indices=(0,),
            collate_fn=<function nerfstudio_collate at 0x7f3fe99b27a0>,
            camera_res_scale_factor=1.0,
            patch_size=1
        ),
        model=TetrahedraNerfConfig(
            _target=<class 'tetranerf.nerfstudio.model.TetrahedraNerf'>,
            enable_collider=True,
            collider_params={'near_plane': 2.0, 'far_plane': 6.0},
            loss_coefficients={'rgb_loss_coarse': 1.0, 'rgb_loss_fine': 1.0},
            eval_num_rays_per_chunk=4096,
            tetrahedra_path=PosixPath('data/blender/chair/pointnerf-0.5.th'),
            num_tetrahedra_vertices=174525,
            num_tetrahedra_cells=1087011,
            max_intersected_triangles=512,
            num_samples=256,
            num_fine_samples=256,
            use_biased_sampler=False,
            field_dim=64,
            num_color_layers=1,
            num_density_layers=3,
            hidden_size=128,
            input_fourier_frequencies=0,
            initialize_colors=True,
            use_gradient_scaling=False
        )
    ),
    optimizers={
        'fields': {
            'optimizer': RAdamOptimizerConfig(_target=<class 'torch.optim.radam.RAdam'>, lr=0.001, eps=1e-08, max_norm=None, weight_decay=0),
            'scheduler': ExponentialDecaySchedulerConfig(_target=<class 'nerfstudio.engine.schedulers.ExponentialDecayScheduler'>, lr_pre_warmup=1e-08, lr_final=0.0001, warmup_steps=0, max_steps=300000, ramp='cosine')
        }
    },
    vis='wandb',
    data=None,
    relative_model_dir=PosixPath('nerfstudio_models'),
    steps_per_save=25000,
    steps_per_eval_batch=1000,
    steps_per_eval_image=2000,
    steps_per_eval_all_images=50000,
    max_num_iterations=300000,
    mixed_precision=False,
    use_grad_scaler=False,
    save_only_latest_checkpoint=True,
    load_dir=None,
    load_step=None,
    load_config=None,
    load_checkpoint=None,
    log_gradients=False
)
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
[02:29:54] Saving config to: outputs/unnamed/tetra-nerf-original/2023-06-03_022954/config.yml          experiment_config.py:128
           Saving checkpoints to: outputs/unnamed/tetra-nerf-original/2023-06-03_022954/nerfstudio_models        trainer.py:136
Setting up training dataset...
Caching all 100 images.
Setting up evaluation dataset...
Caching all 100 images.
No Nerfstudio checkpoint to load, so training from scratch.
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: 3
wandb: You chose "Don't visualize my results"
wandb: Tracking run with wandb version 0.15.3
wandb: W&B syncing is set to offline in this directory.
wandb: Run wandb online or set WANDB_MODE=online to enable cloud syncing.
logging events to: outputs/unnamed/tetra-nerf-original/2023-06-03_022954
Tetrahedra initialized from file data/blender/chair/pointnerf-0.5.th:
    Num points: 174525
    Num tetrahedra: 1087011
[ 4][ KNOBS]: All knobs on default.

[ 4][ DISK CACHE]: Opened database: "/var/tmp/OptixCache_ubuntu/optix7cache.db"
[ 4][ DISK CACHE]: Cache data size: "30.2 KiB"
[ 4][ DISKCACHE]: Cache hit for key: ptx-14549-keyefbf26c79f6345943421c125989da67a-sm_75-rtc0-drv525.105.17
[ 4][COMPILE FEEDBACK]:
[ 4][COMPILE FEEDBACK]: Info: Pipeline has 1 module(s), 4 entry function(s), 1 trace call(s), 0 continuation callable call(s), 0 direct callable call(s), 59 basic block(s) in entry functions, 543 instruction(s) in entry functions, 8 non-entry function(s), 63 basic block(s) in non-entry functions, 811 instruction(s) in non-entry functions, no debug information

[02:30:03] Printing max of 10 lines. Set flag --logging.local-writer.max-log-size=0 to disable line wrapping.          writer.py:408

Step (% Done)       Train Iter (time)       ETA (time)
0 (0.00%)           1 s, 217.954 ms         4 d, 5 h, 29 m, 46 s
Printing profiling stats, from longest to shortest duration in seconds
Trainer.train_iteration: 0.4076
VanillaPipeline.get_train_loss_dict: 0.2806
Trainer.eval_iteration: 0.0000
Traceback (most recent call last):
  File "/home/ubuntu/.local/bin/ns-train", line 8, in <module>
    sys.exit(entrypoint())
  File "/home/ubuntu/.local/lib/python3.10/site-packages/nerfstudio/scripts/train.py", line 260, in entrypoint
    main(
  File "/home/ubuntu/.local/lib/python3.10/site-packages/nerfstudio/scripts/train.py", line 246, in main
    launch(
  File "/home/ubuntu/.local/lib/python3.10/site-packages/nerfstudio/scripts/train.py", line 185, in launch
    main_func(local_rank=0, world_size=world_size, config=config)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/nerfstudio/scripts/train.py", line 100, in train_loop
    trainer.train()
  File "/home/ubuntu/.local/lib/python3.10/site-packages/nerfstudio/engine/trainer.py", line 240, in train
    loss, loss_dict, metrics_dict = self.train_iteration(step)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/nerfstudio/utils/profiler.py", line 127, in inner
    out = func(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/nerfstudio/engine/trainer.py", line 446, in train_iteration
    _, loss_dict, metrics_dict = self.pipeline.get_train_loss_dict(step=step)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/nerfstudio/utils/profiler.py", line 127, in inner
    out = func(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/nerfstudio/pipelines/base_pipeline.py", line 276, in get_train_loss_dict
    model_outputs = self._model(ray_bundle)  # train distributed data parallel model if world_size > 1
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/nerfstudio/models/base_model.py", line 140, in forward
    return self.get_outputs(ray_bundle)
  File "/home/ubuntu/tetra-nerf/tetranerf/nerfstudio/model.py", line 440, in get_outputs
    tracer_output = tracer.trace_rays(
RuntimeError: CUDA call (cudaMalloc( reinterpret_cast<void *>(&triangle_hit_distances), sizeof(float) * max_ray_triangles * num_rays ) ) failed with error: 'out of memory' (/home/ubuntu/tetra-nerf/src/tetrahedra_tracer.cpp:404)
```


```
wandb: Waiting for W&B process to finish... (failed 1).
wandb:
wandb: Run history:
wandb: ETA (time) █▄▃▂▁▁
wandb: GPU Memory (MB) ▁
wandb: Train Iter (time) █▄▃▂▁▁
wandb: Train Loss ▁
wandb: Train Loss Dict/rgb_loss ▁
wandb: Train Rays / Sec ▁██▆
wandb: learning_rate/fields █▇▅▄▂▁
wandb:
wandb: Run summary:
wandb: ETA (time) 142501.25605
wandb: GPU Memory (MB) 2860.13672
wandb: Train Iter (time) 0.47501
wandb: Train Loss 0.01844
wandb: Train Loss Dict/rgb_loss 0.01844
wandb: Train Rays / Sec 13076.66933
wandb: learning_rate/fields 0.001
wandb:
wandb: You can sync this run to the cloud by running:
wandb: wandb sync outputs/unnamed/tetra-nerf-original/2023-06-03_022954/wandb/offline-run-20230603_023001-79ywy4ew
wandb: Find logs at: outputs/unnamed/tetra-nerf-original/2023-06-03_022954/wandb/offline-run-20230603_023001-79ywy4ew/logs
```

jkulhanek commented 1 year ago

I am sorry, but 6 GB is not enough to run the original configuration. Please try `ns-train tetra-nerf` instead and reduce the number of rays per batch.
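For example, something along these lines should be a reasonable starting point (a sketch using the same flags that appear later in this thread; 1024 rays per batch is only a guess to adjust to your GPU):

```bash
# Sketch: the lighter `tetra-nerf` preset with a reduced ray batch.
# Flag names are taken from the logs in this thread; tune the batch size to your memory budget.
ns-train tetra-nerf \
    --pipeline.datamanager.train-num-rays-per-batch 1024 \
    --pipeline.datamanager.eval-num-rays-per-batch 1024 \
    --pipeline.model.tetrahedra-path data/blender/chair/pointnerf-0.5.th \
    blender-data --data data/blender/chair
```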

conby commented 1 year ago

Thanks for your response. With `tetra-nerf` and the number of rays per batch reduced to 32, `ns-train` still goes OOM (always at about 0.66% of training progress):

```
PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:32 ns-train tetra-nerf \
    --pipeline.datamanager.train-num-rays-per-batch 32 \
    --pipeline.datamanager.eval-num-rays-per-batch 32 \
    --pipeline.model.tetrahedra-path data/blender/chair/pointnerf-0.5.th \
    blender-data --data data/blender/chair
```

```
RuntimeError: CUDA call (cudaMalloc( reinterpret_cast<void *>(&triangle_hit_distances), sizeof(float) * max_ray_triangles * num_rays ) ) failed with error: 'out of memory' (/home/ubuntu/tetra-nerf/src/tetrahedra_tracer.cpp:404)
```

```
wandb: Waiting for W&B process to finish... (failed 1).
wandb:
wandb: Run history:
wandb: ETA (time) █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb: Eval Loss ▁
wandb: Eval Loss Dict/rgb_loss ▁
wandb: GPU Memory (MB) ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb: Train Iter (time) █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb: Train Loss █▃▂▂▅▂▇▂▃▂▃▃▁▂▁▂▃▂▄▃▄▂▁▂▁▃▃▂▁▂▂▂▂▁▁▂▂▂▂▁
wandb: Train Loss Dict/rgb_loss █▃▂▂▅▂▇▂▃▂▃▃▁▂▁▂▃▂▄▃▄▂▁▂▁▃▃▂▁▂▂▂▂▁▁▂▂▂▂▁
wandb: Train Rays / Sec ▅▃▅▄▃▃▄▃▄█▅▇▅▃▆▃▁▁▃▆▅▃▅▄▅▃▆▃▂▃▄▃▄▃▅▄▆█▄▃
wandb: learning_rate/fields ███▇▇▇▇▇▇▆▆▆▆▆▅▅▅▅▅▅▄▄▄▄▄▃▃▃▃▃▃▂▂▂▂▂▂▁▁▁
wandb:
wandb: Run summary:
wandb: ETA (time) 9118.6457
wandb: Eval Loss 0.00899
wandb: Eval Loss Dict/rgb_loss 0.00899
wandb: GPU Memory (MB) 408.64062
wandb: Train Iter (time) 0.0306
wandb: Train Loss 0.00505
wandb: Train Loss Dict/rgb_loss 0.00505
wandb: Train Rays / Sec 1048.40654
wandb: learning_rate/fields 0.00098
```

jkulhanek commented 1 year ago

I recommend limiting the maximum number of intersected triangles.
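For example, via the `--pipeline.model.max-intersected-triangles` flag (the same option used later in this thread); a sketch, with 256 as an illustrative value. The failing `cudaMalloc` above scales with `max_ray_triangles * num_rays`, so lowering this cap should shrink that buffer:

```bash
# Sketch: cap the number of triangles intersected per ray to reduce the tracer's allocation.
ns-train tetra-nerf \
    --pipeline.model.max-intersected-triangles 256 \
    --pipeline.model.tetrahedra-path data/blender/chair/pointnerf-0.5.th \
    blender-data --data data/blender/chair
```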

jkulhanek commented 1 year ago

Does nerfacto train on your data without problems?

conby commented 1 year ago

```
PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:32 ns-train tetra-nerf \
    --pipeline.datamanager.train-num-rays-per-batch 32 \
    --pipeline.datamanager.eval-num-rays-per-batch 32 \
    --pipeline.model.max-intersected-triangles 256 \
    --pipeline.model.eval-num-rays-per-chunk 32 \
    --pipeline.model.num-samples 32 \
    --pipeline.model.num-fine-samples 32 \
    --pipeline.model.tetrahedra-path data/blender/chair/pointnerf-0.5.th \
    blender-data --data data/blender/chair
JAX not installed, skipping Mip-NeRF SSIM
──────────────────────────────────────────────────────── Config ────────────────────────────────────────────────────────
TrainerConfig(
    _target=<class 'nerfstudio.engine.trainer.Trainer'>,
    output_dir=PosixPath('outputs'),
    method_name='tetra-nerf',
    experiment_name=None,
    project_name='nerfstudio-project',
    timestamp='2023-06-03_073127',
    machine=MachineConfig(seed=42, num_gpus=1, num_machines=1, machine_rank=0, dist_url='auto'),
    logging=LoggingConfig(
        relative_log_dir=PosixPath('.'),
        steps_per_log=10,
        max_buffer_size=20,
        local_writer=LocalWriterConfig(
            _target=<class 'nerfstudio.utils.writer.LocalWriter'>,
            enable=True,
            stats_to_track=(
                <EventName.ITER_TRAIN_TIME: 'Train Iter (time)'>,
                <EventName.TRAIN_RAYS_PER_SEC: 'Train Rays / Sec'>,
                <EventName.CURR_TEST_PSNR: 'Test PSNR'>,
                <EventName.VIS_RAYS_PER_SEC: 'Vis Rays / Sec'>,
                <EventName.TEST_RAYS_PER_SEC: 'Test Rays / Sec'>,
                <EventName.ETA: 'ETA (time)'>
            ),
            max_log_size=10
        ),
        profiler='basic'
    ),
    viewer=ViewerConfig(
        relative_log_filename='viewer_log_filename.txt',
        websocket_port=None,
        websocket_port_default=7007,
        websocket_host='0.0.0.0',
        num_rays_per_chunk=32768,
        max_num_display_images=512,
        quit_on_train_completion=False,
        image_format='jpeg',
        jpeg_quality=90
    ),
    pipeline=VanillaPipelineConfig(
        _target=<class 'tetranerf.nerfstudio.pipeline.TetrahedraNerfPipeline'>,
        datamanager=VanillaDataManagerConfig(
            _target=<class 'nerfstudio.data.datamanagers.base_datamanager.VanillaDataManager'>,
            data=None,
            camera_optimizer=CameraOptimizerConfig(
                _target=<class 'nerfstudio.cameras.camera_optimizers.CameraOptimizer'>,
                mode='off',
                position_noise_std=0.0,
                orientation_noise_std=0.0,
                optimizer=AdamOptimizerConfig(_target=<class 'torch.optim.adam.Adam'>, lr=0.0006, eps=1e-15, max_norm=None, weight_decay=0),
                scheduler=ExponentialDecaySchedulerConfig(_target=<class 'nerfstudio.engine.schedulers.ExponentialDecayScheduler'>, lr_pre_warmup=1e-08, lr_final=None, warmup_steps=0, max_steps=10000, ramp='cosine'),
                param_group='camera_opt'
            ),
            dataparser=BlenderDataParserConfig(_target=<class 'nerfstudio.data.dataparsers.blender_dataparser.Blender'>, data=PosixPath('data/blender/chair'), scale_factor=1.0, alpha_color='white'),
            train_num_rays_per_batch=32,
            train_num_images_to_sample_from=-1,
            train_num_times_to_repeat_images=-1,
            eval_num_rays_per_batch=32,
            eval_num_images_to_sample_from=-1,
            eval_num_times_to_repeat_images=-1,
            eval_image_indices=(0,),
            collate_fn=<function nerfstudio_collate at 0x7f577c54a7a0>,
            camera_res_scale_factor=1.0,
            patch_size=1
        ),
        model=TetrahedraNerfConfig(
            _target=<class 'tetranerf.nerfstudio.model.TetrahedraNerf'>,
            enable_collider=True,
            collider_params={'near_plane': 2.0, 'far_plane': 6.0},
            loss_coefficients={'rgb_loss_coarse': 1.0, 'rgb_loss_fine': 1.0},
            eval_num_rays_per_chunk=32,
            tetrahedra_path=PosixPath('data/blender/chair/pointnerf-0.5.th'),
            num_tetrahedra_vertices=174525,
            num_tetrahedra_cells=1087011,
            max_intersected_triangles=256,
            num_samples=32,
            num_fine_samples=32,
            use_biased_sampler=True,
            field_dim=64,
            num_color_layers=1,
            num_density_layers=3,
            hidden_size=128,
            input_fourier_frequencies=0,
            initialize_colors=True,
            use_gradient_scaling=True
        )
    ),
    optimizers={
        'fields': {
            'optimizer': RAdamOptimizerConfig(_target=<class 'torch.optim.radam.RAdam'>, lr=0.001, eps=1e-08, max_norm=None, weight_decay=0),
            'scheduler': ExponentialDecaySchedulerConfig(_target=<class 'nerfstudio.engine.schedulers.ExponentialDecayScheduler'>, lr_pre_warmup=1e-08, lr_final=0.0001, warmup_steps=0, max_steps=300000, ramp='cosine')
        }
    },
    vis='wandb',
    data=None,
    relative_model_dir=PosixPath('nerfstudio_models'),
    steps_per_save=25000,
    steps_per_eval_batch=1000,
    steps_per_eval_image=2000,
    steps_per_eval_all_images=50000,
    max_num_iterations=300000,
    mixed_precision=False,
    use_grad_scaler=False,
    save_only_latest_checkpoint=True,
    load_dir=None,
    load_step=None,
    load_config=None,
    load_checkpoint=None,
    log_gradients=False
)
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
[07:31:27] Saving config to: outputs/unnamed/tetra-nerf/2023-06-03_073127/config.yml          experiment_config.py:128
           Saving checkpoints to: outputs/unnamed/tetra-nerf/2023-06-03_073127/nerfstudio_models        trainer.py:136
Setting up training dataset...
Caching all 100 images.
Setting up evaluation dataset...
Caching all 100 images.
No Nerfstudio checkpoint to load, so training from scratch.
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: 3
wandb: You chose "Don't visualize my results"
wandb: Tracking run with wandb version 0.15.3
wandb: W&B syncing is set to offline in this directory.
wandb: Run wandb online or set WANDB_MODE=online to enable cloud syncing.
logging events to: outputs/unnamed/tetra-nerf/2023-06-03_073127
Tetrahedra initialized from file data/blender/chair/pointnerf-0.5.th:
    Num points: 174525
    Num tetrahedra: 1087011
[ 4][ KNOBS]: All knobs on default.

[ 4][ DISK CACHE]: Opened database: "/var/tmp/OptixCache_ubuntu/optix7cache.db"
[ 4][ DISK CACHE]: Cache data size: "30.2 KiB"
[ 4][ DISKCACHE]: Cache hit for key: ptx-14549-keyefbf26c79f6345943421c125989da67a-sm_75-rtc0-drv525.105.17
[ 4][COMPILE FEEDBACK]:
[ 4][COMPILE FEEDBACK]: Info: Pipeline has 1 module(s), 4 entry function(s), 1 trace call(s), 0 continuation callable call(s), 0 direct callable call(s), 59 basic block(s) in entry functions, 543 instruction(s) in entry functions, 8 non-entry function(s), 63 basic block(s) in non-entry functions, 811 instruction(s) in non-entry functions, no debug information

[07:31:34] Printing max of 10 lines. Set flag --logging.local-writer.max-log-size=0 to disable line wrapping.          writer.py:408

Step (% Done)       Train Iter (time)       ETA (time)            Train Rays / Sec       Test Rays / Sec
3710 (1.24%)        30.652 ms               2 h, 31 m, 21 s       1.05 K
3720 (1.24%)        31.427 ms               2 h, 35 m, 11 s       1.02 K
3730 (1.24%)        31.215 ms               2 h, 34 m, 7 s        1.03 K
3740 (1.25%)        31.523 ms               2 h, 35 m, 39 s       1.02 K
3750 (1.25%)        31.349 ms               2 h, 34 m, 47 s       1.02 K
3760 (1.25%)        30.495 ms               2 h, 30 m, 33 s       1.05 K
3770 (1.26%)        30.634 ms               2 h, 31 m, 14 s       1.05 K
3780 (1.26%)        31.124 ms               2 h, 33 m, 39 s       1.03 K
3790 (1.26%)        31.544 ms               2 h, 35 m, 43 s       1.02 K
3800 (1.27%)        31.634 ms               2 h, 36 m, 9 s        1.01 K
Printing profiling stats, from longest to shortest duration in seconds
VanillaPipeline.get_eval_image_metrics_and_images: 117.4870
Trainer.eval_iteration: 0.0309
Trainer.train_iteration: 0.0300
VanillaPipeline.get_eval_loss_dict: 0.0243
VanillaPipeline.get_train_loss_dict: 0.0231
Traceback (most recent call last):
  File "/home/ubuntu/.local/bin/ns-train", line 8, in <module>
    sys.exit(entrypoint())
  File "/home/ubuntu/.local/lib/python3.10/site-packages/nerfstudio/scripts/train.py", line 260, in entrypoint
    main(
  File "/home/ubuntu/.local/lib/python3.10/site-packages/nerfstudio/scripts/train.py", line 246, in main
    launch(
  File "/home/ubuntu/.local/lib/python3.10/site-packages/nerfstudio/scripts/train.py", line 185, in launch
    main_func(local_rank=0, world_size=world_size, config=config)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/nerfstudio/scripts/train.py", line 100, in train_loop
    trainer.train()
  File "/home/ubuntu/.local/lib/python3.10/site-packages/nerfstudio/engine/trainer.py", line 240, in train
    loss, loss_dict, metrics_dict = self.train_iteration(step)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/nerfstudio/utils/profiler.py", line 127, in inner
    out = func(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/nerfstudio/engine/trainer.py", line 448, in train_iteration
    self.grad_scaler.scale(loss).backward()  # type: ignore
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

wandb: Waiting for W&B process to finish... (failed 1).
wandb:
wandb: Run history:
wandb: ETA (time) ▁▇▅▅▅▄▅▇▄▇▇▆▆▇▆▇▅▇▄█▄▄▅▄▅▅▆▄▄▅▂▆▅▄▄▆▇▇▅▆
wandb: Eval Images Metrics/image_idx ▁
wandb: Eval Images Metrics/lpips ▁
wandb: Eval Images Metrics/nerfstudio_ssim ▁
wandb: Eval Images Metrics/num_rays ▁
wandb: Eval Images Metrics/psnr ▁
wandb: Eval Images Metrics/skimage_ssim ▁
wandb: Eval Loss ▂▁█
wandb: Eval Loss Dict/rgb_loss ▂▁█
wandb: GPU Memory (MB) ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁███████████████████
wandb: Test Rays / Sec ▁
wandb: Train Iter (time) ▁▆▅▅▅▄▅▇▄▇▇▆▆▇▆▇▅▇▄█▄▄▅▄▅▅▆▄▄▅▂▆▅▅▅▇▇▇▅▆
wandb: Train Loss ▃▂▃█▃▃▂▂▂▂▃▂▂▄▁▂▂▂▁▂▂▃▂▂▃▄▂▁▂▁▂▂▂▂▂▄▂▂▂▂
wandb: Train Loss Dict/rgb_loss ▃▂▃█▃▃▂▂▂▂▃▂▂▄▁▂▂▂▁▂▂▃▂▂▃▄▂▁▂▁▂▂▂▂▂▄▂▂▂▂
wandb: Train Rays / Sec █▂▃▄▃▅▄▂▅▂▂▃▃▂▃▂▃▂▅▁▄▄▃▄▃▄▂▄▄▃▆▂▄▄▄▂▁▁▄▂
wandb: learning_rate/fields ███▇▇▇▇▇▇▆▆▆▆▆▅▅▅▅▅▅▄▄▄▄▄▃▃▃▃▃▃▂▂▂▂▂▂▁▁▁
wandb:
wandb: Run summary:
wandb: ETA (time) 9428.68531
wandb: Eval Images Metrics/image_idx 2.0
wandb: Eval Images Metrics/lpips 0.2204
wandb: Eval Images Metrics/nerfstudio_ssim 0.80186
wandb: Eval Images Metrics/num_rays 640000.0
wandb: Eval Images Metrics/psnr 19.81445
wandb: Eval Images Metrics/skimage_ssim 0.78473
wandb: Eval Loss 0.01148
wandb: Eval Loss Dict/rgb_loss 0.01148
wandb: GPU Memory (MB) 3626.26172
wandb: Test Rays / Sec 5447.41097
wandb: Train Iter (time) 0.03183
wandb: Train Loss 0.00546
wandb: Train Loss Dict/rgb_loss 0.00546
wandb: Train Rays / Sec 1007.51773
wandb: learning_rate/fields 0.00097
wandb:
wandb: You can sync this run to the cloud by running:
wandb: wandb sync outputs/unnamed/tetra-nerf/2023-06-03_073127/wandb/offline-run-20230603_073133-p3wz1c75
wandb: Find logs at: outputs/unnamed/tetra-nerf/2023-06-03_073127/wandb/offline-run-20230603_073133-p3wz1c75/logs
```

jkulhanek commented 1 year ago

It seems the number of rays per batch was too small and no ray intersected the tetrahedra, so no gradient was computed.
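That matches the error text: if the sampled batch produces nothing differentiable, the loss has no `grad_fn` and `backward()` cannot run. A quick way to reproduce the same message outside tetra-nerf (illustration only, not the model code):

```bash
# Illustration only: calling backward() on a constant tensor (which has no grad_fn)
# raises exactly the error seen in the log above.
python -c "import torch; torch.tensor(0.0).backward()"
# RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
```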

conby commented 1 year ago

Set the number of rays per batch to 128; it is still training...

```
Step (% Done)       Train Iter (time)       ETA (time)            Train Rays / Sec       Test Rays / Sec
15900 (5.30%)       31.378 ms               2 h, 28 m, 34 s       4.09 K
15910 (5.30%)       29.986 ms               2 h, 21 m, 58 s       4.28 K
15920 (5.31%)       30.333 ms               2 h, 23 m, 37 s       4.24 K
15930 (5.31%)       31.982 ms               2 h, 31 m, 25 s       4.01 K
15940 (5.31%)       32.258 ms               2 h, 32 m, 43 s       3.97 K
15950 (5.32%)       32.041 ms               2 h, 31 m, 41 s       4.01 K
15960 (5.32%)       31.386 ms               2 h, 28 m, 34 s       4.09 K
15970 (5.32%)       31.485 ms               2 h, 29 m, 2 s        4.08 K
15980 (5.33%)       32.068 ms               2 h, 31 m, 47 s       4.00 K
15990 (5.33%)       32.508 ms               2 h, 33 m, 52 s       3.95 K
```

jkulhanek commented 1 year ago

128 seems quite low. Would you be able to use at least 1024?

conby commented 1 year ago

> 128 seems quite low. Would you be able to use at least 1024?

Yes, with 1024 it is still going well.

jkulhanek commented 1 year ago

You can try increasing it until you hit OOM.
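A rough way to probe that limit (hypothetical sketch reusing the flags from this thread; in practice you would watch the first few hundred iterations of each run rather than let it finish):

```bash
# Hypothetical sketch: launch training with progressively larger ray batches and
# stop at the first size that fails (e.g. with the cudaMalloc OOM above).
for rays in 1024 2048 4096 8192; do
    echo "Trying --pipeline.datamanager.train-num-rays-per-batch ${rays}"
    ns-train tetra-nerf \
        --pipeline.datamanager.train-num-rays-per-batch "${rays}" \
        --pipeline.datamanager.eval-num-rays-per-batch "${rays}" \
        --pipeline.model.tetrahedra-path data/blender/chair/pointnerf-0.5.th \
        blender-data --data data/blender/chair || break
done
```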