Hi~
I am trying to train nerf-w on Replica (raw image size 1200×680, using 400 frames for training), but it fails to launch training with the error below.
My GPU is a Titan RTX (24 GB), so I believe it's not a memory issue: I have tried decreasing batch_size and the error still occurs.
$ python train.py --root_dir /media/dlr/nd/Replica/office0/ --dataset_name replica --img_downscale 8 --use_cache --N_importance 64 --N_samples 64 --encode_a --encode_t --beta_min 0.03 --N_vocab 1500 --num_epochs 20 --batch_size 1024 --optimizer adam --lr 5e-4 --lr_scheduler cosine --exp_name replicaoffice0_scale8_nerfw
Namespace(N_a=48, N_emb_dir=4, N_emb_xyz=10, N_importance=64, N_samples=64, N_tau=16, N_vocab=1500, batch_size=1024, beta_min=0.03, chunk=32768, ckpt_path=None, data_perturb=[], dataset_name='replica', decay_gamma=0.1, decay_step=[20], encode_a=True, encode_t=True, exp_name='replicaoffice0_scale8_nerfw', img_downscale=8, img_wh=[800, 800], lr=0.0005, lr_scheduler='cosine', momentum=0.9, noise_std=1.0, num_epochs=20, num_gpus=1, optimizer='adam', perturb=1.0, poly_exp=0.9, prefixes_to_ignore=['loss'], refresh_every=1, root_dir='/media/dlr/nd/Replica/office0/', use_cache=True, use_disp=False, warmup_epochs=0, warmup_multiplier=1.0, weight_decay=0)
GPU available: True, used: True
INFO - 2022-08-22 11:32:59,622 - distributed - GPU available: True, used: True
TPU available: None, using: 0 TPU cores
INFO - 2022-08-22 11:32:59,622 - distributed - TPU available: None, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
INFO - 2022-08-22 11:32:59,623 - accelerator_connector - LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
Epoch 0: 0%| | 0/4982 [00:00<?, ?it/s]
Profiler Report
Action | Mean duration (s) |Num calls | Total time (s) | Percentage % |
-----------------------------------------------------------------------------------------------------------------------------
Total | - |_ | 5.0118 | 100 % |
-----------------------------------------------------------------------------------------------------------------------------
run_training_epoch | 0.92119 |1 | 0.92119 | 18.38 |
get_train_batch | 0.81981 |1 | 0.81981 | 16.357 |
evaluation_step_and_end | 0.58803 |1 | 0.58803 | 11.733 |
run_training_batch | 0.10115 |1 | 0.10115 | 2.0182 |
optimizer_step_and_closure_0 | 0.099758 |1 | 0.099758 | 1.9905 |
training_step_and_backward | 0.099435 |1 | 0.099435 | 1.984 |
model_forward | 0.099398 |1 | 0.099398 | 1.9833 |
on_validation_batch_end | 0.00109 |1 | 0.00109 | 0.021749 |
on_train_end | 0.0010118 |1 | 0.0010118 | 0.020188 |
cache_result | 7.9223e-05 |12 | 0.00095068 | 0.018969 |
on_validation_end | 0.00050522 |1 | 0.00050522 | 0.010081 |
on_epoch_start | 0.0004731 |1 | 0.0004731 | 0.0094396 |
on_train_start | 0.00035004 |1 | 0.00035004 | 0.0069842 |
on_batch_start | 0.00028297 |1 | 0.00028297 | 0.0056461 |
on_validation_batch_start | 0.00015701 |1 | 0.00015701 | 0.0031328 |
validation_step_end | 6.5605e-05 |1 | 6.5605e-05 | 0.001309 |
on_train_batch_start | 4.6355e-05 |1 | 4.6355e-05 | 0.00092491 |
on_validation_epoch_end | 4.3134e-05 |1 | 4.3134e-05 | 0.00086064 |
on_fit_start | 3.5627e-05 |1 | 3.5627e-05 | 0.00071085 |
on_train_epoch_start | 2.2042e-05 |1 | 2.2042e-05 | 0.0004398 |
on_validation_start | 2.1432e-05 |1 | 2.1432e-05 | 0.00042763 |
on_validation_epoch_start | 1.3103e-05 |1 | 1.3103e-05 | 0.00026145 |
Traceback (most recent call last):
File "/home/dlr/anaconda3/envs/nerfw_pl/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 560, in train
self.train_loop.run_training_epoch()
File "/home/dlr/anaconda3/envs/nerfw_pl/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 534, in run_training_epoch
batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
File "/home/dlr/anaconda3/envs/nerfw_pl/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 692, in run_training_batch
self.optimizer_step(optimizer, opt_idx, batch_idx, train_step_and_backward_closure)
File "/home/dlr/anaconda3/envs/nerfw_pl/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 475, in optimizer_step
using_lbfgs=is_lbfgs,
File "/home/dlr/anaconda3/envs/nerfw_pl/lib/python3.7/site-packages/pytorch_lightning/core/lightning.py", line 1264, in optimizer_step
optimizer.step(closure=optimizer_closure)
File "/home/dlr/anaconda3/envs/nerfw_pl/lib/python3.7/site-packages/pytorch_lightning/core/optimizer.py", line 286, in step
self.__optimizer_step(*args, closure=closure, profiler_name=profiler_name, **kwargs)
File "/home/dlr/anaconda3/envs/nerfw_pl/lib/python3.7/site-packages/pytorch_lightning/core/optimizer.py", line 144, in __optimizer_step
optimizer.step(closure=closure, *args, **kwargs)
File "/home/dlr/anaconda3/envs/nerfw_pl/lib/python3.7/site-packages/torch/optim/lr_scheduler.py", line 67, in wrapper
return wrapped(*args, **kwargs)
File "/home/dlr/anaconda3/envs/nerfw_pl/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
return func(*args, **kwargs)
File "/home/dlr/anaconda3/envs/nerfw_pl/lib/python3.7/site-packages/torch/optim/adam.py", line 66, in step
loss = closure()
File "/home/dlr/anaconda3/envs/nerfw_pl/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 687, in train_step_and_backward_closure
self.trainer.hiddens
File "/home/dlr/anaconda3/envs/nerfw_pl/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 780, in training_step_and_backward
result = self.training_step(split_batch, batch_idx, opt_idx, hiddens)
File "/home/dlr/anaconda3/envs/nerfw_pl/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 301, in training_step
training_step_output = self.trainer.accelerator_backend.training_step(args)
File "/home/dlr/anaconda3/envs/nerfw_pl/lib/python3.7/site-packages/pytorch_lightning/accelerators/gpu_accelerator.py", line 71, in training_step
return self._step(self.trainer.model.training_step, args)
File "/home/dlr/anaconda3/envs/nerfw_pl/lib/python3.7/site-packages/pytorch_lightning/accelerators/gpu_accelerator.py", line 66, in _step
output = model_step(*args)
File "train.py", line 131, in training_step
results = self(rays, ts)
File "/home/dlr/anaconda3/envs/nerfw_pl/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "train.py", line 87, in forward
self.train_dataset.white_back)
File "/home/dlr/Project/nerf_pl/models/rendering.py", line 278, in render_rays
inference(results, model, xyz_fine, z_vals, test_time, **kwargs)
File "/home/dlr/Project/nerf_pl/models/rendering.py", line 118, in inference
inputs = [embedding_xyz(xyz_[i:i+chunk]), dir_embedded_[i:i+chunk]]
File "/home/dlr/anaconda3/envs/nerfw_pl/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/dlr/Project/nerf_pl/models/nerf.py", line 28, in forward
out += [func(freq*x)]
RuntimeError: CUDA error: device-side assert triggered
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "train.py", line 218, in <module>
main(hparams)
File "train.py", line 212, in main
trainer.fit(system)
File "/home/dlr/anaconda3/envs/nerfw_pl/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 509, in fit
results = self.accelerator_backend.train()
File "/home/dlr/anaconda3/envs/nerfw_pl/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 57, in train
return self.train_or_test()
File "/home/dlr/anaconda3/envs/nerfw_pl/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 74, in train_or_test
results = self.trainer.train()
File "/home/dlr/anaconda3/envs/nerfw_pl/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 591, in train
self.train_loop.on_train_end()
File "/home/dlr/anaconda3/envs/nerfw_pl/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 182, in on_train_end
model.cpu()
File "/home/dlr/anaconda3/envs/nerfw_pl/lib/python3.7/site-packages/pytorch_lightning/utilities/device_dtype_mixin.py", line 138, in cpu
return super().cpu()
File "/home/dlr/anaconda3/envs/nerfw_pl/lib/python3.7/site-packages/torch/nn/modules/module.py", line 471, in cpu
return self._apply(lambda t: t.cpu())
File "/home/dlr/anaconda3/envs/nerfw_pl/lib/python3.7/site-packages/torch/nn/modules/module.py", line 359, in _apply
module._apply(fn)
File "/home/dlr/anaconda3/envs/nerfw_pl/lib/python3.7/site-packages/torch/nn/modules/module.py", line 381, in _apply
param_applied = fn(param)
File "/home/dlr/anaconda3/envs/nerfw_pl/lib/python3.7/site-packages/torch/nn/modules/module.py", line 471, in <lambda>
return self._apply(lambda t: t.cpu())
RuntimeError: CUDA error: device-side assert triggered
Do you know why this happens? Thanks in advance!
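For context, one common cause of this kind of device-side assert in nerf-w setups is not memory but an out-of-range embedding index: if any frame id `ts` passed to the `encode_a`/`encode_t` embedding tables is >= `N_vocab` (1500 here), the GPU lookup asserts, and because CUDA errors are reported asynchronously the traceback points at an unrelated call site (here `embedding_xyz`). Rerunning with `CUDA_LAUNCH_BLOCKING=1 python train.py ...` would pin the assert to the actual failing kernel. A minimal sketch of the id-range sanity check, using a hypothetical `check_frame_ids` helper (not part of the repo):

```python
def check_frame_ids(frame_ids, n_vocab):
    """Return the ids that would overflow an nn.Embedding(n_vocab, ...) table.

    Any id outside the valid range [0, n_vocab) makes the GPU embedding
    lookup fail with "CUDA error: device-side assert triggered".
    """
    return [t for t in frame_ids if t < 0 or t >= n_vocab]


# Example: with --N_vocab 1500, ids 0..1499 are fine, 1500 is not.
print(check_frame_ids([0, 10, 1499], 1500))        # []
print(check_frame_ids([1499, 1500, 2000], 1500))   # [1500, 2000]
```

If this check (run over the ids your Replica dataset class assigns to the 400 training frames) returns a non-empty list, raising `--N_vocab` above the largest frame id, or remapping ids to a dense 0..N-1 range, should make the assert go away.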