kwea123 / nerf_pl

NeRF (Neural Radiance Fields) and NeRF in the Wild using pytorch-lightning
https://www.youtube.com/playlist?list=PLDV2CyUo4q-K02pNEyDr7DYpTQuka3mbV
MIT License

RuntimeError: CUDA error: device-side assert triggered #173

Closed LongruiDong closed 2 years ago

LongruiDong commented 2 years ago

Hi~ I am trying to train NeRF-W on Replica (raw image size 1200*680, using 400 frames for training), but training fails to launch with the error below. My GPU is a Titan RTX 24GB, so I don't think it's a memory issue: I tried decreasing batch_size and the error still occurs.

$ python train.py --root_dir /media/dlr/nd/Replica/office0/ --dataset_name replica --img_downscale 8 --use_cache --N_importance 64 --N_samples 64 --encode_a --encode_t --beta_min 0.03 --N_vocab 1500 --num_epochs 20 --batch_size 1024 --optimizer adam --lr 5e-4 --lr_scheduler cosine --exp_name replicaoffice0_scale8_nerfw
Namespace(N_a=48, N_emb_dir=4, N_emb_xyz=10, N_importance=64, N_samples=64, N_tau=16, N_vocab=1500, batch_size=1024, beta_min=0.03, chunk=32768, ckpt_path=None, data_perturb=[], dataset_name='replica', decay_gamma=0.1, decay_step=[20], encode_a=True, encode_t=True, exp_name='replicaoffice0_scale8_nerfw', img_downscale=8, img_wh=[800, 800], lr=0.0005, lr_scheduler='cosine', momentum=0.9, noise_std=1.0, num_epochs=20, num_gpus=1, optimizer='adam', perturb=1.0, poly_exp=0.9, prefixes_to_ignore=['loss'], refresh_every=1, root_dir='/media/dlr/nd/Replica/office0/', use_cache=True, use_disp=False, warmup_epochs=0, warmup_multiplier=1.0, weight_decay=0)
GPU available: True, used: True
INFO - 2022-08-22 11:32:59,622 - distributed - GPU available: True, used: True
TPU available: None, using: 0 TPU cores
INFO - 2022-08-22 11:32:59,622 - distributed - TPU available: None, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
INFO - 2022-08-22 11:32:59,623 - accelerator_connector - LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
Epoch 0:   0%|                                                                                                                             | 0/4982 [00:00<?, ?it/s]

Profiler Report

Action                          |  Mean duration (s)    |Num calls              |  Total time (s)       |  Percentage %         |
-----------------------------------------------------------------------------------------------------------------------------
Total                           |  -                    |_                      |  5.0118               |  100 %                |
-----------------------------------------------------------------------------------------------------------------------------
run_training_epoch              |  0.92119              |1                      |  0.92119              |  18.38                |
get_train_batch                 |  0.81981              |1                      |  0.81981              |  16.357               |
evaluation_step_and_end         |  0.58803              |1                      |  0.58803              |  11.733               |
run_training_batch              |  0.10115              |1                      |  0.10115              |  2.0182               |
optimizer_step_and_closure_0    |  0.099758             |1                      |  0.099758             |  1.9905               |
training_step_and_backward      |  0.099435             |1                      |  0.099435             |  1.984                |
model_forward                   |  0.099398             |1                      |  0.099398             |  1.9833               |
on_validation_batch_end         |  0.00109              |1                      |  0.00109              |  0.021749             |
on_train_end                    |  0.0010118            |1                      |  0.0010118            |  0.020188             |
cache_result                    |  7.9223e-05           |12                     |  0.00095068           |  0.018969             |
on_validation_end               |  0.00050522           |1                      |  0.00050522           |  0.010081             |
on_epoch_start                  |  0.0004731            |1                      |  0.0004731            |  0.0094396            |
on_train_start                  |  0.00035004           |1                      |  0.00035004           |  0.0069842            |
on_batch_start                  |  0.00028297           |1                      |  0.00028297           |  0.0056461            |
on_validation_batch_start       |  0.00015701           |1                      |  0.00015701           |  0.0031328            |
validation_step_end             |  6.5605e-05           |1                      |  6.5605e-05           |  0.001309             |
on_train_batch_start            |  4.6355e-05           |1                      |  4.6355e-05           |  0.00092491           |
on_validation_epoch_end         |  4.3134e-05           |1                      |  4.3134e-05           |  0.00086064           |
on_fit_start                    |  3.5627e-05           |1                      |  3.5627e-05           |  0.00071085           |
on_train_epoch_start            |  2.2042e-05           |1                      |  2.2042e-05           |  0.0004398            |
on_validation_start             |  2.1432e-05           |1                      |  2.1432e-05           |  0.00042763           |
on_validation_epoch_start       |  1.3103e-05           |1                      |  1.3103e-05           |  0.00026145           |


Traceback (most recent call last):
  File "/home/dlr/anaconda3/envs/nerfw_pl/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 560, in train
    self.train_loop.run_training_epoch()
  File "/home/dlr/anaconda3/envs/nerfw_pl/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 534, in run_training_epoch
    batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
  File "/home/dlr/anaconda3/envs/nerfw_pl/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 692, in run_training_batch
    self.optimizer_step(optimizer, opt_idx, batch_idx, train_step_and_backward_closure)
  File "/home/dlr/anaconda3/envs/nerfw_pl/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 475, in optimizer_step
    using_lbfgs=is_lbfgs,
  File "/home/dlr/anaconda3/envs/nerfw_pl/lib/python3.7/site-packages/pytorch_lightning/core/lightning.py", line 1264, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/home/dlr/anaconda3/envs/nerfw_pl/lib/python3.7/site-packages/pytorch_lightning/core/optimizer.py", line 286, in step
    self.__optimizer_step(*args, closure=closure, profiler_name=profiler_name, **kwargs)
  File "/home/dlr/anaconda3/envs/nerfw_pl/lib/python3.7/site-packages/pytorch_lightning/core/optimizer.py", line 144, in __optimizer_step
    optimizer.step(closure=closure, *args, **kwargs)
  File "/home/dlr/anaconda3/envs/nerfw_pl/lib/python3.7/site-packages/torch/optim/lr_scheduler.py", line 67, in wrapper
    return wrapped(*args, **kwargs)
  File "/home/dlr/anaconda3/envs/nerfw_pl/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
    return func(*args, **kwargs)
  File "/home/dlr/anaconda3/envs/nerfw_pl/lib/python3.7/site-packages/torch/optim/adam.py", line 66, in step
    loss = closure()
  File "/home/dlr/anaconda3/envs/nerfw_pl/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 687, in train_step_and_backward_closure
    self.trainer.hiddens
  File "/home/dlr/anaconda3/envs/nerfw_pl/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 780, in training_step_and_backward
    result = self.training_step(split_batch, batch_idx, opt_idx, hiddens)
  File "/home/dlr/anaconda3/envs/nerfw_pl/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 301, in training_step
    training_step_output = self.trainer.accelerator_backend.training_step(args)
  File "/home/dlr/anaconda3/envs/nerfw_pl/lib/python3.7/site-packages/pytorch_lightning/accelerators/gpu_accelerator.py", line 71, in training_step
    return self._step(self.trainer.model.training_step, args)
  File "/home/dlr/anaconda3/envs/nerfw_pl/lib/python3.7/site-packages/pytorch_lightning/accelerators/gpu_accelerator.py", line 66, in _step
    output = model_step(*args)
  File "train.py", line 131, in training_step
    results = self(rays, ts)
  File "/home/dlr/anaconda3/envs/nerfw_pl/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "train.py", line 87, in forward
    self.train_dataset.white_back)
  File "/home/dlr/Project/nerf_pl/models/rendering.py", line 278, in render_rays
    inference(results, model, xyz_fine, z_vals, test_time, **kwargs)
  File "/home/dlr/Project/nerf_pl/models/rendering.py", line 118, in inference
    inputs = [embedding_xyz(xyz_[i:i+chunk]), dir_embedded_[i:i+chunk]]
  File "/home/dlr/anaconda3/envs/nerfw_pl/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/dlr/Project/nerf_pl/models/nerf.py", line 28, in forward
    out += [func(freq*x)]
RuntimeError: CUDA error: device-side assert triggered

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 218, in <module>
    main(hparams)
  File "train.py", line 212, in main
    trainer.fit(system)
  File "/home/dlr/anaconda3/envs/nerfw_pl/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 509, in fit
    results = self.accelerator_backend.train()
  File "/home/dlr/anaconda3/envs/nerfw_pl/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 57, in train
    return self.train_or_test()
  File "/home/dlr/anaconda3/envs/nerfw_pl/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 74, in train_or_test
    results = self.trainer.train()
  File "/home/dlr/anaconda3/envs/nerfw_pl/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 591, in train
    self.train_loop.on_train_end()
  File "/home/dlr/anaconda3/envs/nerfw_pl/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 182, in on_train_end
    model.cpu()
  File "/home/dlr/anaconda3/envs/nerfw_pl/lib/python3.7/site-packages/pytorch_lightning/utilities/device_dtype_mixin.py", line 138, in cpu
    return super().cpu()
  File "/home/dlr/anaconda3/envs/nerfw_pl/lib/python3.7/site-packages/torch/nn/modules/module.py", line 471, in cpu
    return self._apply(lambda t: t.cpu())
  File "/home/dlr/anaconda3/envs/nerfw_pl/lib/python3.7/site-packages/torch/nn/modules/module.py", line 359, in _apply
    module._apply(fn)
  File "/home/dlr/anaconda3/envs/nerfw_pl/lib/python3.7/site-packages/torch/nn/modules/module.py", line 381, in _apply
    param_applied = fn(param)
  File "/home/dlr/anaconda3/envs/nerfw_pl/lib/python3.7/site-packages/torch/nn/modules/module.py", line 471, in <lambda>
    return self._apply(lambda t: t.cpu())
RuntimeError: CUDA error: device-side assert triggered

Do you know why? Thanks in advance!

LongruiDong commented 2 years ago

Oops, I missed your note about N_vocab 🐶. The ids in my data go up to 2000, which is more than the N_vocab of 1500 I passed, so the fix is simply to increase N_vocab.
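For anyone hitting the same assert, here is a minimal sketch of a pre-flight check. It assumes that the per-image embeddings enabled by --encode_a and --encode_t are lookup tables with N_vocab rows, so every frame id must lie in [0, N_vocab). The helper names `required_n_vocab` and `check_n_vocab` and the example ids are hypothetical and not part of the repo.

```python
def required_n_vocab(frame_ids):
    """Return the smallest table size that can index every id in frame_ids."""
    ids = list(frame_ids)
    if not ids:
        raise ValueError("no frame ids given")
    if min(ids) < 0:
        raise ValueError("frame ids must be non-negative")
    # Valid indices for a table with N rows are 0 .. N-1.
    return max(ids) + 1


def check_n_vocab(frame_ids, n_vocab):
    """Raise a readable error on the CPU instead of a device-side assert."""
    needed = required_n_vocab(frame_ids)
    if needed > n_vocab:
        raise ValueError(
            f"largest frame id is {needed - 1}, but N_vocab={n_vocab}; "
            f"use --N_vocab {needed} or larger"
        )


# Hypothetical ids for illustration: the largest one is 2000, as in this issue.
example_ids = [0, 750, 1499, 2000]
print(required_n_vocab(example_ids))  # 2001

try:
    check_n_vocab(example_ids, 1500)  # the value passed in the failing command
except ValueError as err:
    print(err)
```

Calling something like this on the dataset's ids before training would surface the mismatch immediately. CUDA errors are reported asynchronously, so the line shown in the traceback (here the positional encoding in nerf.py) is not necessarily the operation that failed; rerunning with the environment variable CUDA_LAUNCH_BLOCKING=1 usually points closer to the real location.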