rasmushaugaard / surfemb

SurfEmb (CVPR 2022)
https://surfemb.github.io/
MIT License
77 stars, 17 forks

error when resuming from checkpoint #21

Closed: MoritzkoLP closed this issue 1 year ago

MoritzkoLP commented 2 years ago

Whenever I try to resume from a previous checkpoint, I get this error:

  File "/home/moritz/anaconda3/envs/surfemb/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/moritz/anaconda3/envs/surfemb/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/moritz/surfemb/surfemb/surfemb/scripts/train.py", line 123, in <module>
    main()
  File "/home/moritz/surfemb/surfemb/surfemb/scripts/train.py", line 119, in main
    trainer.fit(model, loader_train, loader_valid)
  File "/home/moritz/anaconda3/envs/surfemb/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 768, in fit
    self._call_and_handle_interrupt(
  File "/home/moritz/anaconda3/envs/surfemb/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 721, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/moritz/anaconda3/envs/surfemb/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 809, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/home/moritz/anaconda3/envs/surfemb/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1234, in _run
    results = self._run_stage()
  File "/home/moritz/anaconda3/envs/surfemb/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1321, in _run_stage
    return self._run_train()
  File "/home/moritz/anaconda3/envs/surfemb/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1351, in _run_train
    self.fit_loop.run()
  File "/home/moritz/anaconda3/envs/surfemb/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/home/moritz/anaconda3/envs/surfemb/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 268, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/home/moritz/anaconda3/envs/surfemb/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/home/moritz/anaconda3/envs/surfemb/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 208, in advance
    batch_output = self.batch_loop.run(batch, batch_idx)
  File "/home/moritz/anaconda3/envs/surfemb/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/home/moritz/anaconda3/envs/surfemb/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 88, in advance
    outputs = self.optimizer_loop.run(split_batch, optimizers, batch_idx)
  File "/home/moritz/anaconda3/envs/surfemb/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/home/moritz/anaconda3/envs/surfemb/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 203, in advance
    result = self._run_optimization(
  File "/home/moritz/anaconda3/envs/surfemb/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 256, in _run_optimization
    self._optimizer_step(optimizer, opt_idx, batch_idx, closure)
  File "/home/moritz/anaconda3/envs/surfemb/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 369, in _optimizer_step
    self.trainer._call_lightning_module_hook(
  File "/home/moritz/anaconda3/envs/surfemb/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1593, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/home/moritz/anaconda3/envs/surfemb/lib/python3.8/site-packages/pytorch_lightning/core/lightning.py", line 1644, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/home/moritz/anaconda3/envs/surfemb/lib/python3.8/site-packages/pytorch_lightning/core/optimizer.py", line 168, in step
    step_output = self._strategy.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)
  File "/home/moritz/anaconda3/envs/surfemb/lib/python3.8/site-packages/pytorch_lightning/strategies/strategy.py", line 193, in optimizer_step
    return self.precision_plugin.optimizer_step(model, optimizer, opt_idx, closure, **kwargs)
  File "/home/moritz/anaconda3/envs/surfemb/lib/python3.8/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 155, in optimizer_step
    return optimizer.step(closure=closure, **kwargs)
  File "/home/moritz/anaconda3/envs/surfemb/lib/python3.8/site-packages/torch/optim/lr_scheduler.py", line 65, in wrapper
    return wrapped(*args, **kwargs)
  File "/home/moritz/anaconda3/envs/surfemb/lib/python3.8/site-packages/torch/optim/optimizer.py", line 109, in wrapper
    return func(*args, **kwargs)
  File "/home/moritz/anaconda3/envs/surfemb/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/moritz/anaconda3/envs/surfemb/lib/python3.8/site-packages/torch/optim/adam.py", line 157, in step
    adam(params_with_grad,
  File "/home/moritz/anaconda3/envs/surfemb/lib/python3.8/site-packages/torch/optim/adam.py", line 213, in adam
    func(params,
  File "/home/moritz/anaconda3/envs/surfemb/lib/python3.8/site-packages/torch/optim/adam.py", line 255, in _single_tensor_adam
    assert not step_t.is_cuda, "If capturable=False, state_steps should not be CUDA tensors."
AssertionError: If capturable=False, state_steps should not be CUDA tensors.

Any idea how to resolve this?
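
For reference, the assertion says that Adam's per-parameter `step` counters are on the GPU while `capturable=False`. A possible workaround, sketched below, is to move those counters back to the CPU after the checkpoint's optimizer state has been restored. This is an untested sketch, not something taken from the SurfEmb code: the helper name `move_optimizer_steps_to_cpu` is hypothetical, and it only assumes the optimizer exposes the usual `state` mapping of per-parameter dicts, as `torch.optim` optimizers do.

```python
def move_optimizer_steps_to_cpu(optimizer):
    """Move CUDA `step` counters in an optimizer's state back to the CPU.

    Hypothetical helper (not part of SurfEmb, PyTorch, or Lightning).
    Returns the number of state entries that were moved.
    """
    moved = 0
    # `optimizer.state` maps each parameter to a dict such as
    # {"step": ..., "exp_avg": ..., "exp_avg_sq": ...} for Adam.
    for state in optimizer.state.values():
        step = state.get("step")
        # Plain ints/floats and CPU tensors are left untouched;
        # only tensor-like counters that report is_cuda are moved.
        if hasattr(step, "cpu") and getattr(step, "is_cuda", False):
            state["step"] = step.cpu()
            moved += 1
    return moved
```

It would need to run once for each optimizer (for example, each entry of `trainer.optimizers`) after the state is restored and before the first optimizer step; where that fits in `train.py` is an assumption left open here. If the installed PyTorch is 1.12.0, upgrading to a later release may also be worth trying, since this assertion was reported against that version.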

Also, I can't get training to run with the standard settings. I get an out-of-memory error on an RTX 3070 Ti (8 GB) unless I run with n-valid = 2 and batch-size = 1.