open-mmlab / mmagic

OpenMMLab Multimodal Advanced, Generative, and Intelligent Creation Toolbox. Unlock the magic 🪄: Generative-AI (AIGC), easy-to-use APIs, awesome model zoo, diffusion models, for text-to-image generation, image/video restoration/enhancement, etc.
https://mmagic.readthedocs.io/en/latest/
Apache License 2.0
6.95k stars 1.06k forks

[Bug] When using resume to continue training, "AssertionError: If capturable=False, state_steps should not be CUDA tensors." occurs #1988

Open zdyshine opened 1 year ago

zdyshine commented 1 year ago

Prerequisite

Task

I have modified the scripts/configs, or I'm working on my own tasks/models/datasets.

Branch

main branch https://github.com/open-mmlab/mmagic

Environment

sys.platform: linux
Python: 3.8.8 (default, Feb 24 2021, 21:46:12) [GCC 7.3.0]
CUDA available: True
numpy_random_seed: 2147483648
GPU 0: NVIDIA GeForce RTX 3090
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.1, V11.1.105
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.12.0+cu113
PyTorch compiling details: PyTorch built with:

TorchVision: 0.13.0+cu113
OpenCV: 4.5.1
MMEngine: 0.8.4
MMCV: 2.0.1
MMCV Compiler: GCC 9.3
MMCV CUDA Compiler: 11.3
MMagic: 1.0.2dev0+unknown

Reproduces the problem - code sample

None

Reproduces the problem - command or script

Config used: configs/real_basicvsr/realbasicvsr_wogan-c64b20-2x30x8_8xb2-lr1e-4-300k_reds.py. The only changes are that the log is printed every 10 iterations and a checkpoint is saved every 50 iterations, with no validation and no testing.

First, run the command below and stop training at iteration 110:
python tools/train.py configs/real_basicvsr/realbasicvsr_wogan-c64b20-2x30x8_8xb2-lr1e-4-300k_reds.py

Then resume training with:
python tools/train.py configs/real_basicvsr/realbasicvsr_wogan-c64b20-2x30x8_8xb2-lr1e-4-300k_reds.py --resume
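The reporter did not post the edited config, so the following is only a guess at what the changes described above might look like as overrides in an MMEngine-style config file. The hook types and field names follow MMEngine's usual conventions; the exact values and the choice to disable validation and testing by setting the corresponding fields to None are assumptions, not the reporter's actual file.

```python
# Hypothetical overrides appended to the real_basicvsr config (a sketch,
# not the reporter's actual edits).
default_hooks = dict(
    # Print the training log every 10 iterations.
    logger=dict(type='LoggerHook', interval=10),
    # Save a checkpoint every 50 iterations (iteration-based training).
    checkpoint=dict(type='CheckpointHook', interval=50, by_epoch=False),
)

# Disable validation and testing entirely.
val_cfg = None
val_dataloader = None
val_evaluator = None
test_cfg = None
test_dataloader = None
test_evaluator = None
```

With settings like these, stopping at iteration 110 leaves iter_50.pth and iter_100.pth in the work directory, which matches the log below resuming from iter_100.pth.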

Reproduces the problem - error message

08/21 10:03:10 - mmengine - INFO - Working directory: ./work_dirs/realbasicvsr_wogan-c64b20-2x30x8_8xb2-lr1e-4-300k_reds
08/21 10:03:10 - mmengine - INFO - Log directory: /test/mmagic/work_dirs/realbasicvsr_wogan-c64b20-2x30x8_8xb2-lr1e-4-300k_reds/20230821_100254
08/21 10:03:19 - mmengine - INFO - Add to optimizer 'generator' ({'type': 'Adam', 'lr': 0.0001, 'betas': (0.9, 0.99)}): 'generator'.
08/21 10:03:24 - mmengine - INFO - Auto resumed from the latest checkpoint ./work_dirs/realbasicvsr_wogan-c64b20-2x30x8_8xb2-lr1e-4-300k_reds/iter_100.pth.
Loads checkpoint by local backend from path: ./work_dirs/realbasicvsr_wogan-c64b20-2x30x8_8xb2-lr1e-4-300k_reds/iter_100.pth
08/21 10:03:24 - mmengine - INFO - Load checkpoint from ./work_dirs/realbasicvsr_wogan-c64b20-2x30x8_8xb2-lr1e-4-300k_reds/iter_100.pth
08/21 10:03:24 - mmengine - INFO - resumed epoch: 0, iter: 100
08/21 10:03:24 - mmengine - WARNING - "FileClient" will be deprecated in future. Please use io functions in https://mmengine.readthedocs.io/en/latest/api/fileio.html#file-io
08/21 10:03:24 - mmengine - WARNING - "HardDiskBackend" is the alias of "LocalBackend" and the former will be deprecated in future.
08/21 10:03:24 - mmengine - INFO - Checkpoints will be saved to ./work_dirs/realbasicvsr_wogan-c64b20-2x30x8_8xb2-lr1e-4-300k_reds.
================== parsed_losses_g: tensor(0.0444, device='cuda:0', grad_fn=)
Traceback (most recent call last):
  File "tools/train.py", line 114, in <module>
    main()
  File "tools/train.py", line 107, in main
    runner.train()
  File "/opt/conda/lib/python3.8/site-packages/mmengine/runner/runner.py", line 1746, in train
    model = self.train_loop.run()  # type: ignore
  File "/opt/conda/lib/python3.8/site-packages/mmengine/runner/loops.py", line 278, in run
    self.run_iter(data_batch)
  File "/opt/conda/lib/python3.8/site-packages/mmengine/runner/loops.py", line 301, in run_iter
    outputs = self.runner.model.train_step(
  File "/test/mmagic/mmagic/models/editors/real_basicvsr/real_basicvsr.py", line 169, in train_step
    log_vars_d = self.g_step_with_optim(
  File "/test/mmagic/mmagic/models/editors/srgan/srgan.py", line 213, in g_step_with_optim
    g_optim_wrapper.update_params(parsed_losses_g)
  File "/opt/conda/lib/python3.8/site-packages/mmengine/optim/optimizer/optimizer_wrapper.py", line 205, in update_params
    self.step(**step_kwargs)
  File "/opt/conda/lib/python3.8/site-packages/mmengine/optim/scheduler/param_scheduler.py", line 115, in wrapper
    return wrapped(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/mmengine/optim/optimizer/optimizer_wrapper.py", line 257, in step
    self.optimizer.step(**kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/optim/optimizer.py", line 109, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/optim/adam.py", line 157, in step
    adam(params_with_grad,
  File "/opt/conda/lib/python3.8/site-packages/torch/optim/adam.py", line 213, in adam
    func(params,
  File "/opt/conda/lib/python3.8/site-packages/torch/optim/adam.py", line 255, in _single_tensor_adam
    assert not step_t.is_cuda, "If capturable=False, state_steps should not be CUDA tensors."
AssertionError: If capturable=False, state_steps should not be CUDA tensors.

Additional information

No response

chen12304 commented 1 year ago

I hit the same problem when resuming basicvsr-pp training. How can it be solved?

jiehuang165 commented 8 months ago

This is a PyTorch bug. Upgrading to PyTorch 1.12.1 solves the problem. Reference: https://github.com/pytorch/pytorch/issues/80809
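As a quick way to tell whether an environment is likely to hit this error before resuming, here is a minimal sketch of a version check. It assumes, as the comment above suggests, that 1.12.0 is the affected release and 1.12.1 is fixed; the helper names parse_release and is_affected_by_capturable_bug are hypothetical and not part of PyTorch, MMEngine, or MMagic.

```python
import re


def parse_release(version: str) -> tuple:
    """Return (major, minor, patch) from a string such as '1.12.0+cu113'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def is_affected_by_capturable_bug(version: str) -> bool:
    """Assumption: only the 1.12.0 release is affected; 1.12.1 is fixed."""
    return parse_release(version) == (1, 12, 0)


if __name__ == "__main__":
    # The version reported in the Environment section of this issue.
    print(is_affected_by_capturable_bug("1.12.0+cu113"))
```

In a real training environment the argument would be torch.__version__. A True result suggests upgrading PyTorch before resuming from a checkpoint.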