open-mmlab / mmengine

OpenMMLab Foundational Library for Training Deep Learning Models
https://mmengine.readthedocs.io/
Apache License 2.0

[Bug] Resuming training after an interruption fails with RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! #1538

Open Helen-Ren-yi opened 6 months ago

Helen-Ren-yi commented 6 months ago

Prerequisite

Environment

System environment:
    sys.platform: linux
    Python: 3.8.19 (default, Mar 20 2024, 19:58:24) [GCC 11.2.0]
    CUDA available: True
    MUSA available: False
    numpy_random_seed: 473525473
    GPU 0,1: GeForce RTX 3090
    CUDA_HOME: /usr/local/cuda
    NVCC: Cuda compilation tools, release 11.2, V11.2.152
    GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
    PyTorch: 1.9.1+cu111
    PyTorch compiling details: PyTorch built with:

Runtime environment:
    cudnn_benchmark: False
    mp_cfg: {'mp_start_method': 'fork', 'opencv_num_threads': 4}
    dist_cfg: {'backend': 'nccl'}
    seed: 473525473
    Distributed launcher: none
    Distributed training: False
    GPU number: 1

Reproduces the problem - code sample

04/30 07:45:22 - mmengine - INFO - Config:
custom_hooks = [ dict(interval=1, type='BasicVisualizationHook'), ]
dataset_type = 'BasicImageDataset'
default_hooks = dict( checkpoint=dict( by_epoch=False, interval=5000, max_keep_ckpts=10, out_dir='./work_dirs', rule=[ 'less', 'greater', 'greater', ], save_best=[ 'MAE', 'PSNR', 'SSIM', ], save_optimizer=True, type='CheckpointHook'), logger=dict(interval=100, type='LoggerHook'), param_scheduler=dict(type='ParamSchedulerHook'), sampler_seed=dict(type='DistSamplerSeedHook'), timer=dict(type='IterTimerHook'))
default_scope = 'mmagic'
env_cfg = dict( cudnn_benchmark=False, dist_cfg=dict(backend='nccl'), mp_cfg=dict(mp_start_method='fork', opencv_num_threads=4))
experiment_name = 'glean_x8_2xb8_cat'
inference_pipeline = [ dict( channel_order='rgb', color_type='color', key='img', type='LoadImageFromFile'), dict( backend='pillow', interpolation='bicubic', keys=[ 'img', ], scale=( 32, 32, ), type='Resize'), dict(type='PackInputs'), ]
launcher = 'none'
load_from = None
log_level = 'INFO'
log_processor = dict(by_epoch=False, type='LogProcessor', window_size=100)
model = dict( data_preprocessor=dict( mean=[ 127.5, 127.5, 127.5, ], std=[ 127.5, 127.5, 127.5, ], type='DataPreprocessor'), discriminator=dict( in_size=256, init_cfg=dict( checkpoint= 'http://download.openmmlab.com/mmediting/stylegan2/official_weights/stylegan2-cat-config-f-official_20210327_172444-15bc485b.pth', prefix='discriminator', type='Pretrained'), type='StyleGANv2Discriminator'), gan_loss=dict( fake_label_val=0, gan_type='vanilla', loss_weight=0.01, real_label_val=1.0, type='GANLoss'), generator=dict( in_size=32, init_cfg=dict( checkpoint= 'http://download.openmmlab.com/mmediting/stylegan2/official_weights/stylegan2-cat-config-f-official_20210327_172444-15bc485b.pth', prefix='generator_ema', type='Pretrained'), out_size=256, style_channels=512, type='GLEANStyleGANv2'), perceptual_loss=dict( criterion='mse', layer_weights=dict({'21': 1.0}), norm_img=False, perceptual_weight=0.01, pretrained='torchvision://vgg16', style_weight=0, type='PerceptualLoss', vgg_type='vgg16'), pixel_loss=dict(loss_weight=1.0, reduction='mean', type='MSELoss'), test_cfg=dict(), train_cfg=dict(), type='SRGAN')
model_wrapper_cfg = dict( find_unused_parameters=True, type='MMSeparateDistributedDataParallel')
optim_wrapper = dict( constructor='MultiOptimWrapperConstructor', discriminator=dict( optimizer=dict(betas=( 0.9, 0.99, ), lr=0.0001, type='Adam'), type='OptimWrapper'), generator=dict( optimizer=dict(betas=( 0.9, 0.99, ), lr=0.0001, type='Adam'), type='OptimWrapper'))
param_scheduler = dict( T_max=600000, by_epoch=False, eta_min=1e-07, type='CosineAnnealingLR')
resume = True
save_dir = './work_dirs'
scale = 8
test_cfg = dict(type='MultiTestLoop')
test_dataloader = dict( dataset=dict( ann_file='meta_info_Cat100_GT.txt', data_prefix=dict(gt='GT', img='BIx8_down'), data_root='data/cat_test', metainfo=dict(dataset_type='cat', task_name='sisr'), pipeline=[ dict( channel_order='rgb', color_type='color', key='img', type='LoadImageFromFile'), dict( channel_order='rgb', color_type='color', key='gt', type='LoadImageFromFile'), dict(type='PackInputs'), ], type='BasicImageDataset'), drop_last=False, num_workers=8, persistent_workers=False, pin_memory=True, sampler=dict(shuffle=False, type='DefaultSampler'))
test_evaluator = [ dict(type='MAE'), dict(type='PSNR'), dict(type='SSIM'), ]
test_pipeline = [ dict( channel_order='rgb', color_type='color', key='img', type='LoadImageFromFile'), dict( channel_order='rgb', color_type='color', key='gt', type='LoadImageFromFile'), dict(type='PackInputs'), ]
train_cfg = dict( max_iters=300000, type='IterBasedTrainLoop', val_interval=5000)
train_dataloader = dict( batch_size=8, dataset=dict( ann_file='meta_info_LSUNcat_GT.txt', data_prefix=dict(gt='GT', img='BIx8_down'), data_root='data/cat_train', metainfo=dict(dataset_type='cat', task_name='sisr'), pipeline=[ dict( channel_order='rgb', color_type='color', key='img', type='LoadImageFromFile'), dict( channel_order='rgb', color_type='color', key='gt', type='LoadImageFromFile'), dict( direction='horizontal', flip_ratio=0.5, keys=[ 'img', 'gt', ], type='Flip'), dict(type='PackInputs'), ], type='BasicImageDataset'), num_workers=8, persistent_workers=False, pin_memory=True, sampler=dict(shuffle=True, type='InfiniteSampler'))
train_pipeline = [ dict( channel_order='rgb', color_type='color', key='img', type='LoadImageFromFile'), dict( channel_order='rgb', color_type='color', key='gt', type='LoadImageFromFile'), dict( direction='horizontal', flip_ratio=0.5, keys=[ 'img', 'gt', ], type='Flip'), dict(type='PackInputs'), ]
val_cfg = dict(type='MultiValLoop')
val_dataloader = dict( dataset=dict( ann_file='meta_info_Cat100_GT.txt', data_prefix=dict(gt='GT', img='BIx8_down'), data_root='data/cat_test', metainfo=dict(dataset_type='cat', task_name='sisr'), pipeline=[ dict( channel_order='rgb', color_type='color', key='img', type='LoadImageFromFile'), dict( channel_order='rgb', color_type='color', key='gt', type='LoadImageFromFile'), dict(type='PackInputs'), ], type='BasicImageDataset'), drop_last=False, num_workers=8, persistent_workers=False, pin_memory=True, sampler=dict(shuffle=False, type='DefaultSampler'))
val_evaluator = [ dict(type='MAE'), dict(type='PSNR'), dict(type='SSIM'), ]
vis_backends = [ dict(type='LocalVisBackend'), ]
visualizer = dict( bgr2rgb=True, fn_key='gt_path', img_keys=[ 'gt_img', 'input', 'pred_img', ], type='ConcatImageVisualizer', vis_backends=[ dict(type='LocalVisBackend'), ])
work_dir = './work_dirs/glean_x8_2xb8_cat'
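
For context on the failure reported below: the CheckpointHook entry above uses save_best=['MAE', 'PSNR', 'SSIM'] with rule=['less', 'greater', 'greater'], so after every validation the hook compares each new metric against the best value recorded so far (and, when resuming, against the best values restored from the checkpoint). A minimal sketch of that bookkeeping with plain Python floats; all names and numbers are illustrative, and only the rule_map line mirrors the one visible in checkpoint_hook.py in the traceback below:

# Illustrative only: not mmengine internals, just the comparison logic implied
# by the save_best/rule entries in the config above.
rule_map = {'greater': lambda x, y: x > y, 'less': lambda x, y: x < y}

rules = {'MAE': 'less', 'PSNR': 'greater', 'SSIM': 'greater'}
best_scores = {'MAE': 0.0500, 'PSNR': 23.0000, 'SSIM': 0.5800}  # e.g. restored on --resume
new_scores = {'MAE': 0.0457, 'PSNR': 23.7792, 'SSIM': 0.5953}   # e.g. from the latest validation

for name, score in new_scores.items():
    if rule_map[rules[name]](score, best_scores[name]):
        print(f'{name}: {score} improves on {best_scores[name]}, a new best checkpoint would be saved')

With plain floats this comparison is harmless; the crash reported below only appears when the two operands are tensors living on different devices.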

04/30 07:45:32 - mmengine - INFO - Distributed training is not used, all SyncBatchNorm (SyncBN) layers in the model will be automatically reverted to BatchNormXd layers if they are used.
04/30 07:45:32 - mmengine - INFO - Hooks will be executed in the following order:

before_run: (VERY_HIGH ) RuntimeInfoHook (BELOW_NORMAL) LoggerHook

before_train: (VERY_HIGH ) RuntimeInfoHook (NORMAL ) IterTimerHook (VERY_LOW ) CheckpointHook

before_train_epoch: (VERY_HIGH ) RuntimeInfoHook (NORMAL ) IterTimerHook (NORMAL ) DistSamplerSeedHook

before_train_iter: (VERY_HIGH ) RuntimeInfoHook (NORMAL ) IterTimerHook

after_train_iter: (VERY_HIGH ) RuntimeInfoHook (NORMAL ) IterTimerHook (NORMAL ) BasicVisualizationHook (BELOW_NORMAL) LoggerHook (LOW ) ParamSchedulerHook (VERY_LOW ) CheckpointHook

after_train_epoch: (NORMAL ) IterTimerHook (LOW ) ParamSchedulerHook (VERY_LOW ) CheckpointHook

before_val: (VERY_HIGH ) RuntimeInfoHook

before_val_epoch: (NORMAL ) IterTimerHook

before_val_iter: (NORMAL ) IterTimerHook

after_val_iter: (NORMAL ) IterTimerHook (NORMAL ) BasicVisualizationHook (BELOW_NORMAL) LoggerHook

after_val_epoch: (VERY_HIGH ) RuntimeInfoHook (NORMAL ) IterTimerHook (BELOW_NORMAL) LoggerHook (LOW ) ParamSchedulerHook (VERY_LOW ) CheckpointHook

after_val: (VERY_HIGH ) RuntimeInfoHook

after_train: (VERY_HIGH ) RuntimeInfoHook (VERY_LOW ) CheckpointHook

before_test: (VERY_HIGH ) RuntimeInfoHook

before_test_epoch: (NORMAL ) IterTimerHook

before_test_iter: (NORMAL ) IterTimerHook

after_test_iter: (NORMAL ) IterTimerHook (NORMAL ) BasicVisualizationHook (BELOW_NORMAL) LoggerHook

after_test_epoch: (VERY_HIGH ) RuntimeInfoHook (NORMAL ) IterTimerHook (BELOW_NORMAL) LoggerHook

after_test: (VERY_HIGH ) RuntimeInfoHook

after_run: (BELOW_NORMAL) LoggerHook

04/30 07:45:33 - mmengine - INFO - Working directory: ./work_dirs/glean_x8_2xb8_cat
04/30 07:45:33 - mmengine - INFO - Log directory: /root/glean/work_dirs/glean_x8_2xb8_cat/20240430_074521
04/30 07:45:33 - mmengine - WARNING - cat is not a meta file, simply parsed as meta information
04/30 07:45:33 - mmengine - WARNING - sisr is not a meta file, simply parsed as meta information
04/30 07:45:35 - mmengine - INFO - Add to optimizer 'generator' ({'type': 'Adam', 'lr': 0.0001, 'betas': (0.9, 0.99)}): 'generator'.
04/30 07:45:35 - mmengine - INFO - Add to optimizer 'discriminator' ({'type': 'Adam', 'lr': 0.0001, 'betas': (0.9, 0.99)}): 'discriminator'.
04/30 07:45:35 - mmengine - WARNING - The prefix is not set in metric class MAE.
04/30 07:45:35 - mmengine - WARNING - The prefix is not set in metric class PSNR.
04/30 07:45:35 - mmengine - WARNING - The prefix is not set in metric class SSIM.
04/30 07:45:36 - mmengine - INFO - load generator_ema in model from: http://download.openmmlab.com/mmediting/stylegan2/official_weights/stylegan2-cat-config-f-official_20210327_172444-15bc485b.pth
Loads checkpoint by http backend from path: http://download.openmmlab.com/mmediting/stylegan2/official_weights/stylegan2-cat-config-f-official_20210327_172444-15bc485b.pth
04/30 07:45:36 - mmengine - WARNING - The model and loaded state dict do not match exactly

Reproduces the problem - command or script

python tools/train.py configs/glean/glean_x8_2xb8_cat.py --resume

Reproduces the problem - error message

04/30 06:43:01 - mmengine - INFO - Saving checkpoint at 275000 iterations
Switch to evaluation style mode: single
04/30 06:43:25 - mmengine - INFO - Iter(val) [100/100] eta: 0:00:00 time: 0.1712 data_time: 0.0235 memory: 3032
04/30 06:43:26 - mmengine - INFO - Iter(val) [100/100] MAE: 0.0457 PSNR: 23.7792 SSIM: 0.5953 data_time: 0.0234 time: 0.1709
Traceback (most recent call last):
  File "tools/train.py", line 114, in <module>
    main()
  File "tools/train.py", line 107, in main
    runner.train()
  File "/root/miniconda3/envs/mmagic/lib/python3.8/site-packages/mmengine/runner/runner.py", line 1778, in train
    model = self.train_loop.run()  # type: ignore
  File "/root/miniconda3/envs/mmagic/lib/python3.8/site-packages/mmengine/runner/loops.py", line 294, in run
    self.runner.val_loop.run()
  File "/root/glean/mmagic/engine/runner/multi_loops.py", line 247, in run
    self._runner.call_hook('after_val_epoch', metrics=multi_metric)
  File "/root/miniconda3/envs/mmagic/lib/python3.8/site-packages/mmengine/runner/runner.py", line 1841, in call_hook
    getattr(hook, fn_name)(self, **kwargs)
  File "/root/miniconda3/envs/mmagic/lib/python3.8/site-packages/mmengine/hooks/checkpoint_hook.py", line 361, in after_val_epoch
    self._save_best_checkpoint(runner, metrics)
  File "/root/miniconda3/envs/mmagic/lib/python3.8/site-packages/mmengine/hooks/checkpoint_hook.py", line 521, in _save_best_checkpoint
    if key_score is None or not self.is_better_than[key_indicator](
  File "/root/miniconda3/envs/mmagic/lib/python3.8/site-packages/mmengine/hooks/checkpoint_hook.py", line 123, in <lambda>
    rule_map = {'greater': lambda x, y: x > y, 'less': lambda x, y: x < y}
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
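
The traceback bottoms out in CheckpointHook's rule_map lambda, i.e. the failing operation is the comparison between the metric just computed during validation and the best score restored from the resumed checkpoint. A minimal, self-contained sketch of that kind of mixed-device comparison (an assumption on my part, guided by the error message; names, shapes and values are illustrative):

import torch

# Hypothetical stand-ins for the two operands of the rule_map lambda:
# one score on the GPU, the other on the CPU (e.g. deserialized from the
# checkpoint onto the CPU).
new_score = torch.tensor([23.7792], device='cuda:0')  # metric from the current validation
best_score = torch.tensor([23.0000])                   # best score left on the CPU

# With the tensors on different devices this comparison raises:
# RuntimeError: Expected all tensors to be on the same device,
# but found at least two devices, cuda:0 and cpu!
print(new_score > best_score)

Casting either operand to a plain float (float(score) or score.item()) before the comparison avoids the mixed-device error.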

Additional information

This happens after I add --resume to the command line. After resuming, training runs for another 5000 iterations, the model automatically saves a checkpoint and runs validation; then, when it should start the next 5000-iteration cycle, it raises the error above and training cannot continue automatically.
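
A possible workaround sketch (an assumption on my part, not an official fix; it only covers the case where the freshly computed metrics are tensors, and the best score restored from the checkpoint may need the same treatment): register a small custom hook that casts tensor-valued metrics to plain Python floats before CheckpointHook compares them. In the after_val_epoch stage CheckpointHook runs at VERY_LOW priority, so a NORMAL-priority hook receives and can rewrite the same metrics dict first. The hook name below is hypothetical:

import torch
from mmengine.hooks import Hook
from mmengine.registry import HOOKS


@HOOKS.register_module()
class MetricsToFloatHook(Hook):
    """Hypothetical workaround: cast tensor-valued metrics to floats so that
    CheckpointHook never compares tensors on different devices."""

    priority = 'NORMAL'  # runs before CheckpointHook (VERY_LOW) in after_val_epoch

    def after_val_epoch(self, runner, metrics=None):
        if not metrics:
            return
        for key, value in metrics.items():
            if isinstance(value, torch.Tensor):
                # .item() copies the scalar back to CPU as a plain Python number.
                metrics[key] = value.item()

It could then be enabled next to the existing custom hook in the config, e.g. custom_hooks = [dict(interval=1, type='BasicVisualizationHook'), dict(type='MetricsToFloatHook')].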