open-mmlab / mmselfsup

OpenMMLab Self-Supervised Learning Toolbox and Benchmark
https://mmselfsup.readthedocs.io/en/latest/
Apache License 2.0

Error when training an MAE model on CIFAR #697

Open Jia-Baos opened 1 year ago

Jia-Baos commented 1 year ago

This is the config file:

```python
_base_ = [
    '../_base_/models/mae_vit-base-p16.py',
    '../_base_/datasets/cifar.py',
    '../_base_/schedules/adamw_coslr-200e_in1k.py',
    '../_base_/default_runtime.py',
]

# >>>>>>>>>>>>>>>>>>>>> Start of Changed >>>>>>>>>>>>>>>>>>>>>>>>>

# dataset settings
data_root = 'data/cifar'
file_client_args = dict(backend='disk')

data_source = 'CIFAR10'
dataset_type = 'SingleViewDataset'
img_norm_cfg = dict(mean=[0.4914, 0.4822, 0.4465], std=[0.2023, 0.1994, 0.201])
train_pipeline = [
    dict(type='RandomCrop', size=32, padding=4),
    dict(type='RandomHorizontalFlip'),
]
test_pipeline = []

# prefetch
prefetch = False
if not prefetch:
    train_pipeline.extend(
        [dict(type='ToTensor'),
         dict(type='Normalize', **img_norm_cfg)])
    test_pipeline.extend(
        [dict(type='ToTensor'),
         dict(type='Normalize', **img_norm_cfg)])

# dataset summary
data = dict(
    samples_per_gpu=128,
    workers_per_gpu=2,
    train=dict(
        type=dataset_type,
        data_source=dict(type=data_source, data_prefix='data/cifar'),
        pipeline=train_pipeline,
        prefetch=prefetch),
    val=dict(
        type=dataset_type,
        data_source=dict(type=data_source, data_prefix='data/cifar'),
        pipeline=test_pipeline,
        prefetch=prefetch),
    test=dict(
        type=dataset_type,
        data_source=dict(type=data_source, data_prefix='data/cifar'),
        pipeline=test_pipeline,
        prefetch=prefetch))

evaluation = dict(interval=10, topk=(1, 5))

# dataset 8 x 512
train_dataloader = dict(batch_size=128, num_workers=8)

# <<<<<<<<<<<<<<<<<<<<<< End of Changed <<<<<<<<<<<<<<<<<<<<<<<<<<<

# optimizer wrapper
optimizer = dict(
    type='AdamW', lr=1.5e-4 * 4096 / 256, betas=(0.9, 0.95), weight_decay=0.05)
optim_wrapper = dict(
    type='OptimWrapper',
    optimizer=optimizer,
    paramwise_cfg=dict(
        custom_keys={
            'ln': dict(decay_mult=0.0),
            'bias': dict(decay_mult=0.0),
            'pos_embed': dict(decay_mult=0.),
            'mask_token': dict(decay_mult=0.),
            'cls_token': dict(decay_mult=0.)
        }))

# learning rate scheduler
param_scheduler = [
    dict(
        type='LinearLR',
        start_factor=1e-4,
        by_epoch=True,
        begin=0,
        end=40,
        convert_to_iter_based=True),
    dict(
        type='CosineAnnealingLR',
        T_max=360,
        by_epoch=True,
        begin=40,
        end=400,
        convert_to_iter_based=True)
]

# runtime settings
# pre-train for 400 epochs
train_cfg = dict(max_epochs=3)
default_hooks = dict(
    logger=dict(type='LoggerHook', interval=100),
    # only keeps the latest 3 checkpoints
    checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))

# randomness
randomness = dict(seed=0, diff_rank_seed=True)
resume = True
```

This is the log:

```text
During handling of the above exception, another exception occurred:

2023/02/20 01:19:00 - mmengine - INFO -
    System environment:
        sys.platform: linux
        Python: 3.8.16 (default, Jan 17 2023, 23:13:24) [GCC 11.2.0]
        CUDA available: True
        numpy_random_seed: 0
        GPU 0: NVIDIA GeForce RTX 3090
        CUDA_HOME: /data/apps/cuda/11.1
        NVCC: Cuda compilation tools, release 11.1, V11.1.74
        GCC: gcc (GCC) 7.3.0
        PyTorch: 1.10.0+cu111
        PyTorch compiling details: PyTorch built with:

    Runtime environment:
        cudnn_benchmark: False
        mp_cfg: {'mp_start_method': 'fork', 'opencv_num_threads': 0}
        dist_cfg: {'backend': 'nccl'}
        seed: 0
        diff_rank_seed: True
        Distributed launcher: none
        Distributed training: False
        GPU number: 1

2023/02/20 01:19:00 - mmengine - INFO - Config:
model = dict(
    type='MAE',
    data_preprocessor=dict(
        mean=[123.675, 116.28, 103.53],
        std=[58.395, 57.12, 57.375],
        bgr_to_rgb=True),
    backbone=dict(type='MAEViT', arch='b', patch_size=16, mask_ratio=0.75),
    neck=dict(
        type='MAEPretrainDecoder',
        patch_size=16,
        in_chans=3,
        embed_dim=768,
        decoder_embed_dim=512,
        decoder_depth=8,
        decoder_num_heads=16,
        mlp_ratio=4.0),
    head=dict(
        type='MAEPretrainHead',
        norm_pix=True,
        patch_size=16,
        loss=dict(type='MAEReconstructionLoss')),
    init_cfg=[
        dict(type='Xavier', distribution='uniform', layer='Linear'),
        dict(type='Constant', layer='LayerNorm', val=1.0, bias=0.0)
    ])
optimizer = dict(type='AdamW', lr=0.0024, betas=(0.9, 0.95), weight_decay=0.05)
optim_wrapper = dict(
    type='OptimWrapper',
    optimizer=dict(
        type='AdamW', lr=0.0024, betas=(0.9, 0.95), weight_decay=0.05),
    paramwise_cfg=dict(
        custom_keys=dict(
            ln=dict(decay_mult=0.0),
            bias=dict(decay_mult=0.0),
            pos_embed=dict(decay_mult=0.0),
            mask_token=dict(decay_mult=0.0),
            cls_token=dict(decay_mult=0.0))))
param_scheduler = [
    dict(
        type='LinearLR',
        start_factor=0.0001,
        by_epoch=True,
        begin=0,
        end=40,
        convert_to_iter_based=True),
    dict(
        type='CosineAnnealingLR',
        T_max=360,
        by_epoch=True,
        begin=40,
        end=400,
        convert_to_iter_based=True)
]
train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=3)
default_scope = 'mmselfsup'
default_hooks = dict(
    runtime_info=dict(type='RuntimeInfoHook'),
    timer=dict(type='IterTimerHook'),
    logger=dict(type='LoggerHook', interval=100),
    param_scheduler=dict(type='ParamSchedulerHook'),
    checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3),
    sampler_seed=dict(type='DistSamplerSeedHook'))
env_cfg = dict(
    cudnn_benchmark=False,
    mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
    dist_cfg=dict(backend='nccl'))
log_processor = dict(
    window_size=10,
    custom_cfg=[dict(data_src='', method='mean', window_size='global')])
vis_backends = [dict(type='LocalVisBackend')]
visualizer = dict(
    type='SelfSupVisualizer',
    vis_backends=[dict(type='LocalVisBackend')],
    name='visualizer')
log_level = 'INFO'
load_from = None
resume = True
data_source = 'CIFAR10'
dataset_type = 'SingleViewDataset'
img_norm_cfg = dict(mean=[0.4914, 0.4822, 0.4465], std=[0.2023, 0.1994, 0.201])
train_pipeline = [
    dict(type='RandomCrop', size=32, padding=4),
    dict(type='RandomHorizontalFlip'),
    dict(type='ToTensor'),
    dict(
        type='Normalize',
        mean=[0.4914, 0.4822, 0.4465],
        std=[0.2023, 0.1994, 0.201])
]
test_pipeline = [
    dict(type='ToTensor'),
    dict(
        type='Normalize',
        mean=[0.4914, 0.4822, 0.4465],
        std=[0.2023, 0.1994, 0.201])
]
prefetch = False
data = dict(
    samples_per_gpu=128,
    workers_per_gpu=2,
    train=dict(
        type='SingleViewDataset',
        data_source=dict(type='CIFAR10', data_prefix='data/cifar'),
        pipeline=[
            dict(type='RandomCrop', size=32, padding=4),
            dict(type='RandomHorizontalFlip'),
            dict(type='ToTensor'),
            dict(
                type='Normalize',
                mean=[0.4914, 0.4822, 0.4465],
                std=[0.2023, 0.1994, 0.201])
        ],
        prefetch=False),
    val=dict(
        type='SingleViewDataset',
        data_source=dict(type='CIFAR10', data_prefix='data/cifar'),
        pipeline=[
            dict(type='ToTensor'),
            dict(
                type='Normalize',
                mean=[0.4914, 0.4822, 0.4465],
                std=[0.2023, 0.1994, 0.201])
        ],
        prefetch=False),
    test=dict(
        type='SingleViewDataset',
        data_source=dict(type='CIFAR10', data_prefix='data/cifar'),
        pipeline=[
            dict(type='ToTensor'),
            dict(
                type='Normalize',
                mean=[0.4914, 0.4822, 0.4465],
                std=[0.2023, 0.1994, 0.201])
        ],
        prefetch=False))
train_dataloader = dict(batch_size=128, num_workers=8)
randomness = dict(seed=0, diff_rank_seed=True)
launcher = 'none'
work_dir = 'work'

2023/02/20 01:19:00 - mmengine - WARNING - The "visualizer" registry in mmselfsup did not set import location. Fallback to call mmselfsup.utils.register_all_modules instead.
2023/02/20 01:19:00 - mmengine - WARNING - The "vis_backend" registry in mmselfsup did not set import location. Fallback to call mmselfsup.utils.register_all_modules instead.
2023/02/20 01:19:01 - mmengine - WARNING - The "model" registry in mmselfsup did not set import location. Fallback to call mmselfsup.utils.register_all_modules instead.
2023/02/20 01:19:06 - mmengine - INFO - Distributed training is not used, all SyncBatchNorm (SyncBN) layers in the model will be automatically reverted to BatchNormXd layers if they are used.
2023/02/20 01:19:06 - mmengine - WARNING - The "hook" registry in mmselfsup did not set import location. Fallback to call mmselfsup.utils.register_all_modules instead.
2023/02/20 01:19:06 - mmengine - INFO - Hooks will be executed in the following order:

before_run:
(VERY_HIGH   ) RuntimeInfoHook
(BELOW_NORMAL) LoggerHook

before_train:
(VERY_HIGH   ) RuntimeInfoHook
(NORMAL      ) IterTimerHook
(VERY_LOW    ) CheckpointHook

before_train_epoch:
(VERY_HIGH   ) RuntimeInfoHook
(NORMAL      ) IterTimerHook
(NORMAL      ) DistSamplerSeedHook

before_train_iter:
(VERY_HIGH   ) RuntimeInfoHook
(NORMAL      ) IterTimerHook

after_train_iter:
(VERY_HIGH   ) RuntimeInfoHook
(NORMAL      ) IterTimerHook
(BELOW_NORMAL) LoggerHook
(LOW         ) ParamSchedulerHook
(VERY_LOW    ) CheckpointHook

after_train_epoch:
(NORMAL      ) IterTimerHook
(LOW         ) ParamSchedulerHook
(VERY_LOW    ) CheckpointHook

before_val_epoch:
(NORMAL      ) IterTimerHook

before_val_iter:
(NORMAL      ) IterTimerHook

after_val_iter:
(NORMAL      ) IterTimerHook
(BELOW_NORMAL) LoggerHook

after_val_epoch:
(VERY_HIGH   ) RuntimeInfoHook
(NORMAL      ) IterTimerHook
(BELOW_NORMAL) LoggerHook
(LOW         ) ParamSchedulerHook
(VERY_LOW    ) CheckpointHook

before_test_epoch:
(NORMAL      ) IterTimerHook

before_test_iter:
(NORMAL      ) IterTimerHook

after_test_iter:
(NORMAL      ) IterTimerHook
(BELOW_NORMAL) LoggerHook

after_test_epoch:
(VERY_HIGH   ) RuntimeInfoHook
(NORMAL      ) IterTimerHook
(BELOW_NORMAL) LoggerHook

after_run:
(BELOW_NORMAL) LoggerHook

2023/02/20 01:19:07 - mmengine - WARNING - The "loop" registry in mmselfsup did not set import location. Fallback to call mmselfsup.utils.register_all_modules instead.

Traceback (most recent call last): File "/HOME/scz3182/.conda/envs/openmmlab/lib/python3.8/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg obj = obj_cls(**args) # type: ignore File "/HOME/scz3182/.conda/envs/openmmlab/lib/python3.8/site-packages/mmengine/runner/loops.py", line 43, in init super().init(runner, dataloader) File "/HOME/scz3182/.conda/envs/openmmlab/lib/python3.8/site-packages/mmengine/runner/base_loop.py", line 26, in init self.dataloader = runner.build_dataloader( File "/HOME/scz3182/.conda/envs/openmmlab/lib/python3.8/site-packages/mmengine/runner/runner.py", line 1331, in build_dataloader dataset_cfg = dataloader_cfg.pop('dataset') KeyError: 'dataset'

Traceback (most recent call last): File "tools/train.py", line 99, in main() File "tools/train.py", line 95, in main runner.train() File "/HOME/scz3182/.conda/envs/openmmlab/lib/python3.8/site-packages/mmengine/runner/runner.py", line 1656, in train self._train_loop = self.build_train_loop( File "/HOME/scz3182/.conda/envs/openmmlab/lib/python3.8/site-packages/mmengine/runner/runner.py", line 1448, in build_train_loop loop = LOOPS.build( File "/HOME/scz3182/.conda/envs/openmmlab/lib/python3.8/site-packages/mmengine/registry/registry.py", line 521, in build return self.build_func(cfg, *args, **kwargs, registry=self) File "/HOME/scz3182/.conda/envs/openmmlab/lib/python3.8/site-packages/mmengine/registry/build_functions.py", line 135, in build_from_cfg raise type(e)( KeyError: "class EpochBasedTrainLoop in mmengine/runner/loops.py: 'dataset'"

YuanLiuuuuuu commented 1 year ago

It seems your config is from the MMSelfSup 0.x version, but the log shows you are using MMEngine to start your training job. Try pulling and checking out the MMSelfSup 1.x branch, and use the new MAE config.
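For context, the `KeyError: 'dataset'` in the log comes from MMEngine's `Runner.build_dataloader`, which pops a `dataset` key out of the `train_dataloader` config; the 0.x-style `data = dict(...)` block never provides one. Below is a minimal sketch of the 1.x-style dataloader, roughly following the shape of the 1.x MAE ImageNet configs; the exact paths and values here are illustrative assumptions, not the upstream file:

```python
# MMSelfSup 1.x / MMEngine style: the dataset config lives *inside*
# train_dataloader -- this is exactly the `dataset` key that
# Runner.build_dataloader pops, hence the KeyError with a 0.x config.
train_dataloader = dict(
    batch_size=128,
    num_workers=8,
    persistent_workers=True,
    sampler=dict(type='DefaultSampler', shuffle=True),
    collate_fn=dict(type='default_collate'),
    dataset=dict(
        type='mmcls.ImageNet',                 # dataset type used by 1.x configs
        data_root='data/imagenet',             # illustrative path
        ann_file='meta/train.txt',             # illustrative annotation file
        data_prefix=dict(img_path='train/'),
        pipeline=train_pipeline))
```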

Jia-Baos commented 1 year ago

Emm, that's a great idea. Using the MMSelfSup 1.x branch to train MAE on my own dataset worked (see: https://mmselfsup.readthedocs.io/zh_CN/dev-1.x/user_guides/4_pretrain_custom_dataset.html), but now I want to train MAE on CIFAR10. How should I change the dataset part of the config? I can't find an example in https://github.com/open-mmlab/mmselfsup/tree/1.x/configs/selfsup/_base_/datasets, which is all about ImageNet.

YuanLiuuuuuu commented 1 year ago

You can refer to this doc. Before you pre-train your model on CIFAR10, you should refactor the CIFAR10 folder into the ImageNet1K style and also create an ImageNet1K-style annotation file. After that, you only need to make a few changes to the config: replace data_root and ann_file with your own settings.
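For anyone reproducing this, here is a minimal sketch (not from the thread) of exporting CIFAR-10 into an ImageNet1K-style folder plus a flat annotation file, assuming torchvision is installed; all paths, file names, and the annotation format shown are illustrative assumptions:

```python
# Export the CIFAR-10 train split to ImageNet1K-style class folders and
# write a meta/train.txt annotation file. Resulting layout (assumption):
#   data/cifar10/train/<class_name>/<idx>.png
#   data/cifar10/meta/train.txt  -- lines: "<class_name>/<idx>.png <label>"
import os
from torchvision.datasets import CIFAR10

root = 'data/cifar10'
dataset = CIFAR10(root='data', train=True, download=True)

os.makedirs(os.path.join(root, 'meta'), exist_ok=True)
lines = []
for idx, (img, label) in enumerate(dataset):
    cls = dataset.classes[label]                # e.g. 'airplane'
    cls_dir = os.path.join(root, 'train', cls)
    os.makedirs(cls_dir, exist_ok=True)
    fname = f'{idx:05d}.png'
    img.save(os.path.join(cls_dir, fname))      # img is a PIL.Image
    lines.append(f'{cls}/{fname} {label}')

with open(os.path.join(root, 'meta', 'train.txt'), 'w') as f:
    f.write('\n'.join(lines) + '\n')
```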

Jia-Baos commented 1 year ago

Thanks~

mrFocusXin commented 1 year ago

Hi, have you solved the problem? I also want to use CIFAR in MMSelfSup. Can you provide your CIFAR directory structure? Thanks!

Jia-Baos commented 1 year ago

Emm, I didn't use CIFAR10 in the end; I just organized my own dataset into the ImageNet1K style and created an ImageNet1K-style annotation file. If you want to use CIFAR, a good idea is to convert CIFAR to the ImageNet1K style or the mmcls.CustomDataset style (see: https://mmselfsup.readthedocs.io/zh_CN/dev-1.x/user_guides/4_pretrain_custom_dataset.html). A sketch of the resulting config follows below.
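Following the linked guide, the dataset section of the 1.x config would then look roughly like this; `data_root` and `ann_file` are assumptions matching the export sketched earlier, not values from the thread:

```python
# Point the 1.x pre-training config at an ImageNet1K-style folder via
# mmcls.CustomDataset, per the linked user guide (values illustrative).
# With an empty ann_file, CustomDataset scans class sub-folders instead.
train_dataloader = dict(
    batch_size=128,
    num_workers=8,
    dataset=dict(
        type='mmcls.CustomDataset',
        data_root='data/cifar10',
        ann_file='meta/train.txt',   # or '' to scan class sub-folders
        data_prefix='train',
        pipeline=train_pipeline))
```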

mrFocusXin commented 1 year ago

@Jia-Baos Thank you for your reply! What exactly is the ImageNet1K-style format? I found the ImageNet1K dataset on Google, but not the meta folder, and I only found one picture of the layout in the mmselfsup docs, so I don't know the specific structure. For example, what should be under the meta folder? I have now prepared my own dataset, but I don't know exactly what ImageNet1K-style is. Would you mind giving me an example or documentation? Thank you very much; your reply is very helpful to me!

Jia-Baos commented 1 year ago

I'm very pleased that I could help. You can refer to the file 北京超算30区使用MMClassification训练花卉图片分类模型.pdf (training a flower image classification model with MMClassification on Beijing Supercomputing Center Zone 30) in https://github.com/Jia-Baos/OpenMM.

mrFocusXin commented 1 year ago

Thank you very much for your reply. It works!!

alaa-shubbak commented 1 year ago

> I'm very pleased that I could help. You can refer to the file 北京超算30区使用MMClassification训练花卉图片分类模型.pdf in https://github.com/Jia-Baos/OpenMM.

Hello, I cannot find such a PDF file. Also, I faced an issue while training my model on my custom dataset; I got this error: KeyError: 'SelfSupVisualizer is not in the visualizer registry. Please check whether the value of `SelfSupVisualizer` is correct or it was registered as expected. More details can be found at https://mmengine.readthedocs.io/en/latest/advanced_tutorials/config.html#import-the-custom-module'
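For later readers hitting the same registry error: it usually means MMSelfSup's modules were never registered in the running process (the warnings in the log above fall back to the same helper). A hedged check, using the `register_all_modules` utility that the log itself names:

```python
# Register all MMSelfSup modules (models, datasets, hooks, SelfSupVisualizer,
# ...) into the MMEngine registries before the Runner builds anything.
# MMSelfSup 1.x's tools/train.py normally does this; launching training any
# other way can leave SelfSupVisualizer unregistered, as in the error above.
from mmselfsup.utils import register_all_modules

register_all_modules()
```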

Jia-Baos commented 1 year ago

Emm, you can find the file in my repository OpenMM. As for the error, I have moved on to optical flow estimation, so I can't help you further...