open-mmlab / mmselfsup

OpenMMLab Self-Supervised Learning Toolbox and Benchmark
https://mmselfsup.readthedocs.io/en/latest/
Apache License 2.0

print loss: nan when I train CAE. #589

Closed. LUOBO123LUOBO123 closed this issue 1 year ago.

LUOBO123LUOBO123 commented 1 year ago

Branch

1.x branch (1.x version, such as v1.0.0rc2, or dev-1.x branch)

Prerequisite

Environment

Consistent with the official environment.

Describe the bug

[Screenshot: training log showing the printed loss becoming nan]

Hello, I changed the input resolution to 416×416 when training on a custom dataset. After the network has trained for 49 epochs, the printed loss becomes nan. What could be the reason for this?
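As an aside for anyone debugging a similar case: failing fast at the first non-finite loss makes it easier to pinpoint the offending iteration and batch. This is a generic sketch, not an mmselfsup API; `check_finite_loss` is a hypothetical helper you would call from your own training loop or hook:

```python
import torch

def check_finite_loss(loss: torch.Tensor, iter_idx: int) -> None:
    """Hypothetical debugging helper: raise as soon as the (scalar) loss
    leaves the finite range, so the offending iteration can be inspected."""
    if not torch.isfinite(loss).all():
        raise RuntimeError(f'Non-finite loss {loss} at iteration {iter_idx}')
```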

Reproduces the problem - code sample

No response

Reproduces the problem - command or script

No response

Reproduces the problem - error message

No response

Additional information

These are my parameters. I train the model with two 2080 Ti cards.

```python
model = dict(
    type='CAE',
    backbone=dict(
        type='CAEViT',
        arch='b',
        patch_size=16,
        init_values=0.1,
        qkv_bias=False),
    neck=dict(
        type='CAENeck',
        patch_size=16,
        embed_dims=768,
        num_heads=12,
        regressor_depth=4,
        decoder_depth=4,
        mlp_ratio=4,
        init_values=0.1),
    head=dict(
        type='CAEHead',
        tokenizer_path='cae_ckpt/dalle_encoder.pth',
        lambd=2),
    base_momentum=0.0)
data_source = 'ImageNet'
dataset_type = 'SingleViewDataset'
img_norm_cfg = dict(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
train_pipeline = [
    dict(type='RandomHorizontalFlip', p=0.5),
    dict(
        type='RandomResizedCropAndInterpolationWithTwoPic',
        size=416,
        second_size=208,
        interpolation='bicubic',
        second_interpolation='lanczos',
        scale=(0.08, 1.0)),
    dict(type='ToTensor'),
    dict(
        type='BEiTMaskGenerator',
        input_size=(26, 26),
        num_masking_patches=75,
        max_num_patches=None,
        min_num_patches=16)
]
prefetch = False
data = dict(
    samples_per_gpu=5,
    workers_per_gpu=8,
    train=dict(
        type='SingleViewDataset',
        data_source=dict(
            type='ImageNet',
            data_prefix='data_own/imagenet/train/n01440764/',
            ann_file='data_own/imagenet/meta/train.txt'),
        pipeline=[
            dict(type='RandomHorizontalFlip', p=0.5),
            dict(
                type='RandomResizedCropAndInterpolationWithTwoPic',
                size=416,
                second_size=208,
                interpolation='bicubic',
                second_interpolation='lanczos',
                scale=(0.08, 1.0)),
            dict(type='ToTensor'),
            dict(
                type='BEiTMaskGenerator',
                input_size=(26, 26),
                num_masking_patches=75,
                max_num_patches=None,
                min_num_patches=16)
        ],
        prefetch=False))
optimizer = dict(
    type='AdamW',
    lr=0.0015,
    betas=(0.9, 0.999),
    weight_decay=0.05,
    paramwise_options=dict(
        norm=dict(weight_decay=0.0),
        bias=dict(weight_decay=0.0),
        gamma=dict(weight_decay=0.0)))
optimizer_config = dict(grad_clip=dict(max_norm=3.0))
lr_config = dict(
    policy='StepFixCosineAnnealing',
    min_lr=1e-05,
    warmup='linear',
    warmup_iters=10,
    warmup_ratio=0.0001,
    warmup_by_epoch=True,
    by_epoch=False)
runner = dict(type='EpochBasedRunner', max_epochs=300)
```
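As a quick sanity check on the resolution change, the resolution-dependent fields above do stay mutually consistent. A minimal sketch; the dVAE downsampling factor of 8 and the 224-input reference values (14×14 grid, 75 masked patches) are assumptions to verify against the reference CAE config:

```python
# Sanity-check sketch for the resolution-dependent settings in the config above.
# ASSUMPTIONS: the DALL-E dVAE tokenizer downsamples by 8, and the 224-input
# reference setup uses a 14x14 patch grid with num_masking_patches=75.
img_size, patch_size = 416, 16
second_size, tokenizer_stride = 208, 8

patch_grid = img_size // patch_size            # 26 -> matches input_size=(26, 26)
token_grid = second_size // tokenizer_stride   # 26 -> must equal patch_grid
assert patch_grid == token_grid

num_patches = patch_grid ** 2                  # 676
mask_ratio = 75 / num_patches                  # ~0.11 at 416x416
ref_mask_ratio = 75 / 14 ** 2                  # ~0.38 at the assumed 224 setup
print(f'mask ratio {mask_ratio:.2f} vs reference ~{ref_mask_ratio:.2f}')
```

Note that keeping num_masking_patches=75 at 416×416 masks a much smaller fraction of patches than at 224×224, which changes the pretext task even though the shapes line up.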

YuanLiuuuuuu commented 1 year ago

You should tune these hyperparameters, since your config is not consistent with the one we provide.
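One concrete hyperparameter worth revisiting is the learning rate relative to batch size: with samples_per_gpu=5 on two GPUs, the total batch is only 10, while lr=0.0015 is kept from the reference setup. A minimal sketch of the linear scaling rule; the base batch size of 2048 is an assumption drawn from the 8-GPU reference configuration, not stated in this issue:

```python
# Linear scaling rule sketch: scale the reference learning rate by the ratio
# of the actual total batch size to the reference total batch size.
# ASSUMPTION: base_lr and base_batch_size reflect the reference CAE setup
# (8 GPUs x 256 samples); verify against the config you started from.
base_lr = 1.5e-3
base_batch_size = 2048                         # assumed reference total batch
samples_per_gpu = 5
num_gpus = 2

total_batch_size = samples_per_gpu * num_gpus  # 10 in this issue
scaled_lr = base_lr * total_batch_size / base_batch_size
print(f'total batch = {total_batch_size}, suggested lr ~ {scaled_lr:.2e}')
# total batch = 10, suggested lr ~ 7.32e-06
```

A large learning rate at a very small batch size is a common cause of a loss that diverges to nan partway through training.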

LUOBO123LUOBO123 commented 1 year ago

Thank you for your reply. I am training on my custom dataset, which is why I changed these hyperparameters. I will now lower the learning rate.
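For illustration, lowering the rate in this config only means touching `optimizer.lr` (and optionally the warmup). The concrete numbers below are arbitrary illustrative choices, not values from the issue or the reference config:

```python
# Illustrative only: a lowered learning rate with a longer warmup.
# The specific values are assumptions; they just show which fields to change.
optimizer = dict(
    type='AdamW',
    lr=1.5e-4,                 # 10x lower than the original 1.5e-3 (illustrative)
    betas=(0.9, 0.999),
    weight_decay=0.05,
    paramwise_options=dict(
        norm=dict(weight_decay=0.0),
        bias=dict(weight_decay=0.0),
        gamma=dict(weight_decay=0.0)))
lr_config = dict(
    policy='StepFixCosineAnnealing',
    min_lr=1e-05,
    warmup='linear',
    warmup_iters=20,           # longer warmup than the original 10 (illustrative)
    warmup_ratio=0.0001,
    warmup_by_epoch=True,
    by_epoch=False)
```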

YuanLiuuuuuu commented 1 year ago

Closing this as solved. If you still have problems, feel free to reopen this issue.

LUOBO123LUOBO123 commented 1 year ago

> Closing this as solved. If you still have problems, feel free to reopen this issue.

OK