open-mmlab / mmpretrain

OpenMMLab Pre-training Toolbox and Benchmark
https://mmpretrain.readthedocs.io/en/latest/
Apache License 2.0

[Bug] Unable to train downstream classification task for pretrained SimCLR model based on ConvNext #1599

Open guneetmutreja opened 1 year ago

guneetmutreja commented 1 year ago

Branch

main branch (mmpretrain version)

Describe the bug

Issue explanation

I want to try the SimCLR model with a ConvNeXt backbone. I was able to pretrain a model: the loss decreased for some epochs and then stabilized at around 3.2. I then tried using the saved pretrained checkpoint to train a classifier, but in that run the loss is reported as nan and the accuracy stays fixed at 16.21 for every epoch. Can you please help me identify the mistake I am making here?

Pretraining command

python tools/train.py configs/simclr/simclr_convnext_base_16xb32-coslr-200e_custom.py

Config file used for pretraining

_base_ = [
    '../_base_/datasets/custom_bs32_simclr.py',
    '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
    '../_base_/default_runtime.py',
]

train_dataloader = dict(batch_size=32)

model = dict(
    type='SimCLR',
    backbone=dict(
        type='ConvNeXt',
        arch='base'),
    neck=dict(
        type='NonLinearNeck',  # SimCLR non-linear neck
        in_channels=1024,
        hid_channels=1024,
        out_channels=128,
        num_layers=2,
        with_avg_pool=False),
    head=dict(
        type='ContrastiveHead',
        loss=dict(type='CrossEntropyLoss'),
        temperature=0.1),
)

optim_wrapper = dict(
    optimizer=dict(lr=4e-3),
    clip_grad=dict(max_norm=5.0),
)

default_hooks = dict(
    checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=3))

Pretraining logs

05/24 08:17:20 - mmengine - INFO - Epoch(train)   [1][100/235]  lr: 8.8189e-05  eta: 15:58:41  time: 0.8107  data_time: 0.0011  memory: 15652  grad_norm: 47.5290  loss: 7.5065
05/24 08:18:42 - mmengine - INFO - Epoch(train)   [1][200/235]  lr: 1.7323e-04  eta: 15:58:15  time: 0.8176  data_time: 0.0010  memory: 15652  grad_norm: 60.8427  loss: 6.6105
05/24 08:19:10 - mmengine - INFO - Exp name: simclr_convnext_base_16xb32-coslr-200e_custom_20230524_081546
05/24 08:20:33 - mmengine - INFO - Epoch(train)   [2][100/235]  lr: 2.8803e-04  eta: 15:59:15  time: 0.8235  data_time: 0.0010  memory: 15652  grad_norm: 112.4985  loss: 5.5410
05/24 08:21:55 - mmengine - INFO - Epoch(train)   [2][200/235]  lr: 3.7307e-04  eta: 15:59:02  time: 0.8202  data_time: 0.0010  memory: 15652  grad_norm: 22.9905  loss: 5.7491
05/24 08:22:24 - mmengine - INFO - Exp name: simclr_convnext_base_16xb32-coslr-200e_custom_20230524_081546
05/24 08:23:46 - mmengine - INFO - Epoch(train)   [3][100/235]  lr: 4.8787e-04  eta: 15:57:35  time: 0.8219  data_time: 0.0011  memory: 15652  grad_norm: 49.0663  loss: 5.5631
05/24 08:25:09 - mmengine - INFO - Epoch(train)   [3][200/235]  lr: 5.7291e-04  eta: 15:56:58  time: 0.8293  data_time: 0.0015  memory: 15652  grad_norm: 43.3131  loss: 4.8379
05/24 08:25:37 - mmengine - INFO - Exp name: simclr_convnext_base_16xb32-coslr-200e_custom_20230524_081546
05/24 08:27:00 - mmengine - INFO - Epoch(train)   [4][100/235]  lr: 6.8772e-04  eta: 15:55:52  time: 0.8322  data_time: 0.0012  memory: 15652  grad_norm: 1.6503  loss: 4.0470
05/24 08:28:23 - mmengine - INFO - Epoch(train)   [4][200/235]  lr: 7.7276e-04  eta: 15:54:48  time: 0.8236  data_time: 0.0013  memory: 15652  grad_norm: 0.0265  loss: 4.1430
05/24 08:28:51 - mmengine - INFO - Exp name: simclr_convnext_base_16xb32-coslr-200e_custom_20230524_081546
05/24 08:29:41 - mmengine - INFO - Exp name: simclr_convnext_base_16xb32-coslr-200e_custom_20230524_081546
05/24 08:30:14 - mmengine - INFO - Epoch(train)   [5][100/235]  lr: 8.8756e-04  eta: 15:53:18  time: 0.8223  data_time: 0.0011  memory: 15652  grad_norm: 1.0941  loss: 4.0479
05/24 08:31:37 - mmengine - INFO - Epoch(train)   [5][200/235]  lr: 9.7260e-04  eta: 15:51:58  time: 0.8227  data_time: 0.0010  memory: 15652  grad_norm: 1.1026  loss: 4.0223
05/24 08:32:05 - mmengine - INFO - Exp name: simclr_convnext_base_16xb32-coslr-200e_custom_20230524_081546
05/24 08:33:28 - mmengine - INFO - Epoch(train)   [6][100/235]  lr: 1.0874e-03  eta: 15:50:15  time: 0.8225  data_time: 0.0010  memory: 15652  grad_norm: 2.4526  loss: 3.9094
05/24 08:34:51 - mmengine - INFO - Epoch(train)   [6][200/235]  lr: 1.1724e-03  eta: 15:48:56  time: 0.8230  data_time: 0.0010  memory: 15652  grad_norm: 2.5394  loss: 4.0677
05/24 08:35:19 - mmengine - INFO - Exp name: simclr_convnext_base_16xb32-coslr-200e_custom_20230524_081546
05/24 08:36:42 - mmengine - INFO - Epoch(train)   [7][100/235]  lr: 1.2872e-03  eta: 15:47:08  time: 0.8236  data_time: 0.0010  memory: 15652  grad_norm: 3.4293  loss: 3.9468
05/24 08:38:04 - mmengine - INFO - Epoch(train)   [7][200/235]  lr: 1.3723e-03  eta: 15:45:50  time: 0.8233  data_time: 0.0011  memory: 15652  grad_norm: 2.6437  loss: 3.8465
05/24 08:38:33 - mmengine - INFO - Exp name: simclr_convnext_base_16xb32-coslr-200e_custom_20230524_081546
05/24 08:39:56 - mmengine - INFO - Epoch(train)   [8][100/235]  lr: 1.4871e-03  eta: 15:44:01  time: 0.8225  data_time: 0.0011  memory: 15652  grad_norm: 1.8691  loss: 3.8582
05/24 08:41:18 - mmengine - INFO - Epoch(train)   [8][200/235]  lr: 1.5721e-03  eta: 15:42:38  time: 0.8228  data_time: 0.0011  memory: 15652  grad_norm: 2.0627  loss: 3.9183
05/24 08:41:46 - mmengine - INFO - Exp name: simclr_convnext_base_16xb32-coslr-200e_custom_20230524_081546
05/24 08:43:09 - mmengine - INFO - Epoch(train)   [9][100/235]  lr: 1.6869e-03  eta: 15:40:45  time: 0.8212  data_time: 0.0010  memory: 15652  grad_norm: 2.5251  loss: 3.8778
05/24 08:43:26 - mmengine - INFO - Exp name: simclr_convnext_base_16xb32-coslr-200e_custom_20230524_081546
05/24 08:44:31 - mmengine - INFO - Epoch(train)   [9][200/235]  lr: 1.7720e-03  eta: 15:39:22  time: 0.8215  data_time: 0.0010  memory: 15652  grad_norm: 2.0051  loss: 3.9301
05/24 08:45:00 - mmengine - INFO - Exp name: simclr_convnext_base_16xb32-coslr-200e_custom_20230524_081546
05/24 08:46:22 - mmengine - INFO - Epoch(train)  [10][100/235]  lr: 1.8868e-03  eta: 15:37:25  time: 0.8206  data_time: 0.0010  memory: 15652  grad_norm: 1.3825  loss: 3.9548
05/24 08:47:45 - mmengine - INFO - Epoch(train)  [10][200/235]  lr: 1.9718e-03  eta: 15:35:58  time: 0.8209  data_time: 0.0010  memory: 15652  grad_norm: 1.1330  loss: 3.8551
05/24 08:48:13 - mmengine - INFO - Exp name: simclr_convnext_base_16xb32-coslr-200e_custom_20230524_081546
05/24 08:48:13 - mmengine - INFO - Saving checkpoint at 10 epochs
05/24 08:49:58 - mmengine - INFO - Epoch(train)  [11][100/235]  lr: 2.0866e-03  eta: 15:33:52  time: 0.8202  data_time: 0.0010  memory: 15652  grad_norm: 1.6851  loss: 3.9194
05/24 08:51:20 - mmengine - INFO - Epoch(train)  [11][200/235]  lr: 2.1717e-03  eta: 15:32:28  time: 0.8222  data_time: 0.0011  memory: 15652  grad_norm: 1.7275  loss: 3.8720
05/24 08:51:48 - mmengine - INFO - Exp name: simclr_convnext_base_16xb32-coslr-200e_custom_20230524_081546
05/24 08:53:11 - mmengine - INFO - Epoch(train)  [12][100/235]  lr: 2.2865e-03  eta: 15:30:35  time: 0.8218  data_time: 0.0011  memory: 15652  grad_norm: 0.7984  loss: 3.7909
05/24 08:54:33 - mmengine - INFO - Epoch(train)  [12][200/235]  lr: 2.3715e-03  eta: 15:29:11  time: 0.8223  data_time: 0.0011  memory: 15652  grad_norm: 1.6156  loss: 3.8137
05/24 08:55:02 - mmengine - INFO - Exp name: simclr_convnext_base_16xb32-coslr-200e_custom_20230524_081546
05/24 08:56:25 - mmengine - INFO - Epoch(train)  [13][100/235]  lr: 2.4863e-03  eta: 15:27:18  time: 0.8212  data_time: 0.0011  memory: 15652  grad_norm: 1.1481  loss: 3.7590
05/24 08:57:30 - mmengine - INFO - Exp name: simclr_convnext_base_16xb32-coslr-200e_custom_20230524_081546
05/24 08:57:47 - mmengine - INFO - Epoch(train)  [13][200/235]  lr: 2.5713e-03  eta: 15:25:54  time: 0.8218  data_time: 0.0010  memory: 15652  grad_norm: 0.9012  loss: 3.8700
05/24 08:58:15 - mmengine - INFO - Exp name: simclr_convnext_base_16xb32-coslr-200e_custom_20230524_081546
05/24 08:59:38 - mmengine - INFO - Epoch(train)  [14][100/235]  lr: 2.6861e-03  eta: 15:24:04  time: 0.8214  data_time: 0.0010  memory: 15652  grad_norm: 1.0098  loss: 3.8282
05/24 09:01:00 - mmengine - INFO - Epoch(train)  [14][200/235]  lr: 2.7712e-03  eta: 15:22:38  time: 0.8197  data_time: 0.0010  memory: 15652  grad_norm: 1.1811  loss: 3.7929
05/24 09:01:28 - mmengine - INFO - Exp name: simclr_convnext_base_16xb32-coslr-200e_custom_20230524_081546
05/24 09:02:51 - mmengine - INFO - Epoch(train)  [15][100/235]  lr: 2.8860e-03  eta: 15:20:37  time: 0.8181  data_time: 0.0011  memory: 15652  grad_norm: 0.6711  loss: 3.8354
05/24 09:04:13 - mmengine - INFO - Epoch(train)  [15][200/235]  lr: 2.9710e-03  eta: 15:19:08  time: 0.8187  data_time: 0.0010  memory: 15652  grad_norm: 0.8028  loss: 3.9236
05/24 09:04:41 - mmengine - INFO - Exp name: simclr_convnext_base_16xb32-coslr-200e_custom_20230524_081546
05/24 09:06:04 - mmengine - INFO - Epoch(train)  [16][100/235]  lr: 3.0858e-03  eta: 15:17:14  time: 0.8216  data_time: 0.0011  memory: 15652  grad_norm: 1.1105  loss: 3.8990
05/24 09:07:26 - mmengine - INFO - Epoch(train)  [16][200/235]  lr: 3.1709e-03  eta: 15:15:53  time: 0.8222  data_time: 0.0011  memory: 15652  grad_norm: 1.0128  loss: 3.8631
05/24 09:07:54 - mmengine - INFO - Exp name: simclr_convnext_base_16xb32-coslr-200e_custom_20230524_081546
05/24 09:09:17 - mmengine - INFO - Epoch(train)  [17][100/235]  lr: 3.2857e-03  eta: 15:14:06  time: 0.8235  data_time: 0.0011  memory: 15652  grad_norm: 1.1465  loss: 3.8694
05/24 09:10:40 - mmengine - INFO - Epoch(train)  [17][200/235]  lr: 3.3707e-03  eta: 15:12:47  time: 0.8231  data_time: 0.0011  memory: 15652  grad_norm: 0.7047  loss: 3.8291
05/24 09:11:08 - mmengine - INFO - Exp name: simclr_convnext_base_16xb32-coslr-200e_custom_20230524_081546
05/24 09:11:13 - mmengine - INFO - Exp name: simclr_convnext_base_16xb32-coslr-200e_custom_20230524_081546
05/24 09:12:31 - mmengine - INFO - Epoch(train)  [18][100/235]  lr: 3.4855e-03  eta: 15:10:58  time: 0.8236  data_time: 0.0011  memory: 15652  grad_norm: 0.6971  loss: 3.7901
05/24 09:13:54 - mmengine - INFO - Epoch(train)  [18][200/235]  lr: 3.5706e-03  eta: 15:09:39  time: 0.8235  data_time: 0.0011  memory: 15652  grad_norm: 0.5476  loss: 3.8809
05/24 09:14:22 - mmengine - INFO - Exp name: simclr_convnext_base_16xb32-coslr-200e_custom_20230524_081546
05/24 09:15:45 - mmengine - INFO - Epoch(train)  [19][100/235]  lr: 3.6854e-03  eta: 15:07:49  time: 0.8232  data_time: 0.0011  memory: 15652  grad_norm: 0.7486  loss: 3.8648
05/24 09:17:07 - mmengine - INFO - Epoch(train)  [19][200/235]  lr: 3.7704e-03  eta: 15:06:29  time: 0.8228  data_time: 0.0011  memory: 15652  grad_norm: 0.6724  loss: 3.8797
05/24 09:17:36 - mmengine - INFO - Exp name: simclr_convnext_base_16xb32-coslr-200e_custom_20230524_081546
05/24 09:18:59 - mmengine - INFO - Epoch(train)  [20][100/235]  lr: 3.8852e-03  eta: 15:04:41  time: 0.8243  data_time: 0.0011  memory: 15652  grad_norm: 0.3854  loss: 3.8307
05/24 09:20:21 - mmengine - INFO - Epoch(train)  [20][200/235]  lr: 3.9702e-03  eta: 15:03:21  time: 0.8254  data_time: 0.0012  memory: 15652  grad_norm: 0.5186  loss: 3.7799
05/24 09:20:50 - mmengine - INFO - Exp name: simclr_convnext_base_16xb32-coslr-200e_custom_20230524_081546
05/24 09:20:50 - mmengine - INFO - Saving checkpoint at 20 epochs
05/24 09:22:34 - mmengine - INFO - Epoch(train)  [21][100/235]  lr: 4.0000e-03  eta: 15:01:30  time: 0.8223  data_time: 0.0010  memory: 15652  grad_norm: 0.5500  loss: 3.8099
05/24 09:23:56 - mmengine - INFO - Epoch(train)  [21][200/235]  lr: 4.0000e-03  eta: 15:00:10  time: 0.8273  data_time: 0.0011  memory: 15652  grad_norm: 1.0582  loss: 3.8480
05/24 09:24:25 - mmengine - INFO - Exp name: simclr_convnext_base_16xb32-coslr-200e_custom_20230524_081546
05/24 09:25:19 - mmengine - INFO - Exp name: simclr_convnext_base_16xb32-coslr-200e_custom_20230524_081546
05/24 09:25:48 - mmengine - INFO - Epoch(train)  [22][100/235]  lr: 3.9999e-03  eta: 14:58:24  time: 0.8264  data_time: 0.0011  memory: 15652  grad_norm: 0.8211  loss: 3.8424
05/24 09:27:10 - mmengine - INFO - Epoch(train)  [22][200/235]  lr: 3.9999e-03  eta: 14:57:05  time: 0.8251  data_time: 0.0010  memory: 15652  grad_norm: 1.1552  loss: 3.7753
05/24 09:27:39 - mmengine - INFO - Exp name: simclr_convnext_base_16xb32-coslr-200e_custom_20230524_081546
05/24 09:29:02 - mmengine - INFO - Epoch(train)  [23][100/235]  lr: 3.9995e-03  eta: 14:55:17  time: 0.8245  data_time: 0.0009  memory: 15652  grad_norm: 0.7850  loss: 3.7244
05/24 09:30:25 - mmengine - INFO - Epoch(train)  [23][200/235]  lr: 3.9995e-03  eta: 14:53:59  time: 0.8259  data_time: 0.0009  memory: 15652  grad_norm: 0.6966  loss: 3.6794
05/24 09:30:53 - mmengine - INFO - Exp name: simclr_convnext_base_16xb32-coslr-200e_custom_20230524_081546
05/24 09:32:16 - mmengine - INFO - Epoch(train)  [24][100/235]  lr: 3.9989e-03  eta: 14:52:11  time: 0.8243  data_time: 0.0010  memory: 15652  grad_norm: 1.0456  loss: 3.6718
05/24 09:33:39 - mmengine - INFO - Epoch(train)  [24][200/235]  lr: 3.9989e-03  eta: 14:50:51  time: 0.8238  data_time: 0.0010  memory: 15652  grad_norm: 1.2114  loss: 3.7017
05/24 09:34:07 - mmengine - INFO - Exp name: simclr_convnext_base_16xb32-coslr-200e_custom_20230524_081546
05/24 09:35:30 - mmengine - INFO - Epoch(train)  [25][100/235]  lr: 3.9980e-03  eta: 14:49:01  time: 0.8229  data_time: 0.0009  memory: 15652  grad_norm: 0.8648  loss: 3.6302
05/24 09:36:52 - mmengine - INFO - Epoch(train)  [25][200/235]  lr: 3.9980e-03  eta: 14:47:40  time: 0.8239  data_time: 0.0009  memory: 15652  grad_norm: 0.7883  loss: 3.6774
05/24 09:37:21 - mmengine - INFO - Exp name: simclr_convnext_base_16xb32-coslr-200e_custom_20230524_081546
05/24 09:38:44 - mmengine - INFO - Epoch(train)  [26][100/235]  lr: 3.9969e-03  eta: 14:45:51  time: 0.8238  data_time: 0.0009  memory: 15652  grad_norm: 0.5536  loss: 3.5323
05/24 09:39:05 - mmengine - INFO - Exp name: simclr_convnext_base_16xb32-coslr-200e_custom_20230524_081546
05/24 09:40:06 - mmengine - INFO - Epoch(train)  [26][200/235]  lr: 3.9969e-03  eta: 14:44:31  time: 0.8247  data_time: 0.0010  memory: 15652  grad_norm: 0.7330  loss: 3.5585
05/24 09:40:35 - mmengine - INFO - Exp name: simclr_convnext_base_16xb32-coslr-200e_custom_20230524_081546
05/24 09:41:58 - mmengine - INFO - Epoch(train)  [27][100/235]  lr: 3.9955e-03  eta: 14:42:42  time: 0.8242  data_time: 0.0010  memory: 15652  grad_norm: 0.7349  loss: 3.7416
05/24 09:43:20 - mmengine - INFO - Epoch(train)  [27][200/235]  lr: 3.9955e-03  eta: 14:41:21  time: 0.8248  data_time: 0.0009  memory: 15652  grad_norm: 0.9094  loss: 3.6156
05/24 09:43:49 - mmengine - INFO - Exp name: simclr_convnext_base_16xb32-coslr-200e_custom_20230524_081546
05/24 09:45:12 - mmengine - INFO - Epoch(train)  [28][100/235]  lr: 3.9939e-03  eta: 14:39:32  time: 0.8250  data_time: 0.0009  memory: 15652  grad_norm: 0.6616  loss: 3.6236
05/24 09:46:34 - mmengine - INFO - Epoch(train)  [28][200/235]  lr: 3.9939e-03  eta: 14:38:12  time: 0.8245  data_time: 0.0009  memory: 15652  grad_norm: 0.6423  loss: 3.6834
05/24 09:47:03 - mmengine - INFO - Exp name: simclr_convnext_base_16xb32-coslr-200e_custom_20230524_081546
05/24 09:48:26 - mmengine - INFO - Epoch(train)  [29][100/235]  lr: 3.9920e-03  eta: 14:36:22  time: 0.8239  data_time: 0.0009  memory: 15652  grad_norm: 0.9037  loss: 3.5259
05/24 09:49:48 - mmengine - INFO - Epoch(train)  [29][200/235]  lr: 3.9920e-03  eta: 14:35:00  time: 0.8232  data_time: 0.0009  memory: 15652  grad_norm: 0.7693  loss: 3.5962
......................................................................................................
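As a side note on reading these numbers: the pretraining loss levels off in the 3.6-4.0 range, which is close to the chance-level value of the InfoNCE loss for this batch size. With a per-GPU batch of N=32 and two views per image, each sample is contrasted against 2N-1 = 63 candidates, so a collapsed (uninformative) encoder yields a loss of roughly ln(63) ≈ 4.14. This is only a rough sanity check (it assumes negatives are not gathered across GPUs), but a plateau barely below that value can indicate the encoder has learned little:

```python
import math

def infonce_chance_loss(batch_size: int, num_views: int = 2) -> float:
    """Expected InfoNCE loss when the encoder output carries no
    information: each of the (num_views * batch_size - 1) candidates
    is equally likely, so the loss is the log of that count."""
    num_candidates = num_views * batch_size - 1
    return math.log(num_candidates)

print(infonce_chance_loss(32))  # ~4.14 for the per-GPU batch of 32
```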

Command for downstream task

python tools/train.py configs/convnext/convnext-base_32xb128_custom.py --amp

Config file used for downstream task

_base_ = [
    '../_base_/models/convnext/convnext-base.py',
    '../_base_/datasets/custom_bs64_swin_224.py',
    '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
    '../_base_/default_runtime.py',
]

train_dataloader = dict(batch_size=128)

model = dict(
    backbone=dict(frozen_stages=3,
        init_cfg=dict(
            type='Pretrained',
            checkpoint='work_dirs/simclr_convnext_base_16xb32-coslr-200e_custom/epoch_90.pth',
            prefix='backbone',
        )),
    head=dict(num_classes=4),
)

data_root = 'data/custom/'

train_dataloader = dict(
    batch_size=64)

val_dataloader = dict(
    dataset=dict(
        type='CustomDataset',
        data_root=data_root,
        ann_file='meta/val.txt',    
        data_prefix='val/1/',
    ))
val_evaluator = dict(type='Accuracy', topk=(1, 2))

test_dataloader = val_dataloader
test_evaluator = val_evaluator

optim_wrapper = dict(
    optimizer=dict(lr=4e-3),
    clip_grad=None,
)

custom_hooks = [dict(type='EMAHook', momentum=1e-4, priority='ABOVE_NORMAL')]
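One quick sanity check on this setup is to verify that the saved SimCLR checkpoint actually contains keys under the `backbone.` prefix that `init_cfg` strips; if the prefix does not match, the ConvNeXt can end up silently randomly initialized. Below is a minimal sketch of that prefix filtering, with a toy dict standing in for `torch.load('epoch_90.pth')['state_dict']` (this is a simplified illustration, not the actual mmengine implementation):

```python
def load_by_prefix(state_dict, prefix):
    """Keep only the keys under `prefix` and strip the prefix itself,
    mimicking what init_cfg type='Pretrained' with prefix='backbone'
    does (simplified sketch)."""
    if not prefix.endswith('.'):
        prefix = prefix + '.'
    return {k[len(prefix):]: v for k, v in state_dict.items()
            if k.startswith(prefix)}

# Toy stand-in for the real checkpoint's state_dict.
ckpt = {
    'backbone.stem.weight': 1,
    'neck.fc0.weight': 2,
    'head.temperature': 3,
}
backbone_sd = load_by_prefix(ckpt, 'backbone')
print(sorted(backbone_sd))  # ['stem.weight'] -- neck/head keys dropped
```

If the filtered dict comes back empty for the real checkpoint, the `prefix` argument (or the checkpoint itself) is the first thing to investigate.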

Downstream task logs

05/26 11:26:30 - mmengine - INFO - paramwise_options -- backbone.norm3.weight:weight_decay=0.0
05/26 11:26:30 - mmengine - INFO - paramwise_options -- backbone.norm3.bias:weight_decay=0.0
05/26 11:26:30 - mmengine - INFO - paramwise_options -- head.fc.bias:weight_decay=0.0
05/26 11:26:31 - mmengine - INFO - load backbone in model from: work_dirs/simclr_convnext_base_16xb32-coslr-200e_custom/epoch_90.pth
Loads checkpoint by local backend from path: work_dirs/simclr_convnext_base_16xb32-coslr-200e_custom/epoch_90.pth
05/26 11:26:31 - mmengine - WARNING - "FileClient" will be deprecated in future. Please use io functions in https://mmengine.readthedocs.io/en/latest/api/fileio.html#file-io
05/26 11:26:31 - mmengine - WARNING - "HardDiskBackend" is the alias of "LocalBackend" and the former will be deprecated in future.
05/26 11:26:31 - mmengine - INFO - Checkpoints will be saved to /home/mutr_gu/Documents/mmpretrain/work_dirs/convnext-base_32xb128_custom.
05/26 11:26:49 - mmengine - INFO - Epoch(train)   [1][100/118]  lr: 1.7170e-04  eta: 1:45:29  time: 0.1650  data_time: 0.0007  memory: 1880  loss: nan
05/26 11:26:52 - mmengine - INFO - Exp name: convnext-base_32xb128_custom_20230526_112621
05/26 11:26:52 - mmengine - INFO - Saving checkpoint at 1 epochs
05/26 11:27:11 - mmengine - INFO - Epoch(val) [1][12/12]    accuracy/top1: 16.2092  accuracy/top2: 26.2745  data_time: 0.0542  time: 0.3024
05/26 11:27:28 - mmengine - INFO - Epoch(train)   [2][100/118]  lr: 3.7158e-04  eta: 1:41:12  time: 0.1637  data_time: 0.0005  memory: 1880  loss: nan
05/26 11:27:31 - mmengine - INFO - Exp name: convnext-base_32xb128_custom_20230526_112621
05/26 11:27:31 - mmengine - INFO - Saving checkpoint at 2 epochs
05/26 11:27:48 - mmengine - INFO - Epoch(val) [2][12/12]    accuracy/top1: 16.2092  accuracy/top2: 26.2745  data_time: 0.0245  time: 0.2753
05/26 11:28:05 - mmengine - INFO - Epoch(train)   [3][100/118]  lr: 5.7147e-04  eta: 1:39:42  time: 0.1636  data_time: 0.0005  memory: 1880  loss: nan
05/26 11:28:08 - mmengine - INFO - Exp name: convnext-base_32xb128_custom_20230526_112621
05/26 11:28:08 - mmengine - INFO - Saving checkpoint at 3 epochs
05/26 11:28:28 - mmengine - INFO - Epoch(val) [3][12/12]    accuracy/top1: 16.2092  accuracy/top2: 26.2745  data_time: 0.0216  time: 0.2705
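The loss is nan from the very first logged step while top-1 accuracy is frozen at 16.21, which suggests the classifier never receives usable gradients. Two common culprits worth ruling out here are fp16 overflow under `--amp` and a learning rate inherited from a schedule tuned for a much larger batch: the `imagenet_bs1024_adamw_swin` base assumes an effective batch of 1024, while this run uses 64. The linear LR scaling heuristic would suggest a much smaller rate (this is a rule of thumb, not part of mmpretrain):

```python
def scaled_lr(base_lr: float, base_batch: int, actual_batch: int) -> float:
    """Linear LR scaling rule: scale the learning rate in proportion
    to the effective batch size relative to the schedule's reference."""
    return base_lr * actual_batch / base_batch

# The _base_ schedule assumes batch 1024; this run trains with 64.
print(scaled_lr(4e-3, 1024, 64))  # 0.00025
```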

Environment

{'sys.platform': 'linux',
 'Python': '3.8.16 (default, Mar  2 2023, 03:21:46) [GCC 11.2.0]',
 'CUDA available': True,
 'numpy_random_seed': 2147483648,
 'GPU 0': 'NVIDIA TITAN RTX',
 'CUDA_HOME': '/usr/local/cuda',
 'NVCC': 'Cuda compilation tools, release 11.0, V11.0.194',
 'GCC': 'gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0',
 'PyTorch': '1.8.2',
 'TorchVision': '0.9.2',
 'OpenCV': '4.7.0',
 'MMEngine': '0.7.3',
 'MMCV': '2.0.0rc4',
 'MMPreTrain': '1.0.0rc8+4dd8a86'}

Other information

No response

guneetmutreja commented 1 year ago

@Ezra-Yu Could you please help me resolve this issue?

guneetmutreja commented 1 year ago

@fangyixiao18 Could you please advise me on resolving this issue?