ds2268 opened this issue 8 months ago
I have also tried the parameters from the paper (batch size 2048, lr=3e-8, etc.). The fine-tuning still explodes: the loss quickly drops toward 0 and then goes to NaN.
```
[12-07 18:37:04] (nstream_imagenet/main.py, line 174)=> [ep0 it 3/626] L: 0.6937 Acc: 0.00 lr: 3.1e-05~3.8e-04 Remain: 3:26:47
[12-07 18:40:10] (nstream_imagenet/main.py, line 174)=> [ep0 it313/626] L: 0.0078 Acc: 0.00 lr: 5.5e-04~6.7e-03 Remain: 0:04:24
[12-07 18:43:23] (nstream_imagenet/main.py, line 174)=> [ep0 it625/626] L: 0.0059 Acc: 9.72 lr: 1.1e-03~1.3e-02 Remain: 0:00:00
[12-07 18:44:04] (nstream_imagenet/main.py, line 84)=> [ep0/300] Max (Last) Acc: 8.97 (8.97 o 50000.0) EMA: 0.13 (0.01 o 50000.0) Ep cost: 500.25s, Ev cost: 23.38, Remain: 1 day, 17:32:55, Finish @ 12-09 05:16
[12-07 18:44:06] (nstream_imagenet/main.py, line 60)=> [loader_train.sampler.set_epoch(1)]
[12-07 18:44:13] (nstream_imagenet/main.py, line 174)=> [ep1 it 3/626] L: 0.0059 Acc: 15.62 lr: 1.1e-03~1.3e-02 Remain: 0:18:02
[12-07 18:47:18] (nstream_imagenet/main.py, line 174)=> [ep1 it313/626] L: 0.0055 Acc: 21.09 lr: 1.6e-03~1.9e-02 Remain: 0:03:11
[12-07 18:50:15] (nstream_imagenet/main.py, line 174)=> [ep1 it625/626] L: 0.0056 Acc: 23.61 lr: 2.1e-03~2.6e-02 Remain: 0:00:00
[12-07 18:50:15] (nstream_imagenet/main.py, line 84)=> [ep1/300] Max (Last) Acc: 8.97 (8.97 o 50000.0) EMA: 0.13 (0.01 o 50000.0) Ep cost: 370.16s, Ev cost: -, Remain: 1 day, 6:38:28, Finish @ 12-08 18:28
[12-07 18:50:17] (nstream_imagenet/main.py, line 60)=> [loader_train.sampler.set_epoch(2)]
[12-07 18:50:28] (nstream_imagenet/main.py, line 174)=> [ep2 it 3/626] L: 0.0055 Acc: 23.44 lr: 2.1e-03~2.6e-02 Remain: 0:29:35
[12-07 18:53:36] (nstream_imagenet/main.py, line 174)=> [ep2 it313/626] L: 0.0071 Acc: 13.28 lr: 2.6e-03~3.2e-02 Remain: 0:03:18
[12-07 18:56:33] (nstream_imagenet/main.py, line 174)=> [ep2 it625/626] L: 0.0069 Acc: 5.56 lr: 3.2e-03~3.9e-02 Remain: 0:00:00
[12-07 18:56:33] (nstream_imagenet/main.py, line 84)=> [ep2/300] Max (Last) Acc: 8.97 (8.97 o 50000.0) EMA: 0.13 (0.01 o 50000.0) Ep cost: 376.92s, Ev cost: -, Remain: 1 day, 7:05:45, Finish @ 12-08 19:02
[12-07 18:56:34] (nstream_imagenet/main.py, line 60)=> [loader_train.sampler.set_epoch(3)]
[12-07 18:56:48] (nstream_imagenet/main.py, line 174)=> [ep3 it 3/626] L: 0.0077 Acc: 0.78 lr: 3.2e-03~3.9e-02 Remain: 0:34:59
[12-07 18:59:55] (nstream_imagenet/main.py, line 174)=> [ep3 it313/626] L: 62.9384 Acc: 0.00 lr: 3.7e-03~4.5e-02 Remain: 0:03:20
[12-07 19:02:52] (nstream_imagenet/main.py, line 174)=> [ep3 it625/626] L: 317.5974 Acc: 0.00 lr: 4.2e-03~5.1e-02 Remain: 0:00:00
[12-07 19:02:52] (nstream_imagenet/main.py, line 84)=> [ep3/300] Max (Last) Acc: 8.97 (8.97 o 50000.0) EMA: 0.13 (0.01 o 50000.0) Ep cost: 378.86s, Ev cost: -, Remain: 1 day, 7:09:03, Finish @ 12-08 19:11
[12-07 19:03:08] (nstream_imagenet/main.py, line 174)=> [ep4 it 3/626] L: 267.8481 Acc: 0.00 lr: 4.2e-03~5.1e-02 Remain: 0:38:13
[12-07 19:06:16] (nstream_imagenet/main.py, line 174)=> [ep4 it313/626] L: 352016.5938 Acc: 0.00 lr: 4.7e-03~5.8e-02 Remain: 0:03:21
[12-07 19:09:15] (nstream_imagenet/main.py, line 174)=> [ep4 it625/626] L: 3266225152.0000 Acc: 0.00 lr: 5.3e-03~6.4e-02 Remain: 0:00:00
[12-07 19:09:15] (nstream_imagenet/main.py, line 84)=> [ep4/300] Max (Last) Acc: 8.97 (8.97 o 50000.0) EMA: 0.13 (0.01 o 50000.0) Ep cost: 382.58s, Ev cost: -, Remain: 1 day, 7:21:01, Finish @ 12-08 19:30
[12-07 19:09:31] (nstream_imagenet/main.py, line 174)=> [ep5 it 3/626] L: 3494824192.0000 Acc: 0.00 lr: 5.3e-03~6.4e-02 Remain: 0:38:32
[12-07 19:12:40] (nstream_imagenet/main.py, line 174)=> [ep5 it313/626] L: nan Acc: 1.56 lr: 5.3e-03~6.4e-02 Remain: 0:03:22
[12-07 19:15:39] (nstream_imagenet/main.py, line 174)=> [ep5 it625/626] L: nan Acc: 0.00 lr: 5.3e-03~6.4e-02 Remain: 0:00:00
```
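An aside on reading the log: the lr column shows a range (e.g. `5.3e-03~6.4e-02`) rather than a single value, which I read as layer-wise lr decay, where shallower layers get geometrically smaller rates than the deepest one. A minimal sketch of that rule (the decay factor 0.8 and the layer count 12 are illustrative assumptions, not values I've confirmed in the repo):

```python
def layerwise_lrs(peak_lr, num_layers, decay=0.8):
    # deepest layer trains at peak_lr; each shallower layer is scaled by `decay`
    return [peak_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]

lrs = layerwise_lrs(6.4e-2, num_layers=12)
# lrs[0] (shallowest layer) comes out around 5.5e-3 while lrs[-1] (deepest)
# is 6.4e-2 -- the same shape as the range printed in the log
```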
Hi @ds2268, the 800-epoch pre-training looks normal. The fine-tuning loss before the explosion (~5e-3, close to zero) is also expected, since we use BCE loss instead of CE. (ps: we never observed a loss explosion in any of our fine-tuning experiments)
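To see why values around 5e-3 to 8e-3 are the expected scale for BCE here (a back-of-the-envelope sketch, not code from the repo): with 1000 classes, a one-hot target, and near-chance predictions of about 0.001 per class, the loss averaged over all classes lands right in that range:

```python
import math

def mean_bce(p_pos, p_neg, num_classes=1000):
    """Mean binary cross-entropy over all classes for a one-hot target,
    when the true class is predicted with probability p_pos and every
    other class with probability p_neg."""
    loss_pos = -math.log(p_pos)              # contribution of the one positive class
    loss_neg = -math.log(1.0 - p_neg)        # contribution of each negative class
    return (loss_pos + (num_classes - 1) * loss_neg) / num_classes

# near-chance predictions of ~1/1000 per class give a loss of ~0.008,
# the same order as the 0.0055-0.0078 values in the log
loss = mean_bce(0.001, 0.001)
```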
Have you used mixed precision?
I also found that the default batch size should be 2048; maybe you can try that as well.
I have tried the batch-size-2048 config from the paper, with no success. I don't think the downstream ImageNet code uses mixed precision; I could only find the apex libs in downstream mmdet.
Could you try running with timm==0.5.4?
I am already running with:
timm 0.5.44, torch 1.12.0, torchvision 0.13.1
Looks like the issue with ResNet-50 is related to #27
Honestly, I have no idea what the problem with the fine-tuning code is (yes, #27 looks similar). Maybe you can try again with base_lr < 0.002; I will run this too.
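Another stopgap worth trying alongside a smaller base_lr (my own suggestion; nothing in this thread confirms the repo ships it) is gradient clipping before the optimizer step. The rescaling rule, sketched framework-free:

```python
import math

def clip_grad_norm(grads, max_norm):
    """Scale a flat list of gradient values so their global L2 norm is at
    most max_norm -- the same rule torch.nn.utils.clip_grad_norm_ applies
    across all parameter tensors."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads

# an exploding gradient of norm 500 is rescaled down to norm 1.0
clipped = clip_grad_norm([300.0, 400.0], max_norm=1.0)
```

In a PyTorch training loop this corresponds to calling `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)` between `loss.backward()` and `optimizer.step()`.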
@keyu-tian, I have now pretrained ConvNext-S model (800 epochs) and performed ImageNet finetuning:
It's not finished yet (140 / 200 epochs), but it looks like it's working for ConvNext-S. The reported result for ConvNext-S is 84.1; I probably won't reach it within 200 epochs, but that is likely just due to the 800-epoch pre-training.
So the problem really is specific to ResNet-50 stability.
@ds2268 thanks for your verification. So it should be LAMB or BCE causing the problem.
Currently I don't have enough GPUs or time to debug further. You could start with ConvNeXt, try a smaller fine-tuning learning rate for ResNet-50, or try ResNet-101.
ps: it is always recommended to use the default hyperparameters in downstream_imagenet/args.py, not those from the paper (which may be outdated) or elsewhere.
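For context on why the batch size and base_lr interact at all (assuming the usual linear-scaling convention; I haven't verified that args.py applies exactly this rule):

```python
def scaled_lr(base_lr, batch_size, base_batch=256):
    # linear scaling rule: the effective peak lr grows in proportion
    # to the global batch size relative to a 256-sample reference
    return base_lr * batch_size / base_batch

# base_lr=0.002 at a global batch of 4096 gives an effective peak lr of 0.032,
# so halving the batch to 2048 also halves the effective peak lr
peak = scaled_lr(0.002, 4096)
```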
I have pre-trained the resnet50 model for 800 epochs. The loss looks fine:
I then used the pre-trained model for ImageNet fine-tuning, and the loss pretty much always "exploded" (see the log above).
I am using the original hyperparameters defined in HP_DEFAULT_VALUES on 32x A100 GPUs with the default batch_size=4096.
Any clues @keyu-tian?