keyu-tian / SparK

[ICLR'23 Spotlight🔥] The first successful BERT/MAE-style pretraining on any convolutional network; Pytorch impl. of "Designing BERT for Convolutional Networks: Sparse and Hierarchical Masked Modeling"
https://arxiv.org/abs/2301.03580
MIT License

The time of training resnet50 #8

Closed ModulatedConvolutionalNetworks closed 1 year ago

ModulatedConvolutionalNetworks commented 1 year ago

Thanks for your wonderful work! I find that training resnet50 on V100s takes a very long time. Could you share the training log for resnet50 and the total training time? Thanks!

keyu-tian commented 1 year ago

Thanks! An 800-epoch pretraining of ResNet50 with a batch size of 4096 takes about 48h on our 32 A100s. V100s would be somewhat slower.

The pretrain_log.txt should look like this:

{"name": "spark_github_r50_800ep", "cmd": "--local_rank=0 --exp_name spark_github_r50_800ep --exp_dir <some_dir> --ep=800 --wp_ep=20 --tb_lg_dir=<some_dir>", "git_commit_id": "a2c2ea206bc3df58ad9c0b13d1c0e5b5558dac0e", "git_commit_msg": "[upd] READMEs & cfgs", "model": "resnet50"}

{"cur_ep": "", "last_L": 0.0, "rema": "", "fini": ""}
{"cur_ep": "1/800", "last_L": 0.9538699577565486, "rema": "2 days, 3:39:11", "fini": "03-09 03:55"}
{"cur_ep": "2/800", "last_L": 0.8007919795442218, "rema": "2 days, 0:26:11", "fini": "03-09 00:45"}
{"cur_ep": "3/800", "last_L": 0.6852362458579266, "rema": "2 days, 0:26:24", "fini": "03-09 00:49"}
{"cur_ep": "4/800", "last_L": 0.6151618401820477, "rema": "2 days, 0:20:45", "fini": "03-09 00:47"}
...
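
To compare your own run against these records, a small reader along the following lines may help. It is only a sketch: it assumes each line of pretrain_log.txt is a standalone JSON object, as in the excerpt above, and the helper name `read_progress` is made up for illustration and is not part of the repository.

```python
import json

def read_progress(lines):
    """Return the per-epoch records, skipping blank, elided, or non-JSON lines."""
    records = []
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # e.g. empty lines or a literal "..."
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate a malformed line instead of aborting
        if rec.get("cur_ep"):  # drops the header record and the empty placeholder
            records.append(rec)
    return records

# Sample lines copied from the excerpt above.
sample = [
    '{"cur_ep": "", "last_L": 0.0, "rema": "", "fini": ""}',
    '{"cur_ep": "1/800", "last_L": 0.9538699577565486, "rema": "2 days, 3:39:11", "fini": "03-09 03:55"}',
    '{"cur_ep": "2/800", "last_L": 0.8007919795442218, "rema": "2 days, 0:26:11", "fini": "03-09 00:45"}',
]

latest = read_progress(sample)[-1]
done, total = map(int, latest["cur_ep"].split("/"))
print(f"epoch {done}/{total}, loss {latest['last_L']:.4f}, remaining {latest['rema']}")
```

On a real run, the same function can be fed an open file handle, for example `read_progress(open("pretrain_log.txt"))`.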

Part of stdout_backup.txt should look like this:

[03-07 00:12:05] (main.py                 , line  74)=> [PT start] from ep0
[03-07 00:12:20] (ial/sparko/utils/misc.py, line 314)=> [PT] Epoch 0:  [  0/313]  eta: 1:16:45  max_lr: 0.00002  last_loss: 1.0071 (1.0071)  orig_norm: 0.1493 (0.1493)  iter: 14.7154s  data: 0.0002s
[03-07 00:14:08] (ial/sparko/utils/misc.py, line 314)=> [PT] Epoch 0:  [156/313]  eta: 0:02:03  max_lr: 0.00010  last_loss: 0.9669 (0.9930)  orig_norm: 0.5328 (0.2676)  iter: 0.6894s  data: 0.0004s
[03-07 00:15:56] (ial/sparko/utils/misc.py, line 314)=> [PT] Epoch 0:  [312/313]  eta: 0:00:00  max_lr: 0.00017  last_loss: 0.8787 (0.9538)  orig_norm: 0.9295 (0.5305)  iter: 0.6938s  data: 0.0042s
[03-07 00:15:56] (ial/sparko/utils/misc.py, line 333)=> [PT] Epoch 0:   Total time:      0:03:50   (0.737 s / it)
[03-07 00:15:57] (main.py                 , line  94)=>   [*] [ep0/800]    Min/Last Recon Loss: 0.9539 0.9539,    Cost: 232.73s,    Remain: 2 days, 3:39:11,    Finish @ 03-09 03:55
[03-07 00:15:58] (ial/sparko/utils/misc.py, line 314)=> [PT] Epoch 1:  [  0/313]  eta: 0:04:17  max_lr: 0.00018  last_loss: 0.8811 (0.8811)  orig_norm: 0.9523 (0.9523)  iter: 0.8220s  data: 0.0001s
[03-07 00:17:45] (ial/sparko/utils/misc.py, line 314)=> [PT] Epoch 1:  [156/313]  eta: 0:01:48  max_lr: 0.00025  last_loss: 0.8076 (0.8385)  orig_norm: 1.0336 (0.9978)  iter: 0.6860s  data: 0.0004s
[03-07 00:19:33] (ial/sparko/utils/misc.py, line 314)=> [PT] Epoch 1:  [312/313]  eta: 0:00:00  max_lr: 0.00033  last_loss: 0.7277 (0.8011)  orig_norm: 1.0739 (1.0324)  iter: 0.6958s  data: 0.0043s
[03-07 00:19:33] (ial/sparko/utils/misc.py, line 333)=> [PT] Epoch 1:   Total time:      0:03:36   (0.691 s / it)
[03-07 00:19:34] (main.py                 , line  94)=>   [*] [ep1/800]    Min/Last Recon Loss: 0.8008 0.8008,    Cost: 218.51s,    Remain: 2 days, 0:26:11,    Finish @ 03-09 00:45
[03-07 00:19:35] (ial/sparko/utils/misc.py, line 314)=> [PT] Epoch 2:  [  0/313]  eta: 0:05:13  max_lr: 0.00033  last_loss: 0.7360 (0.7360)  orig_norm: 1.0495 (1.0495)  iter: 1.0022s  data: 0.0002s
[03-07 00:21:23] (ial/sparko/utils/misc.py, line 314)=> [PT] Epoch 2:  [156/313]  eta: 0:01:48  max_lr: 0.00041  last_loss: 0.6857 (0.7064)  orig_norm: 1.0699 (1.0844)  iter: 0.6883s  data: 0.0004s
[03-07 00:23:11] (ial/sparko/utils/misc.py, line 314)=> [PT] Epoch 2:  [312/313]  eta: 0:00:00  max_lr: 0.00049  last_loss: 0.6468 (0.6855)  orig_norm: 0.9105 (1.0421)  iter: 0.6949s  data: 0.0042s
[03-07 00:23:11] (ial/sparko/utils/misc.py, line 333)=> [PT] Epoch 2:   Total time:      0:03:36   (0.692 s / it)
[03-07 00:23:12] (main.py                 , line  94)=>   [*] [ep2/800]    Min/Last Recon Loss: 0.6852 0.6852,    Cost: 218.8s,    Remain: 2 days, 0:26:24,    Finish @ 03-09 00:49
[03-07 00:23:13] (ial/sparko/utils/misc.py, line 314)=> [PT] Epoch 3:  [  0/313]  eta: 0:05:12  max_lr: 0.00049  last_loss: 0.6406 (0.6406)  orig_norm: 0.9049 (0.9049)  iter: 0.9995s  data: 0.0001s
[03-07 00:25:01] (ial/sparko/utils/misc.py, line 314)=> [PT] Epoch 3:  [156/313]  eta: 0:01:48  max_lr: 0.00057  last_loss: 0.6155 (0.6270)  orig_norm: 0.6180 (0.7683)  iter: 0.6880s  data: 0.0004s
[03-07 00:26:49] (ial/sparko/utils/misc.py, line 314)=> [PT] Epoch 3:  [312/313]  eta: 0:00:00  max_lr: 0.00065  last_loss: 0.5913 (0.6138)  orig_norm: 0.3467 (0.6063)  iter: 0.6951s  data: 0.0044s
[03-07 00:26:49] (ial/sparko/utils/misc.py, line 333)=> [PT] Epoch 3:   Total time:      0:03:36   (0.692 s / it)
[03-07 00:26:50] (main.py                 , line  94)=>   [*] [ep3/800]    Min/Last Recon Loss: 0.6152 0.6152,    Cost: 218.65s,    Remain: 2 days, 0:20:45,    Finish @ 03-09 00:47
...
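
As a sanity check, the per-epoch figures in the log above can be extrapolated to the full schedule. The sketch below uses only numbers read off the excerpt (313 iterations per epoch, roughly 0.692 s per iteration, roughly 218.65 s per epoch, 800 epochs). The function name `estimate_hours` is hypothetical, and the estimate assumes iteration time stays constant over the whole run.

```python
EPOCHS = 800             # --ep=800
ITERS_PER_EPOCH = 313    # from "[312/313]" in the log
SEC_PER_ITER = 0.692     # from "(0.692 s / it)"
EPOCH_COST_SEC = 218.65  # from "Cost: 218.65s", includes per-epoch overhead

def estimate_hours(sec_per_iter, iters_per_epoch=ITERS_PER_EPOCH, epochs=EPOCHS):
    """Extrapolate total wall-clock hours from a steady-state iteration time."""
    return epochs * iters_per_epoch * sec_per_iter / 3600.0

compute_only = estimate_hours(SEC_PER_ITER)
with_overhead = EPOCHS * EPOCH_COST_SEC / 3600.0

print(f"iterations only: {compute_only:.1f} h")
print(f"with overhead:   {with_overhead:.1f} h")
```

Both figures land between 48 and 49 hours, in line with the "Remain: 2 days, 0:2x" values in the log. To estimate a V100 run with the same global batch size, pass your own measured `iter:` time to `estimate_hours`.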

nvidia-smi should return something like this:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.191.01   Driver Version: 450.191.01   CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-SXM-80GB       On   | 00000000:10:00.0 Off |                    0 |
| N/A   67C    P0   323W / 400W |  33717MiB / 81252MiB |     95%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  A100-SXM-80GB       On   | 00000000:16:00.0 Off |                    0 |
| N/A   59C    P0   405W / 400W |  33695MiB / 81252MiB |     99%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  A100-SXM-80GB       On   | 00000000:4A:00.0 Off |                    0 |
| N/A   61C    P0   395W / 400W |  33695MiB / 81252MiB |     97%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  A100-SXM-80GB       On   | 00000000:4E:00.0 Off |                    0 |
| N/A   61C    P0   315W / 400W |  33683MiB / 81252MiB |     99%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  A100-SXM-80GB       On   | 00000000:89:00.0 Off |                    0 |
| N/A   60C    P0   336W / 400W |  33743MiB / 81252MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  A100-SXM-80GB       On   | 00000000:8E:00.0 Off |                    0 |
| N/A   60C    P0   325W / 400W |  33715MiB / 81252MiB |     99%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  A100-SXM-80GB       On   | 00000000:C5:00.0 Off |                    0 |
| N/A   55C    P0   311W / 400W |  33695MiB / 81252MiB |     99%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  A100-SXM-80GB       On   | 00000000:C9:00.0 Off |                    0 |
| N/A   65C    P0   155W / 400W |  33695MiB / 81252MiB |     99%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
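
To check utilization on your own machines without reading the table by eye, something like the following sketch could be used. It relies on the `--query-gpu` and `--format=csv,noheader,nounits` options of nvidia-smi. The helper names and the 90% threshold are assumptions chosen for illustration, and the parser assumes every field is a plain integer (some devices report `[N/A]`, which this sketch does not handle).

```python
import subprocess

QUERY = "index,utilization.gpu,memory.used,memory.total"

def parse_gpu_csv(text):
    """Parse 'index, util, mem_used, mem_total' CSV rows (no header, no units)."""
    rows = []
    for line in text.strip().splitlines():
        idx, util, used, total = (int(x.strip()) for x in line.split(","))
        rows.append({"gpu": idx, "util_pct": util,
                     "mem_used_mib": used, "mem_total_mib": total})
    return rows

def query_gpus():
    """Run nvidia-smi and return one dict per GPU (requires an NVIDIA driver)."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_csv(out)

def underused(rows, min_util=90):
    """Return indices of GPUs whose utilization is below min_util percent."""
    return [r["gpu"] for r in rows if r["util_pct"] < min_util]

# Sample rows built from GPUs 0 and 1 in the table above.
sample = "0, 95, 33717, 81252\n1, 99, 33695, 81252\n"
rows = parse_gpu_csv(sample)
print(underused(rows))
```

If many GPUs show up as underused during training, data loading or inter-node communication is a likely place to look before blaming the GPU model itself.
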

trungpx commented 1 year ago

Concern.