open-mmlab / mmagic

OpenMMLab Multimodal Advanced, Generative, and Intelligent Creation Toolbox. Unlock the magic 🪄: Generative-AI (AIGC), easy-to-use APIs, awesome model zoo, diffusion models, for text-to-image generation, image/video restoration/enhancement, etc.
https://mmagic.readthedocs.io/en/latest/
Apache License 2.0
6.89k stars 1.06k forks

data_time is too long when training denoising and deblurring models, and GPU utilization is too low #2130

Open GeLeinjust opened 6 months ago

GeLeinjust commented 6 months ago

Prerequisite

Task

I'm using the official example scripts/configs for the officially supported tasks/models/datasets.

Branch

main branch https://github.com/open-mmlab/mmagic

Environment

pytorch 11.7; other packages as listed in requirements.txt

Reproduces the problem - code sample

You can reproduce the problem by training NAFNet for denoising on SIDD and for deblurring on GoPro.

Reproduces the problem - command or script

CUDA_VISIBLE_DEVICES=0 python tools/train.py configs/nafnet/nafnet_c64eb2248mb12db2222_8xb8-lr1e-3-400k_sidd.py \
    --work-dir ./work_dirs/naf_sidd \
    --auto-scale-lr \
    --amp

Reproduces the problem - error message

This is part of the log from training NAFNet on SIDD. During training, data_time accounts for most of each iteration's time.

03/20 11:49:02 - mmengine - INFO - Exp name: nafnet_c64eb2248mb12db2222_lr1e-3_400k_sidd_20240320_114530
03/20 11:49:02 - mmengine - INFO - Iter(train) [ 40/20000] lr: 2.0000e-03 memory: 6946 data_time: 4.7464 loss: -18.6554 time: 5.1101
03/20 11:53:11 - mmengine - INFO - Iter(train) [ 100/20000] lr: 1.9999e-03 eta: 1 day, 1:03:44 time: 4.5339 data_time: 4.2340 memory: 6946 loss: -25.1152
03/20 11:59:38 - mmengine - INFO - Iter(train) [ 200/20000] lr: 1.9995e-03 eta: 23:06:41 time: 3.8703 data_time: 3.6130 memory: 6946 loss: -33.5698
03/20 12:05:44 - mmengine - INFO - Iter(train) [ 300/20000] lr: 1.9989e-03 eta: 22:00:53 time: 3.6648 data_time: 3.4076 memory: 6946 loss: -34.8641
03/20 12:12:13 - mmengine - INFO - Iter(train) [ 400/20000] lr: 1.9980e-03 eta: 21:42:56 time: 3.8854 data_time: 3.6273 memory: 6946 loss: -35.0379
03/20 12:19:02 - mmengine - INFO - Iter(train) [ 500/20000] lr: 1.9969e-03 eta: 21:42:59 time: 4.0917 data_time: 3.8339 memory: 6946 loss: -35.8696
03/20 12:25:20 - mmengine - INFO - Iter(train) [ 600/20000] lr: 1.9956e-03 eta: 21:23:50 time: 3.7777 data_time: 3.5199 memory: 6946 loss: -37.0751
03/20 12:31:35 - mmengine - INFO - Iter(train) [ 700/20000] lr: 1.9940e-03 eta: 21:06:59 time: 3.7479 data_time: 3.4884 memory: 6946 loss: -37.3975
03/20 12:37:41 - mmengine - INFO - Iter(train) [ 800/20000] lr: 1.9921e-03 eta: 20:49:35 time: 3.6678 data_time: 3.4098 memory: 6946 loss: -37.6647
03/20 12:44:41 - mmengine - INFO - Iter(train) [ 900/20000] lr: 1.9900e-03 eta: 20:53:17 time: 4.1938 data_time: 3.9359 memory: 6946 loss: -37.7504
03/20 12:51:04 - mmengine - INFO - Exp name: nafnet_c64eb2248mb12db2222_lr1e-3_400k_sidd_20240320_114530
03/20 12:51:04 - mmengine - INFO - Iter(train) [ 1000/20000] lr: 1.9877e-03 eta: 20:43:17 time: 3.8284 data_time: 3.5698 memory: 6946 loss: -38.3180
03/20 12:51:04 - mmengine - INFO - Saving checkpoint at 1000 iterations
03/20 12:57:26 - mmengine - INFO - Iter(train) [ 1100/20000] lr: 1.9851e-03 eta: 20:33:44 time: 3.8210 data_time: 3.5634 memory: 6946 loss: -38.2264
03/20 13:03:55 - mmengine - INFO - Iter(train) [ 1200/20000] lr: 1.9823e-03 eta: 20:26:41 time: 3.8970 data_time: 3.6391 memory: 6946 loss: -38.6776
03/20 13:10:27 - mmengine - INFO - Iter(train) [ 1300/20000] lr: 1.9793e-03 eta: 20:20:17 time: 3.9197 data_time: 3.6504 memory: 6946 loss: -38.9627
03/20 13:17:17 - mmengine - INFO - Iter(train) [ 1400/20000] lr: 1.9760e-03 eta: 20:17:44 time: 4.0952 data_time: 3.8379 memory: 6946 loss: -39.0869
03/20 13:23:24 - mmengine - INFO - Iter(train) [ 1500/20000] lr: 1.9724e-03 eta: 20:05:50 time: 3.6675 data_time: 3.4088 memory: 6946 loss: -39.0055
03/20 13:29:31 - mmengine - INFO - Iter(train) [ 1600/20000] lr: 1.9686e-03 eta: 19:54:41 time: 3.6693 data_time: 3.4121 memory: 6946 loss: -39.3286
03/20 13:36:10 - mmengine - INFO - Iter(train) [ 1700/20000] lr: 1.9646e-03 eta: 19:49:54 time: 3.9914 data_time: 3.7340 memory: 6946 loss: -39.4060
03/20 13:42:56 - mmengine - INFO - Iter(train) [ 1800/20000] lr: 1.9603e-03 eta: 19:46:11 time: 4.0665 data_time: 3.8090 memory: 6946 loss: -39.6298
03/20 13:48:56 - mmengine - INFO - Iter(train) [ 1900/20000] lr: 1.9558e-03 eta: 19:34:42 time: 3.5981 data_time: 3.3417 memory: 6946 loss: -39.6971
03/20 13:55:22 - mmengine - INFO - Exp name: nafnet_c64eb2248mb12db2222_lr1e-3_400k_sidd_20240320_114530
03/20 13:55:22 - mmengine - INFO - Iter(train) [ 2000/20000] lr: 1.9511e-03 eta: 19:27:38 time: 3.8557 data_time: 3.5993 memory: 6946 loss: -39.6293
03/20 13:55:22 - mmengine - INFO - Saving checkpoint at 2000 iterations
03/20 13:55:36 - mmengine - INFO - Iter(val) [ 100/1280] eta: 0:02:26 time: 0.1238 data_time: 0.0358 memory: 825
03/20 13:55:46 - mmengine - INFO - Iter(val) [ 200/1280] eta: 0:01:48 time: 0.1008 data_time: 0.0173 memory: 825
03/20 13:55:57 - mmengine - INFO - Iter(val) [ 300/1280] eta: 0:01:42 time: 0.1049 data_time: 0.0184 memory: 825
03/20 13:56:07 - mmengine - INFO - Iter(val) [ 400/1280] eta: 0:01:27 time: 0.0992 data_time: 0.0172 memory: 825
03/20 13:56:16 - mmengine - INFO - Iter(val) [ 500/1280] eta: 0:01:12 time: 0.0930 data_time: 0.0155 memory: 825
03/20 13:56:25 - mmengine - INFO - Iter(val) [ 600/1280] eta: 0:01:03 time: 0.0934 data_time: 0.0152 memory: 825
03/20 13:56:35 - mmengine - INFO - Iter(val) [ 700/1280] eta: 0:00:54 time: 0.0941 data_time: 0.0153 memory: 825
03/20 13:56:44 - mmengine - INFO - Iter(val) [ 800/1280] eta: 0:00:46 time: 0.0959 data_time: 0.0154 memory: 825
03/20 13:56:54 - mmengine - INFO - Iter(val) [ 900/1280] eta: 0:00:36 time: 0.0973 data_time: 0.0158 memory: 825
03/20 13:57:04 - mmengine - INFO - Iter(val) [1000/1280] eta: 0:00:27 time: 0.0990 data_time: 0.0160 memory: 825
03/20 13:57:14 - mmengine - INFO - Iter(val) [1100/1280] eta: 0:00:17 time: 0.0994 data_time: 0.0155 memory: 825
03/20 13:57:24 - mmengine - INFO - Iter(val) [1200/1280] eta: 0:00:07 time: 0.0991 data_time: 0.0155 memory: 825
03/20 13:57:33 - mmengine - INFO - Iter(val) [1280/1280] MAE: 0.0126 PSNR: 36.7611 SSIM: 0.8853 data_time: 0.0178 time: 0.1006
03/20 13:57:34 - mmengine - INFO - The best checkpoint with 36.7611 PSNR at 2000 iter is saved to best_PSNR_iter_2000.pth.
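
To put a number on the imbalance in the log, the share of each iteration spent waiting for data can be computed as data_time / time. The snippet below is only a rough sketch: the helper name `data_time_fraction` and the regular expressions are illustrative and not part of mmengine, and the two sample lines are copied from the log above with the timestamp prefix removed.

```python
import re

# \b keeps "time:" from matching inside "data_time:" ("_" is a word character).
TIME_RE = re.compile(r"\btime: ([\d.]+)")
DATA_RE = re.compile(r"\bdata_time: ([\d.]+)")


def data_time_fraction(line):
    """Return data_time / time for one log line, or None if a field is missing."""
    t = TIME_RE.search(line)
    d = DATA_RE.search(line)
    if t is None or d is None:
        return None
    return float(d.group(1)) / float(t.group(1))


# Two lines taken from the log above (hypothetical variable name).
sample_lines = [
    "Iter(train) [ 1000/20000] lr: 1.9877e-03 eta: 20:43:17 "
    "time: 3.8284 data_time: 3.5698 memory: 6946 loss: -38.3180",
    "Iter(val) [ 200/1280] eta: 0:01:48 time: 0.1008 data_time: 0.0173 memory: 825",
]

fractions = [data_time_fraction(line) for line in sample_lines]
for line, frac in zip(sample_lines, fractions):
    print(f"{frac:.1%}  {line[:24]}")
```

For the training line this comes to roughly 93% of the iteration spent on data, against roughly 17% for the validation line, which is consistent with the low GPU utilization reported in the title.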

Additional information

I want to know whether setting 'pin_memory=True' would help.
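
For reference, this is roughly what such an override might look like in an MMEngine-style config. It is a sketch under assumptions, not a verified fix: it assumes that extra keys in `train_dataloader` are forwarded to `torch.utils.data.DataLoader`, and every value below is a hypothetical starting point rather than a setting taken from the NAFNet config. Since `pin_memory` mainly affects the host-to-GPU copy, the worker-related settings may matter more if the time is going into reading and decoding images.

```python
# Hypothetical override for a derived config file; values are illustrative only.
train_dataloader = dict(
    num_workers=8,            # assumption: more worker processes for loading/augmentation
    persistent_workers=True,  # keep workers alive instead of respawning them
    pin_memory=True,          # page-locked host memory for faster host-to-GPU copies
    prefetch_factor=4,        # batches prefetched per worker (requires num_workers > 0)
)
```

Comparing data_time in the log before and after the change would show whether any of these settings actually helps on a given machine.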