mindspore-lab / mindcv

A toolbox of vision models and algorithms based on MindSpore
https://mindspore-lab.github.io/mindcv/
Apache License 2.0

[xception] [Ascend910] [GRAPH] Unable to reproduce precision #716

Closed: 787918582 closed this issue 1 month ago

787918582 commented 1 year ago

If this is your first time, please read our contributor guidelines: https://github.com/mindspore-lab/mindcv/blob/main/CONTRIBUTING.md

Describe the bug (Mandatory) Accuracy is abnormal when xception is validated during training (validate-while-training).

To Reproduce (Mandatory) Steps to reproduce the behavior:

  1. mpirun --allow-run-as-root -n 8 python train.py --config configs/xception/xception_ascend.yaml --distribute True --data_dir /ImageNet_Origin/

Expected behavior (Mandatory) Training reproduces the target accuracy.

Screenshots / Logs (Mandatory)

[2023-07-18 11:49:54] mindcv.utils.callbacks INFO - Epoch: [195/200], batch: [5004/5004], loss: 6.885725, lr: 0.000073, time: 424.939602s
[2023-07-18 11:50:08] mindcv.utils.callbacks INFO - Validation Top_1_Accuracy: 0.2020%, Top_5_Accuracy: 0.9740%, time: 14.200049s
[2023-07-18 11:50:09] mindcv.utils.callbacks INFO - Saving model to ./ckpt/xception-195_5004.ckpt
[2023-07-18 11:50:10] mindcv.utils.callbacks INFO - Total time since last epoch: 441.484016(train: 424.948293, val: 14.200049)s, ETA: 2207.420082s
[2023-07-18 11:50:10] mindcv.utils.callbacks INFO - --------------------------------------------------------------------------------
[2023-07-18 11:57:15] mindcv.utils.callbacks INFO - Epoch: [196/200], batch: [5004/5004], loss: 6.873163, lr: 0.000047, time: 424.949698s
[2023-07-18 11:57:34] mindcv.utils.callbacks INFO - Validation Top_1_Accuracy: 0.1920%, Top_5_Accuracy: 0.9760%, time: 18.866158s
[2023-07-18 11:57:35] mindcv.utils.callbacks INFO - Saving model to ./ckpt/xception-196_5004.ckpt
[2023-07-18 11:57:36] mindcv.utils.callbacks INFO - Total time since last epoch: 446.140194(train: 424.957141, val: 18.866158)s, ETA: 1784.560777s
[2023-07-18 11:57:36] mindcv.utils.callbacks INFO - --------------------------------------------------------------------------------
[2023-07-18 12:04:41] mindcv.utils.callbacks INFO - Epoch: [197/200], batch: [5004/5004], loss: 6.878728, lr: 0.000026, time: 424.929116s
[2023-07-18 12:05:00] mindcv.utils.callbacks INFO - Validation Top_1_Accuracy: 0.2020%, Top_5_Accuracy: 1.0020%, time: 18.733638s
[2023-07-18 12:05:01] mindcv.utils.callbacks INFO - Saving model to ./ckpt/xception-197_5004.ckpt
[2023-07-18 12:05:02] mindcv.utils.callbacks INFO - Total time since last epoch: 446.031272(train: 424.937488, val: 18.733638)s, ETA: 1338.093816s
[2023-07-18 12:05:02] mindcv.utils.callbacks INFO - --------------------------------------------------------------------------------
[2023-07-18 12:12:07] mindcv.utils.callbacks INFO - Epoch: [198/200], batch: [5004/5004], loss: 6.857450, lr: 0.000012, time: 424.953714s
[2023-07-18 12:12:23] mindcv.utils.callbacks INFO - Validation Top_1_Accuracy: 0.2040%, Top_5_Accuracy: 1.0260%, time: 15.403413s
[2023-07-18 12:12:24] mindcv.utils.callbacks INFO - Saving model to ./ckpt/xception-198_5004.ckpt
[2023-07-18 12:12:25] mindcv.utils.callbacks INFO - Total time since last epoch: 442.684698(train: 424.961049, val: 15.403413)s, ETA: 885.369397s
[2023-07-18 12:12:25] mindcv.utils.callbacks INFO - --------------------------------------------------------------------------------
[2023-07-18 12:19:30] mindcv.utils.callbacks INFO - Epoch: [199/200], batch: [5004/5004], loss: 6.874672, lr: 0.000003, time: 424.949396s
[2023-07-18 12:19:44] mindcv.utils.callbacks INFO - Validation Top_1_Accuracy: 0.1960%, Top_5_Accuracy: 1.0020%, time: 13.526417s
[2023-07-18 12:19:45] mindcv.utils.callbacks INFO - Saving model to ./ckpt/xception-199_5004.ckpt
[2023-07-18 12:19:46] mindcv.utils.callbacks INFO - Total time since last epoch: 440.837706(train: 424.957480, val: 13.526417)s, ETA: 440.837706s
[2023-07-18 12:19:46] mindcv.utils.callbacks INFO - --------------------------------------------------------------------------------
[2023-07-18 12:26:51] mindcv.utils.callbacks INFO - Epoch: [200/200], batch: [5004/5004], loss: 6.879807, lr: 0.000000, time: 424.909221s
[2023-07-18 12:27:09] mindcv.utils.callbacks INFO - Validation Top_1_Accuracy: 0.2020%, Top_5_Accuracy: 1.0140%, time: 18.529595s
[2023-07-18 12:27:11] mindcv.utils.callbacks INFO - Saving model to ./ckpt/xception-200_5004.ckpt
[2023-07-18 12:27:12] mindcv.utils.callbacks INFO - Total time since last epoch: 445.814416(train: 424.917451, val: 18.529595)s, ETA: 0.000000s
[2023-07-18 12:27:12] mindcv.utils.callbacks INFO - --------------------------------------------------------------------------------
[2023-07-18 12:27:12] mindcv.utils.callbacks INFO - Finish training!
[2023-07-18 12:27:12] mindcv.utils.callbacks INFO - The best validation Top_1_Accuracy is: 0.8380% at epoch 1.
[2023-07-18 12:27:12] mindcv.utils.callbacks INFO - ================================================================================
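
A quick sanity check may help when reading the log above. The sketch below assumes the dataset is ImageNet-1k with 1000 classes (the issue only gives the path /ImageNet_Origin/, so the class count is an assumption) and that the reported loss is plain cross-entropy. Under those assumptions, a model that predicts uniformly at random would have a loss of ln(1000), a Top-1 accuracy of 0.1% and a Top-5 accuracy of 0.5%. The helper `parse_epoch_loss` and the regular expression are illustrative only and are not part of mindcv.

```python
import math
import re

# Assumption: ImageNet-1k, i.e. 1000 classes. Not stated in the issue.
NUM_CLASSES = 1000

# Cross-entropy of a uniform prediction over NUM_CLASSES classes.
chance_loss = math.log(NUM_CLASSES)
# Chance-level accuracies in percent for uniform random guessing.
chance_top1 = 100.0 * 1 / NUM_CLASSES
chance_top5 = 100.0 * 5 / NUM_CLASSES

# Hypothetical helper: matches the "Epoch: [e/E], ..., loss: x" log entries.
EPOCH_RE = re.compile(r"Epoch: \[(\d+)/(\d+)\].*?loss: ([\d.]+)")


def parse_epoch_loss(line):
    """Return (epoch, loss) from a training log entry, or None if no match."""
    match = EPOCH_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), float(match.group(3))


# Final training entry copied from the log in this issue.
sample = (
    "[2023-07-18 12:26:51] mindcv.utils.callbacks INFO - Epoch: [200/200], "
    "batch: [5004/5004], loss: 6.879807, lr: 0.000000, time: 424.909221s"
)
epoch, loss = parse_epoch_loss(sample)
gap = chance_loss - loss  # how far the final loss sits below chance level

print(f"epoch {epoch}: loss={loss:.6f}, chance loss={chance_loss:.6f}, gap={gap:.6f}")
print(f"chance Top-1={chance_top1:.1f}%, chance Top-5={chance_top5:.1f}%")
```

If the assumptions hold, the final loss is within about 0.03 of the chance-level value and the validation accuracies are of the same order as random guessing, which is consistent with a run that never started converging rather than one that fell slightly short of the target accuracy.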

Additional context (Optional) Add any other context about the problem here.

tacyi commented 7 months ago

Training with MindSpore_v2.2.10.B180 shows the same problem.