mindspore-lab / mindcv

A toolbox of vision models and algorithms based on MindSpore
https://mindspore-lab.github.io/mindcv/
Apache License 2.0
235 stars, 143 forks

Training suddenly collapses partway through model training #728

Closed JingyangXiang closed 3 months ago

JingyangXiang commented 1 year ago

If this is your first time, please read our contributor guidelines: https://github.com/mindspore-lab/mindcv/blob/main/CONTRIBUTING.md

Describe the bug (Mandatory): The model suddenly collapses during training.

To Reproduce (Mandatory): Occurs intermittently.

Expected behavior (Mandatory): The model trains stably and converges normally.

Screenshots / logs (Mandatory): 917efa5d6cb67d9db0bbabc89125457

My guess:

    # todo: When to clip grad? Do we need to clip grad after grad reduction? What if grad accumulation is needed?
    if self.clip_grad:
        grads = ops.clip_by_global_norm(grads, clip_norm=self.clip_value)

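Gradient clipping should be applied after the gradient all_reduce. After I made this change, the collapse no longer occurred. The mathematical reasoning:

  1. Clip first, then all_reduce (the current behavior): each device clips its own gradients, so some devices clip and others do not before the results are reduced.
  2. Take the mean first, then clip the averaged gradient as a whole.

With the current approach, the direction of the overall gradient vector can be biased, which causes problems when training gradient-sensitive models.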
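The difference between the two orderings can be seen in a toy example. The sketch below is plain Python, not MindSpore code: `clip_by_norm`, `mean`, and `cosine` are hypothetical helpers written for illustration, the all_reduce is assumed to be a simple mean across devices, and the two "device" gradients are made-up numbers.

```python
import math

def global_norm(vec):
    # L2 norm of a flat gradient vector.
    return math.sqrt(sum(x * x for x in vec))

def clip_by_norm(vec, clip_norm):
    # Hypothetical helper: rescale vec so its norm is at most clip_norm.
    scale = clip_norm / max(global_norm(vec), clip_norm)
    return [x * scale for x in vec]

def mean(vectors):
    # Stand-in for an all_reduce that averages gradients across devices.
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(a, b):
    # Cosine similarity, used to compare gradient directions.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (global_norm(a) * global_norm(b))

# Made-up per-device gradients: one large, one small.
device_grads = [[10.0, 0.0], [0.0, 1.0]]
clip_norm = 1.0

true_mean = mean(device_grads)

# Ordering 1: each device clips locally, then the results are averaged.
clip_then_mean = mean([clip_by_norm(g, clip_norm) for g in device_grads])

# Ordering 2: average first, then clip the averaged gradient once.
mean_then_clip = clip_by_norm(true_mean, clip_norm)

print("clip then mean:", clip_then_mean, "cosine:", cosine(clip_then_mean, true_mean))
print("mean then clip:", mean_then_clip, "cosine:", cosine(mean_then_clip, true_mean))
```

In this example only the first device's gradient is clipped under ordering 1, so the averaged result points in a different direction from the true mean gradient. Ordering 2 only rescales the mean, so its direction is preserved.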

geniuspatrick commented 1 year ago

Duplicate of #603

geniuspatrick commented 1 year ago

We are working on a new trainstep. However, it should be noted that custom train step is an experimental feature and may undergo incompatible changes in the future.