JingyangXiang commented 1 year ago

If this is your first time, please read our contributor guidelines: https://github.com/mindspore-lab/mindcv/blob/main/CONTRIBUTING.md

Describe the bug (Mandatory): The model suddenly collapses during training.

Hardware Environment (Ascend):
Software Environment (Mandatory):
-- MindSpore version (e.g., 1.10.1):
-- Python version (e.g., Python 3.7.10):
Execute Mode (Mandatory) (Graph):

To Reproduce (Mandatory): The problem occurs intermittently.

Expected behavior (Mandatory): The model trains stably and converges normally.

Screenshots / Logs (Mandatory): [image: 917efa5d6cb67d9db0bbabc89125457]

Hypothesis:

    # todo: When to clip grad? Do we need to clip grad after grad reduction? What if grad accumulation is needed?
    if self.clip_grad:
        grads = ops.clip_by_global_norm(grads, clip_norm=self.clip_value)

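Gradient clipping should be applied after the gradient all_reduce. After I made this change, the collapse no longer occurred. Mathematical reasoning:

Clip first, then all_reduce (current behavior): each device clips its own gradients independently, so some gradients are clipped and others are not before they are combined.
Take the mean first, then clip the result as a whole (proposed).

The current scheme can bias the direction of the overall gradient vector, which causes problems when training gradient-sensitive models.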
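
To make the direction argument concrete, here is a minimal pure-Python sketch that simulates two devices. It is an illustration only: the helper names (`global_norm`, `clip_by_global_norm`, `mean`, `cosine`) and the gradient values are made up for this example, and the all_reduce is assumed to be a plain mean. It does not use MindSpore or the mindcv train step.

```python
import math

def global_norm(grad):
    # L2 norm of a flat gradient vector.
    return math.sqrt(sum(g * g for g in grad))

def clip_by_global_norm(grad, clip_norm):
    # Rescale the vector only if its norm exceeds clip_norm.
    norm = global_norm(grad)
    scale = clip_norm / norm if norm > clip_norm else 1.0
    return [g * scale for g in grad]

def mean(grads):
    # Stand-in for an all_reduce that averages across devices (assumption).
    n = len(grads)
    return [sum(col) / n for col in zip(*grads)]

def cosine(a, b):
    # Cosine similarity: 1.0 means the two vectors point the same way.
    return sum(x * y for x, y in zip(a, b)) / (global_norm(a) * global_norm(b))

# Hypothetical per-device gradients: device 0 has a large gradient, device 1 a small one.
worker_grads = [[10.0, 0.0], [0.0, 1.0]]
clip_norm = 1.0

true_mean = mean(worker_grads)

# Current order: clip on each device, then average.
clip_then_reduce = mean([clip_by_global_norm(g, clip_norm) for g in worker_grads])

# Proposed order: average first, then clip the averaged gradient once.
reduce_then_clip = clip_by_global_norm(true_mean, clip_norm)

cos_current = cosine(clip_then_reduce, true_mean)
cos_proposed = cosine(reduce_then_clip, true_mean)
print("clip -> reduce:", clip_then_reduce, "cosine vs. true mean:", cos_current)
print("reduce -> clip:", reduce_then_clip, "cosine vs. true mean:", cos_proposed)
```

In this toy case only device 0 is clipped, so averaging afterwards gives both devices equal weight and the result no longer points along the true mean gradient. Clipping after the mean only rescales the averaged vector, so its direction is preserved.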

geniuspatrick commented 1 year ago

Duplicate of #603

geniuspatrick commented 1 year ago

We are working on a new train step. However, note that the custom train step is an experimental feature and may undergo incompatible changes in the future.


Training suddenly collapses during model training #728
