Closed JingyangXiang closed 3 months ago
If this is your first time, please read our contributor guidelines: https://github.com/mindspore-lab/mindcv/blob/main/CONTRIBUTING.md
Describe the bug / 问题描述 (Mandatory / 必填): The model crashes suddenly during training.
Hardware Environment / 硬件环境: Ascend
Software Environment / 软件环境 (Mandatory / 必填): -- MindSpore version (e.g., 1.10.1): -- Python version (e.g., Python 3.7.10):
Execute Mode / 执行模式 (Mandatory / 必填): Graph
To Reproduce / 重现步骤 (Mandatory / 必填): Occurs sporadically; no deterministic reproduction steps.
Expected behavior / 预期结果 (Mandatory / 必填): The model trains stably and converges normally.
Screenshots / Logs / 日志 (Mandatory / 必填)
Hypothesis:
if self.clip_grad:
    grads = ops.clip_by_global_norm(grads, clip_norm=self.clip_value)
Gradient clipping should be applied after the gradient all_reduce; after moving it, the crash no longer occurred. Mathematical reason (likely): when each worker clips its local gradient before the reduction, every worker applies a different scale factor, and with a sum-style reduction the reduced gradient's norm is only bounded by num_workers * clip_norm rather than clip_norm, so the final update is neither properly norm-bounded nor aligned with the true global gradient.
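A minimal NumPy sketch of the hypothesis above (hypothetical data and helper names, not mindcv code): clipping each worker's gradient locally before a sum-style all_reduce does not bound the reduced gradient's global norm, while clipping once after the reduction does.

```python
import numpy as np

def clip_by_global_norm(grads, clip_norm):
    """Scale a list of gradient arrays so their joint L2 norm is at most clip_norm."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, clip_norm / (global_norm + 1e-12))
    return [g * scale for g in grads]

def all_reduce_sum(worker_grads):
    """Simulate all_reduce(SUM): element-wise sum of per-worker gradient lists."""
    return [np.sum(stack, axis=0) for stack in zip(*worker_grads)]

def global_norm(grads):
    return np.sqrt(sum(np.sum(g ** 2) for g in grads))

rng = np.random.default_rng(0)
num_workers, clip = 8, 1.0
worker_grads = [[rng.normal(size=4), rng.normal(size=3)] for _ in range(num_workers)]

# Clip BEFORE reduction: each local norm <= clip, but the sum's norm
# can grow up to num_workers * clip.
before = all_reduce_sum([clip_by_global_norm(g, clip) for g in worker_grads])

# Clip AFTER reduction: the final global norm is guaranteed <= clip.
after = clip_by_global_norm(all_reduce_sum(worker_grads), clip)

print(global_norm(before))  # typically well above clip
print(global_norm(after))   # <= clip
```

The same scale-factor mismatch exists even with a mean-style reduction: the averaged gradient stays norm-bounded, but it is no longer a rescaled version of the true global gradient.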
Duplicate of #603.
We are working on a new train step. Note, however, that the custom train step is an experimental feature and may undergo incompatible changes in the future.
TODO: When should the gradient be clipped? Do we need to clip after gradient reduction? And what if gradient accumulation is needed?
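For the gradient-accumulation question, one common pattern (a sketch under assumed names, not the mindcv implementation) is to accumulate raw, unclipped micro-batch gradients, average them, and apply global-norm clipping only once before the optimizer step; clipping each micro-batch gradient separately would distort the accumulated update.

```python
import numpy as np

def clip_by_global_norm(grads, clip_norm):
    """Scale a list of gradient arrays so their joint L2 norm is at most clip_norm."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, clip_norm / (global_norm + 1e-12))
    return [g * scale for g in grads]

def accumulate_then_clip(micro_batch_grads, accum_steps, clip_norm):
    """Accumulate unclipped micro-batch gradients, average, then clip once."""
    accum = [np.zeros_like(g) for g in micro_batch_grads[0]]
    for grads in micro_batch_grads:
        accum = [a + g for a, g in zip(accum, grads)]
    # Average over accumulation steps so the clip threshold is comparable
    # to a single full-batch gradient.
    accum = [a / accum_steps for a in accum]
    return clip_by_global_norm(accum, clip_norm)

rng = np.random.default_rng(1)
micro = [[rng.normal(size=5)] for _ in range(4)]  # 4 micro-batches, 1 parameter
update = accumulate_then_clip(micro, accum_steps=4, clip_norm=1.0)
print(np.linalg.norm(update[0]))  # <= 1.0
```

Combined with the distributed case above, the ordering would be: accumulate locally, reduce across workers, then clip once on the final gradient.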