junjie18 / CMT

[ICCV 2023] Cross Modal Transformer: Towards Fast and Robust 3D Object Detection

Training result problem: grad_norm=nan appears during training #48

Open PrymceQ opened 1 year ago

PrymceQ commented 1 year ago

Some details:

  1. Trained on 4 Nvidia 3090 GPUs;
  2. Modified the config following this issue: Flash-Attn issue

Nothing else was changed. Problems found after training for 20 epochs:

  1. grad_norm=nan appeared around epoch 6/7, and the loss rose from about 15 to about 40 by the end. [Screenshot 2023-07-11 15-24-35]

  2. The results are very poor, essentially equivalent to no training at all. [Screenshot 2023-07-11 15-24-21]

junjie18 commented 1 year ago

@PrymceQ For the 3090 I recommend using flash attn; there is no need to modify the config. Your problem should be solvable by lowering the max learning rate. Also, with 4 GPUs your effective batch size is halved, so in theory you should scale the learning rate down accordingly.
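
For example, a rough sketch of the change in the fusion config (the optimizer type and weight_decay here are placeholders; keep whatever your config already sets, only the lr matters):

    optimizer = dict(
        type='AdamW',        # placeholder; keep the optimizer your config already uses
        lr=0.00007,          # halved from the default 0.00014 to match 4 GPUs
        weight_decay=0.01)   # placeholder value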

PrymceQ commented 1 year ago

OK, I'll try lowering the lr first.

changxu-zhang commented 1 year ago

Hi,

I also got a similar result after training with the fusion config 1600*640 on the mini dataset for 20 epochs. The loss has stayed around 28 since epoch 5. I trained on a single A6000, so it is probably not a flash-attn problem.

Could you please give some hints on tuning, such as modifying the learning rate or batch size?

Thanks

junjie18 commented 1 year ago

@changxu-zhang Try replacing the 8 with something below 4 here, to bring the max learning rate below the point where training breaks.
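
Assuming the linked line is the target_ratio of the cyclic lr_config (an assumption; check the line the link points to), the edit would look roughly like:

    lr_config = dict(
        policy='cyclic',
        target_ratio=(4, 0.0001),  # first element was 8; peak lr = base lr * this value
        cyclic_times=1,            # placeholders; keep the other fields as they are
        step_ratio_up=0.4)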

changxu-zhang commented 1 year ago

@junjie18 Thanks for your reply

@changxu-zhang Try replacing the 8 with something below 4 here, to bring the max learning rate below the point where training breaks.

I trained with target_ratio=(4, 0.0001) this time, but the result still seems unsatisfactory. After 20 epochs the loss ended up around 20 and the learning rate at 3.031e-08. The learning rate initially went up from 1e-4 to 4e-4 and then decayed down to 3.031e-08. Is this a learning-rate problem or a batch-size problem? (I'm not sure where batch_size can be modified; I just found one here.)

Thanks

Screenshot from 2023-07-15 17-56-08

junjie18 commented 1 year ago

@changxu-zhang samples_per_gpu is the batch size on a single GPU. Can you also provide a training loss curve?
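
It sits in the data dict of the config; a minimal sketch (the values here are examples, not the defaults):

    data = dict(
        samples_per_gpu=2,   # per-GPU batch size; effective batch size = samples_per_gpu * num_gpus
        workers_per_gpu=4)   # dataloader workers only; does not change the effective batch size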

PrymceQ commented 1 year ago

Just recording an experimental result here. I changed the optimizer lr from 0.00014 to 0.00007, but grad_norm still becomes nan. However, the nan now only appears at epoch 9, after the lr has gone through its peak of 0.00007*6=0.00042, instead of at epoch 6-7 as before. (To clarify: due to GPU limitations, we use the config CMT/projects/configs/fusion/cmt_voxel0100_r50_800x320_cbgs.py.)

We then evaluated the checkpoint from epoch 8 and the results look reasonable. [Screenshot 2023-07-17 11-14-50]

But training broke again at epoch 16, presumably due to exploding gradients. [Screenshot 2023-07-17 11-36-42]

Next I will try target_ratio=(3, 0.0001), reducing the first element from 6 to 3.

junjie18 commented 1 year ago

@PrymceQ @changxu-zhang You can also try setting loss_scale='dynamic' in the optimizer_config, to avoid overflow in FP16.

PrymceQ commented 1 year ago

grad_norm=nan still appears starting from epoch 9, but the overall results are now normal. [Screenshot 2023-07-24 10-20-45]

SISTMrL commented 1 year ago

grad_norm=nan still appears starting from epoch 9, but the overall results are now normal. [Screenshot 2023-07-24 10-20-45]

Hi, did your final training result reach the 67.9 reported in the repo?

sun-0704 commented 1 year ago

What exactly does the parameter target_ratio=(3, 0.0001) mean?

DanielZ98 commented 8 months ago

What exactly does the parameter target_ratio=(3, 0.0001) mean?

See the cyclical learning rate paper https://arxiv.org/abs/1506.01186 and the mmcv implementation https://github.com/open-mmlab/mmcv/blob/v1.6.0/mmcv/runner/hooks/lr_updater.py#L234
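
In short, in mmcv's CyclicLrUpdaterHook the two numbers are the ratios of the highest and lowest lr relative to the base lr; roughly, with the base lr of 1e-4 mentioned above:

    lr_config = dict(
        policy='cyclic',
        # target_ratio = (peak_ratio, final_ratio), both relative to the base lr:
        #   peak lr  ~= 1e-4 * 3      = 3e-4
        #   final lr ~= 1e-4 * 0.0001 = 1e-8
        target_ratio=(3, 0.0001))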

daxiongpro commented 6 months ago

Hello, what was your final configuration, including lr and target_ratio? And was load_from just the image backbone, or the author's fully trained weights? @PrymceQ

daxiongpro commented 4 months ago

I suspect the culprit behind grad_norm=nan is this optimizer_config in the config file:

    optimizer_config = dict(
        type='CustomFp16OptimizerHook',
        loss_scale='dynamic',
        grad_clip=dict(max_norm=35, norm_type=2),
        custom_fp16=dict(pts_voxel_encoder=False, pts_middle_encoder=False, pts_bbox_head=False))

I reproduced CMT in another project; grad_norm=nan appeared right at the start of training and the loss would not decrease. After removing CustomFp16OptimizerHook, the loss went down normally.
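
Concretely, one option is to fall back to mmcv's default fp32 OptimizerHook while keeping gradient clipping; a minimal sketch of what removing the hook could look like (this trades memory/speed for numerical stability, and is one possible setting, not the authors' official config):

    # fp32 training: drop the custom FP16 hook, keep only gradient clipping
    optimizer_config = dict(grad_clip=dict(max_norm=35, norm_type=2))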

zzy-ucas commented 2 months ago

I suspect the culprit behind grad_norm=nan is this optimizer_config in the config file:

    optimizer_config = dict(
        type='CustomFp16OptimizerHook',
        loss_scale='dynamic',
        grad_clip=dict(max_norm=35, norm_type=2),
        custom_fp16=dict(pts_voxel_encoder=False, pts_middle_encoder=False, pts_bbox_head=False))

I reproduced CMT in another project; grad_norm=nan appeared right at the start of training and the loss would not decrease. After removing CustomFp16OptimizerHook, the loss went down normally.

@daxiongpro Thanks a lot, I ran into the same problem. How exactly do you remove CustomFp16OptimizerHook? The default is:

optimizer_config = dict(
    type='CustomFp16OptimizerHook',
    loss_scale='dynamic',
    grad_clip=dict(max_norm=35, norm_type=2),
    custom_fp16=dict(pts_voxel_encoder=False, pts_middle_encoder=False, pts_bbox_head=False))

What should it be changed to?