junjie18 / CMT

[ICCV 2023] Cross Modal Transformer: Towards Fast and Robust 3D Object Detection

Training result problem: grad_norm=nan appears during training #48

Open PrymceQ opened 1 year ago

PrymceQ commented 1 year ago

Some details:

  1. Trained on 4 Nvidia 3090 GPUs;
  2. Modified the config following this issue: Flash-Attn issue

Nothing else was changed. Problems found after training for 20 epochs:

  1. grad_norm=nan appeared around epoch 6/7, and the loss rose from about 15 to about 40 by the end. [Screenshot 2023-07-11 15-24-35]

  2. The results are very poor, essentially equivalent to no training at all. [Screenshot 2023-07-11 15-24-21]

junjie18 commented 1 year ago

@PrymceQ For the 3090 I recommend using flash attn; there is no need to modify the config. Your problem should be solvable by lowering the max learning rate. Also, with 4 GPUs your effective batch size is halved, so in theory you should scale the learning rate down accordingly.
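
For example, a rough sketch of the change in the fusion config (the optimizer type and weight_decay here are placeholders; keep whatever your config already sets, only the lr matters):

    optimizer = dict(
        type='AdamW',        # placeholder; keep the optimizer your config already uses
        lr=0.00007,          # halved from the default 0.00014 to match 4 GPUs
        weight_decay=0.01)   # placeholder value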

PrymceQ commented 1 year ago

OK, I'll try lowering the lr first.

changxu-zhang commented 1 year ago

Hi,

I also got a similar result after training with the fusion config 1600*640 on the mini dataset for 20 epochs. The loss has stayed around 28 since epoch 5. I trained on a single A6000, so it is probably not a flash-attn problem.

Could you please give some hints on tuning, such as modifying the learning rate or batch size?

Thanks

junjie18 commented 1 year ago

@changxu-zhang Try replacing the 8 with something below 4 here, to bring the max learning rate below the point where training breaks.
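
Assuming the linked line is the target_ratio of the cyclic lr_config (an assumption; check the line the link points to), the edit would look roughly like:

    lr_config = dict(
        policy='cyclic',
        target_ratio=(4, 0.0001),  # first element was 8; peak lr = base lr * this value
        cyclic_times=1,            # placeholders; keep the other fields as they are
        step_ratio_up=0.4)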

changxu-zhang commented 1 year ago

@junjie18 Thanks for your reply

@changxu-zhang Try replacing the 8 with something below 4 here, to bring the max learning rate below the point where training breaks.

I trained with target_ratio=(4, 0.0001) this time, but the result still seems unsatisfactory. After 20 epochs the loss ended up around 20 and the learning rate at 3.031e-08. The learning rate initially went up from 1e-4 to 4e-4 and then decayed down to 3.031e-08. Is this a learning-rate problem or a batch-size problem? (I'm not sure where batch_size can be modified; I just found one here.)

Thanks

Screenshot from 2023-07-15 17-56-08

junjie18 commented 1 year ago

@changxu-zhang samples_per_gpu is the batch size on a single GPU. Can you also provide a training loss curve?
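
It sits in the data dict of the config; a minimal sketch (the values here are examples, not the defaults):

    data = dict(
        samples_per_gpu=2,   # per-GPU batch size; effective batch size = samples_per_gpu * num_gpus
        workers_per_gpu=4)   # dataloader workers only; does not change the effective batch size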

PrymceQ commented 1 year ago

Just recording an experimental result here. I changed the optimizer lr from 0.00014 to 0.00007, but grad_norm still becomes nan. However, the nan now only appears at epoch 9, after the lr has gone through its peak of 0.00007*6=0.00042, instead of at epoch 6-7 as before. (To clarify: due to GPU limitations, we use the config CMT/projects/configs/fusion/cmt_voxel0100_r50_800x320_cbgs.py.)

We then evaluated the checkpoint from epoch 8 and the results look reasonable. [Screenshot 2023-07-17 11-14-50]

But training broke again at epoch 16, presumably due to exploding gradients. [Screenshot 2023-07-17 11-36-42]

Next I will try target_ratio=(3, 0.0001), reducing the first element from 6 to 3.

junjie18 commented 1 year ago

@PrymceQ @changxu-zhang You can also try setting loss_scale='dynamic' in the optimizer_config, to avoid overflow in FP16.

PrymceQ commented 1 year ago

grad_norm=nan still appears starting from epoch 9, but the overall results are now normal. [Screenshot 2023-07-24 10-20-45]

SISTMrL commented 1 year ago

grad_norm=nan still appears starting from epoch 9, but the overall results are now normal. [Screenshot 2023-07-24 10-20-45]

Hi, did your final training result reach the 67.9 reported in the repo?

sun-0704 commented 1 year ago

What exactly does the parameter target_ratio=(3, 0.0001) mean?

DanielZ98 commented 8 months ago

What exactly does the parameter target_ratio=(3, 0.0001) mean?

See the cyclical learning rate paper https://arxiv.org/abs/1506.01186 and the mmcv implementation https://github.com/open-mmlab/mmcv/blob/v1.6.0/mmcv/runner/hooks/lr_updater.py#L234
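
In short, in mmcv's CyclicLrUpdaterHook the two numbers are the ratios of the highest and lowest lr relative to the base lr; roughly, with the base lr of 1e-4 mentioned above:

    lr_config = dict(
        policy='cyclic',
        # target_ratio = (peak_ratio, final_ratio), both relative to the base lr:
        #   peak lr  ~= 1e-4 * 3      = 3e-4
        #   final lr ~= 1e-4 * 0.0001 = 1e-8
        target_ratio=(3, 0.0001))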

daxiongpro commented 6 months ago

Hello, what was your final configuration, including lr and target_ratio? And was load_from just the image backbone, or the author's fully trained weights? @PrymceQ

daxiongpro commented 4 months ago

I suspect the culprit behind grad_norm=nan is this optimizer_config in the config file:

    optimizer_config = dict(
        type='CustomFp16OptimizerHook',
        loss_scale='dynamic',
        grad_clip=dict(max_norm=35, norm_type=2),
        custom_fp16=dict(pts_voxel_encoder=False, pts_middle_encoder=False, pts_bbox_head=False))

I reproduced CMT in another project; grad_norm=nan appeared right at the start of training and the loss would not decrease. After removing CustomFp16OptimizerHook, the loss went down normally.
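
Concretely, one option is to fall back to mmcv's default fp32 OptimizerHook while keeping gradient clipping; a minimal sketch of what removing the hook could look like (this trades memory/speed for numerical stability, and is one possible setting, not the authors' official config):

    # fp32 training: drop the custom FP16 hook, keep only gradient clipping
    optimizer_config = dict(grad_clip=dict(max_norm=35, norm_type=2))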

zzy-ucas commented 2 months ago

I suspect the culprit behind grad_norm=nan is this optimizer_config in the config file:

    optimizer_config = dict(
        type='CustomFp16OptimizerHook',
        loss_scale='dynamic',
        grad_clip=dict(max_norm=35, norm_type=2),
        custom_fp16=dict(pts_voxel_encoder=False, pts_middle_encoder=False, pts_bbox_head=False))

I reproduced CMT in another project; grad_norm=nan appeared right at the start of training and the loss would not decrease. After removing CustomFp16OptimizerHook, the loss went down normally.

@daxiongpro Thanks a lot, I ran into the same problem. How exactly do you remove CustomFp16OptimizerHook? The default is:

optimizer_config = dict(
    type='CustomFp16OptimizerHook',
    loss_scale='dynamic',
    grad_clip=dict(max_norm=35, norm_type=2),
    custom_fp16=dict(pts_voxel_encoder=False, pts_middle_encoder=False, pts_bbox_head=False))

What should it be changed to?