mit-han-lab / bevfusion

[ICRA'23] BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation
https://bevfusion.mit.edu
Apache License 2.0

LR #174

Closed jiapeng789 closed 1 year ago

jiapeng789 commented 1 year ago

Hello, I saw in a previous reply that you trained the model with 8 GPUs. When I trained with two 3090 GPUs, the gradients exploded, and I suspect the learning rate was too large. When training with a different number of GPUs, how should the learning rate and the number of epochs be adjusted? In my case, should I scale the learning rate down to 1/4 of the original and increase the epochs to 30? Looking forward to hearing from you.
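For reference, the adjustment being asked about here is usually the "linear scaling rule": keep the per-GPU batch size fixed and scale the base learning rate by the ratio of effective batch sizes. A minimal sketch, assuming the baseline numbers mentioned in this thread (8 GPUs, batch size 4 per GPU, lr 1.0e-4); the function name is illustrative, not taken from the BEVFusion code:

```python
# Hypothetical helper illustrating the linear LR scaling heuristic.
# Baseline numbers (8 GPUs, batch size 4 per GPU, lr 1.0e-4) come from this
# thread, not from the official config.

def scale_lr(base_lr: float, base_gpus: int, base_bs_per_gpu: int,
             gpus: int, bs_per_gpu: int) -> float:
    """Scale the learning rate proportionally to the effective batch size."""
    base_batch = base_gpus * base_bs_per_gpu
    new_batch = gpus * bs_per_gpu
    return base_lr * new_batch / base_batch

# Example: 2 GPUs instead of 8, same per-GPU batch size.
print(scale_lr(1.0e-4, base_gpus=8, base_bs_per_gpu=4, gpus=2, bs_per_gpu=4))
# -> 2.5e-05, i.e. 1/4 of the original, matching the "1/4" guess above.
```

Note that later in this thread the author says he did not scale the LR with batch size and found the results relatively stable with respect to the starting LR, so treat this purely as a starting-point heuristic.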

kentang-mit commented 1 year ago

I haven't tried training with fewer than 8 GPUs, but you can try reducing the learning rate. By the way, gradient norm = nan at the starting phase is normal; you can wait for several thousand iterations and see whether the gradient norm becomes stable. This is because we used a gradient scaler that can adaptively change the scaling ratio.
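The nan gradient norms at the start are the expected behavior of a dynamic loss scaler. A minimal sketch of how such a scaler works in PyTorch (torch.cuda.amp.GradScaler); BEVFusion's actual training loop lives in mmcv/mmdet3d hooks, so this only illustrates the mechanism, and the toy model, data and max_norm value are assumptions:

```python
import torch
from torch import nn
from torch.cuda.amp import GradScaler, autocast

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(16, 1).to(device)                 # toy stand-in for the detector
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = GradScaler(enabled=(device == "cuda"))     # starts at a large scale, shrinks on overflow

for step in range(100):
    x = torch.randn(4, 16, device=device)
    y = torch.randn(4, 1, device=device)
    optimizer.zero_grad()
    with autocast(enabled=(device == "cuda")):
        loss = nn.functional.mse_loss(model(x), y)
    scaler.scale(loss).backward()      # gradients are computed w.r.t. the scaled loss
    scaler.unscale_(optimizer)         # unscale before clipping / logging the grad norm
    grad_norm = nn.utils.clip_grad_norm_(model.parameters(), max_norm=35.0)  # 35 is just an example value
    scaler.step(optimizer)             # skipped automatically if inf/nan gradients are detected
    scaler.update()                    # scale is lowered after an overflow, otherwise slowly raised
```

While the scale is still settling, the logged gradient norm can read inf/nan even though training is healthy, which is what the reply above refers to.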

jiapeng789 commented 1 year ago

Thank you for your reply! I set the initial learning rate of the optimizer in the LiDAR-only object detection model to 1/2 of the original (from 1.0e-4 to 5.0e-5) with epochs=20, keeping everything else the same, and there was no gradient explosion during training. I have now trained for 12 epochs and the evaluation results on the validation set have stabilized at mAP≈0.5042, NDS≈0.5962, which is still far from the results published in the paper (mAP=64.68). I don't know whether this gap is caused by the small learning rate. Do you have any suggestions for the training?

kentang-mit commented 1 year ago

Actually my schedule is almost the same as the official TransFusion paper (probably the only difference is that they manually restart at epoch 15), and the schedule itself is used by almost all 3D object detection papers. It would be a bit hard for me to quickly provide an alternative schedule that works equally well with a small number of GPUs. What I can say is that the GT-paste fading strategy (gt_paste_stop_epoch: 15 in our configuration) brings roughly a 5 mAP improvement over the baseline, so if you used a shorter schedule, some of the gap can be explained by that. This strategy does not matter that much for the fusion models, though.
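The fading strategy mentioned here means the GT-paste (ground-truth sample-and-paste) augmentation is switched off for the last few epochs. A minimal sketch of the schedule logic, assuming the values from this thread (stop at epoch 15 out of 20); the names and the commented training call are illustrative, not the repo's actual hook:

```python
# Illustrative sketch of a GT-paste fading schedule (not the actual BEVFusion hook).
GT_PASTE_STOP_EPOCH = 15  # matches gt_paste_stop_epoch in the config mentioned above
TOTAL_EPOCHS = 20

def use_gt_paste(epoch: int) -> bool:
    """GT-paste is applied in the early epochs and disabled for the last ones."""
    return epoch < GT_PASTE_STOP_EPOCH

for epoch in range(TOTAL_EPOCHS):
    augment = use_gt_paste(epoch)
    # train_one_epoch(model, dataloader, enable_gt_paste=augment)  # placeholder call
    print(f"epoch {epoch:02d}: gt_paste {'on' if augment else 'off'}")
```

Training on pasted ground-truth boxes early on speeds up convergence, and disabling them near the end lets the detector adapt back to the real data distribution, which is where the ~5 mAP gain mentioned above comes from.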

jiapeng789 commented 1 year ago

Thank you for your reply. The training of the LiDAR-only object detection model was done with two GPUs, giving mAP=0.6344, NDS=0.6855. Although there is still a gap with your published results (mAP=0.6468, NDS=0.6828), I think it can be improved by adjusting the learning rate. Now I am training the LiDAR + camera fusion detection model on 3090 GPUs, but I have encountered the following problems:

  1. When the input image resolution is 256×704, an error is reported: RuntimeError: CUDA out of memory. Tried to allocate 2.0 GiB. I reduced the batch size to 1, but exactly the same error is still reported; when the input image resolution is set to 128×352 with batch size 4, the model trains normally. But I am a little confused: when I change the batch size from 4 to 1, I get exactly the same error (Tried to allocate 2.0 GiB), so I am not sure whether this error is really caused by running out of memory or by something else (see the memory-logging sketch after this comment).
  2. During the training of the LiDAR+camera fusion detection model, did you use 128×352 resolution images as input? If you have done the same experiment, I hope you can publish the results.
  3. As you said in other issues, this work uses the same fusion-model training recipe as TransFusion. TransFusion sets the training epochs to 6 for the fusion stage; did you also set epochs=6 for the camera+LiDAR fusion, or epochs=20?

Looking forward to hearing from you, thank you!
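One way to make the out-of-memory question in point 1 concrete is to log the allocator state around the failing step; if the failing allocation size does not change with the batch size, a common explanation is that it belongs to a single per-sample intermediate rather than to the batched activations. A minimal sketch using standard PyTorch CUDA memory utilities; the commented training call is a placeholder:

```python
import torch

def report_cuda_memory(tag: str) -> None:
    """Print allocator statistics so a failing allocation can be put in context."""
    alloc = torch.cuda.memory_allocated() / 2**30
    reserved = torch.cuda.memory_reserved() / 2**30
    peak = torch.cuda.max_memory_allocated() / 2**30
    print(f"[{tag}] allocated={alloc:.2f} GiB reserved={reserved:.2f} GiB peak={peak:.2f} GiB")

# Illustrative wrapper around one training step (train_step/model/batch are placeholders):
# try:
#     loss = train_step(model, batch)
# except RuntimeError:  # "CUDA out of memory" is raised as a RuntimeError
#     report_cuda_memory("OOM")
#     raise
```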

kentang-mit commented 1 year ago

Hi @jiapeng789,

Seems that the LiDAR-only results are much better now. I guess if you cannot close the gap, it might be related to the number of GPUs.

  1. For fusion models, we used 256x704 by default, and this resolution is already smaller than in other papers (I remember that TransFusion, DeepInteraction and PointAugmenting all used 448x800). A 3090 should be able to hold batch size=1 (we actually used mixed-precision training), but in our experiments we mainly used A6000 GPUs with 48G memory. To reduce GPU memory consumption, you can potentially freeze the camera backbone; according to the ablation studies in our paper, the accuracy degradation is not very large under a reduced schedule (see the freezing sketch below).
  2. That's not in our main results so I never trained it for a full schedule.
  3. Yes, we finetune for six epochs. Configs will be released by the end of next month if everything goes well.
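Freezing the camera backbone, as suggested in point 1, essentially means turning off its gradients (and its BatchNorm statistics) before training, so its activations need no gradient buffers. A minimal PyTorch sketch; the attribute path model.encoders.camera.backbone mirrors the config key used elsewhere in this thread but should be treated as an assumption, and the toy model below is only a stand-in:

```python
import torch
from torch import nn

def freeze_module(module: nn.Module) -> None:
    """Disable gradients and fix BatchNorm statistics for one submodule."""
    for param in module.parameters():
        param.requires_grad_(False)
    module.eval()  # keeps BN running stats from updating during training

# Toy stand-in for the detector; in BEVFusion the camera backbone would be
# reached through something like model.encoders.camera.backbone (assumed path).
model = nn.ModuleDict({
    "camera_backbone": nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8)),
    "head": nn.Linear(8, 10),
})
freeze_module(model["camera_backbone"])

# Pass only the remaining trainable parameters to the optimizer.
optimizer = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=1e-4)
print(sum(p.requires_grad for p in model.parameters()), "trainable tensors left")
```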

Best, Haotian

jiapeng789 commented 1 year ago

Hi @kentang-mit

When setting the image input resolution to [128, 352], the results after fine-tuning are very poor, so I have reset the image input resolution to [256, 704]. However, I accidentally found a puzzling problem: when using "swin_tiny_patch4_window7_224.pth" to initialize the camera backbone, the weight values in the checkpoint match the weight values of the camera backbone after initialization. But when using "--model.encoders.camera.backbone.init_cfg.checkpoint pretrained/swint-nuimages-pretrained.pth" to initialize the camera backbone, the weights after model initialization are not the same as those in "swint-nuimages-pretrained.pth", which means the pretrained weights did not successfully initialize the camera backbone. I don't know the cause of this problem; have you encountered anything similar?

Best, Jiapeng

kentang-mit commented 1 year ago

That's interesting. Is it possible for you to share the warning returned by mmcv? You probably also need to check whether your mmcv version is the same as ours (1.4.0).

jiapeng789 commented 1 year ago

I checked the conda environment I created: mmcv-full==1.4.0, and the versions of the other dependencies also meet the requirements. The camera backbone did not report any warnings or errors while loading the pretrained weights. It can successfully load the weights in "swin_tiny_patch4_window7_224.pth" but cannot load the weights in "swint-nuimages-pretrained.pth"; when loading the pretrained weights from "swint-nuimages-pretrained.pth", the camera backbone is randomly initialized instead of using the weights from the file. In addition, I did the same check in the docker environment, and again the camera backbone can load "swin_tiny_patch4_window7_224.pth" successfully but not "swint-nuimages-pretrained.pth". I'm not sure how to fix this; is it possible that the "swint-nuimages-pretrained.pth" weights file is wrong?
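One way to make this check reproducible is to inspect the checkpoint's keys and compare a tensor against the built model directly. A minimal sketch; the file path comes from this thread, while the key prefix handling and the example key/attribute path are assumptions (detector checkpoints often nest weights under "state_dict" and prefix them with "backbone.", which can make init_cfg silently fall back to random initialization):

```python
import torch

CKPT_PATH = "pretrained/swint-nuimages-pretrained.pth"  # path taken from this thread

ckpt = torch.load(CKPT_PATH, map_location="cpu")
state = ckpt.get("state_dict", ckpt)  # detector checkpoints usually nest weights here

# Print a few keys to see how they are prefixed; a mismatch such as
# "backbone.patch_embed.proj.weight" vs. "patch_embed.proj.weight" is a common
# reason why the pretrained weights are silently not applied.
for k in list(state)[:5]:
    print(k, tuple(state[k].shape))

# After building the model, compare one tensor explicitly (attribute path is an
# assumption mirroring the config key used in this thread):
# loaded = state["backbone.patch_embed.proj.weight"]
# current = model.encoders.camera.backbone.patch_embed.proj.weight.detach().cpu()
# print(torch.allclose(loaded, current))
```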

Best, Jiapeng

kentang-mit commented 1 year ago

I will investigate it. Please stay tuned.

jiapeng789 commented 1 year ago

Hi @kentang-mit

When training the fusion detection model, I tried a variety of methods and still couldn't use the pretrained "swint-nuimages-pretrained.pth" to initialize the camera backbone. If "swin_tiny_patch4_window7_224.pth" is used as the pretrained weights for the camera backbone instead, how much difference would there be in the final training accuracy?

Best, Jiapeng

kentang-mit commented 1 year ago

The accuracy difference will be very small. I expect that difference to be within 0.2% in mAP.

zlenyk commented 1 year ago

@kentang-mit could you clarify whether you scale the LR with the batch size? Or do you just use LR = 1e-4 with AdamW for a batch size of 32 (8 GPUs × 4)?

Thank you!

kentang-mit commented 1 year ago

I did not try scaling the LR with the batch size in my experiments, but I did try several starting LRs. It seems that the results are relatively stable w.r.t. the starting LR.

kentang-mit commented 1 year ago

Closed due to inactivity.