hollow-503 / UniM2AE

[ECCV2024] UniM2AE: Multi-modal Masked Autoencoders with Unified 3D Representation for 3D Perception in Autonomous Driving
Apache License 2.0

reproducing results #5

Closed konyul closed 10 months ago

konyul commented 11 months ago

Hi, thanks for sharing such great work. I am struggling to reproduce the camera+LiDAR results. What I did is as follows.

  1. Evaluating the released checkpoint works fine, so the model code I cloned is not the issue.
  2. The problem appears during training. When I start training the model loaded from the LiDAR-only checkpoint (which I reproduced and which works fine), the performance begins to drop. I suspect two possible causes:
     2.1. Learning rate. I use 4 GPUs with a batch size of 1 per GPU, so I think the lr should be 1/8 of the original (the original effective batch size is 8 GPUs × 4 = 32, versus my 4 × 1 = 4).
     2.2. Batch size. I cannot raise the per-GPU batch size to 2 because of GPU memory limits.

I am planning to train the LiDAR+camera model with lr/8. Is there any modification to the training schedule you would suggest?
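The lr adjustment described above follows the linear scaling rule (lr proportional to effective batch size). A minimal sketch, with the base lr value chosen purely for illustration rather than taken from the repo's configs:

```python
# Linear scaling rule: lr scales with the effective batch size
# (number of GPUs x per-GPU batch size).
def scaled_lr(base_lr, base_total_batch, gpus, batch_per_gpu):
    """Scale the learning rate proportionally to the effective batch size."""
    return base_lr * (gpus * batch_per_gpu) / base_total_batch

# Original setup (from the thread): 8 GPUs x batch 4 = 32 samples/step.
# New setup: 4 GPUs x batch 1 = 4 samples/step  ->  lr / 8.
BASE_LR = 2.0e-4  # illustrative value, not the repo's actual base lr
new_lr = scaled_lr(BASE_LR, base_total_batch=8 * 4, gpus=4, batch_per_gpu=1)
print(new_lr)  # 1/8 of BASE_LR
```

With a very small effective batch (4 here), gradient noise is high, so a longer warmup or gradient accumulation can also help stabilize fine-tuning.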

hollow-503 commented 10 months ago

Sorry for the late reply. Someone else encountered the same problem when fine-tuning the LiDAR-only model, and I think it might be an issue with the downstream model settings. You could refer to the BEVFusion authors' answer in issue #296, or the TransFusion authors' answer. The drop may be due to training instability from FP16; you can try a lower lr or training in FP32.
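For the "lower lr or FP32" suggestion, a hypothetical mmdetection3d-style config fragment might look like the following. The key names follow common mmcv conventions and the values are placeholders; check the repo's actual fine-tuning config before applying anything:

```python
# Hypothetical config tweaks for unstable FP16 fine-tuning (mmcv-style keys;
# values are illustrative assumptions, not the repo's real settings).

# Option 1: keep FP16 but lower the learning rate.
optimizer = dict(type='AdamW', lr=5.0e-5, weight_decay=0.01)  # e.g. lr halved

# Option 2: fall back to FP32 by removing the fp16 setting entirely.
# fp16 = dict(loss_scale=512.0)  # delete or comment this line out

# Gradient clipping can also help tame loss spikes during fine-tuning.
optimizer_config = dict(grad_clip=dict(max_norm=35, norm_type=2))
```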

Since the fine-tuning model is modified from BEVFusion, you could also check their repo for a related discussion. I remember someone in the BEVFusion or TransFusion repo asking a similar question, but I'm sorry, I forget exactly which one it was.

hollow-503 commented 10 months ago

Closed due to inactivity. Please feel free to reopen if you feel it necessary.