mit-han-lab / bevfusion

[ICRA'23] BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation
https://bevfusion.mit.edu
Apache License 2.0

Clarification on Training of 'swint-nuimages-pretrained.pth' #625

Closed Ruvennsiow closed 2 months ago

Ruvennsiow commented 3 months ago

I am currently working on training the C+L BEVFusion model and am confused about which checkpoints are used during the process.

It appears that the training procedure involves using a combination of a lidar-only model and a pretrained camera model. Specifically, the checkpoints utilized are:

  1. Lidar-only model (lidar-only-det.pth)
  2. Pretrained camera model (swint-nuimages-pretrained.pth)

However, I noticed that the training does not combine the camera-only model with the lidar-only model, which would seem to be the logical choice for such a fusion model.
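For context, the repository's README combines these two checkpoints in a single training command, roughly as sketched below. This is a hedged sketch, not a verbatim command: the config path, GPU count, and flag spellings should be checked against the repo's README before use.

```shell
# Sketch of the C+L fine-tuning invocation (per the repo's README conventions):
# - lidar-only-det.pth initializes the whole fusion model via --load_from
# - swint-nuimages-pretrained.pth initializes only the camera (SwinT) backbone
torchpack dist-run -np 8 python tools/train.py \
  configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/convfuser.yaml \
  --model.encoders.camera.backbone.init_cfg.checkpoint \
      pretrained/swint-nuimages-pretrained.pth \
  --load_from pretrained/lidar-only-det.pth
```

Note that in this setup the camera branch is initialized from a 2D-pretrained backbone rather than a full camera-only 3D detector, which is what prompts the question above.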

Could you please provide details on how swint-nuimages-pretrained.pth was trained? Understanding the training methodology behind this pretrained camera model would greatly help me understand its role within the C+L BEVFusion model.

Thanks!

zhijian-liu commented 2 months ago

Thank you for your interest in our project. This repository is no longer actively maintained, so we will be closing this issue. Please refer to the amazing implementation at MMDetection3D. Thank you again!