ValueError: matrix contains invalid numeric entries

mit-han-lab / bevfusion

[ICRA'23] BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation

https://bevfusion.mit.edu

Apache License 2.0

ValueError: matrix contains invalid numeric entries #540

Closed 1653042420 closed 4 months ago

1653042420 commented 1 year ago

Hi! Thanks for sharing your excellent work! I am training the LiDAR branch on a single RTX 4080 with a batch size of 4. I encountered this issue when using the previously released code to train the LiDAR branch. I now know that it was caused by an incorrect learning rate. If I use the latest released code, do I still need to adjust the learning rate based on the total batch size?
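For reference, adjusting the learning rate based on the total batch size is commonly done with the linear scaling rule. The snippet below is only a minimal sketch of that rule; the function name `scale_lr` and the reference values (`REFERENCE_LR`, `REFERENCE_BATCH_SIZE`) are hypothetical placeholders and are not taken from the repository configs, so the real values should be read from the config actually being trained.

```python
def scale_lr(reference_lr: float, reference_batch_size: int,
             samples_per_gpu: int, num_gpus: int = 1) -> float:
    """Linear scaling rule: the lr is scaled in proportion to the total batch size."""
    if reference_batch_size <= 0 or samples_per_gpu <= 0 or num_gpus <= 0:
        raise ValueError("batch sizes and GPU count must be positive")
    total_batch_size = samples_per_gpu * num_gpus
    return reference_lr * total_batch_size / reference_batch_size


# Hypothetical reference setup (assumption, NOT the values from the repository).
REFERENCE_LR = 1.0e-4
REFERENCE_BATCH_SIZE = 32

# Single GPU with a batch size of 4, as in the question above.
scaled = scale_lr(REFERENCE_LR, REFERENCE_BATCH_SIZE, samples_per_gpu=4, num_gpus=1)
print(f"scaled lr: {scaled:.3e}")
```

The rule is a heuristic, so a scaled value may still need tuning, especially for very small batch sizes.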

971022jing commented 10 months ago

I have the same problem. Do you have any new progress?

nanqiang-zhangzhaoxu commented 10 months ago

I have the same problem. Do you have any new progress?

wyf0414 commented 8 months ago

I have the same problem, too. Also, when I increase max_epoch, the corresponding lr needs to be smaller, so I have to adjust the lr again and again.

gerardmartin2 commented 6 months ago

Hi, I have also adjusted the learning rate, but in the 5th epoch training starts to slow down a lot. If you have made any modifications to the lr schedule, could you share them? As my batch_size is 3 (approx. 1/10 of the original), I have changed lr (both the optimizer lr and min_lr_ratio) to 1/10 of the original. Before this change my training was stuck at epoch 2 and now it reaches epoch 5 but, as I said, it starts to go too slowly.

Thanks in advance
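One thing worth checking with the change described above: if min_lr_ratio is interpreted as a fraction of the base lr (an assumption here; please verify against the lr scheduler you actually use), then dividing both the optimizer lr and min_lr_ratio by 10 lowers the final learning rate by a factor of 100, not 10. The sketch below is a plain-Python cosine annealing formula under that assumption. The function `cosine_lr` and all numeric values are hypothetical and are not taken from the repository configs.

```python
import math


def cosine_lr(base_lr: float, min_lr_ratio: float, progress: float) -> float:
    """Cosine annealing from base_lr down to base_lr * min_lr_ratio.

    progress is the fraction of training completed, in [0, 1].
    Assumes min_lr_ratio is relative to base_lr.
    """
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must be in [0, 1]")
    target_lr = base_lr * min_lr_ratio
    return target_lr + 0.5 * (base_lr - target_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical original settings (assumption, NOT the repository values).
ORIG_LR, ORIG_RATIO = 1.0e-4, 1.0e-3

settings = {
    "original": (ORIG_LR, ORIG_RATIO),
    "lr / 10 only": (ORIG_LR / 10, ORIG_RATIO),
    "lr / 10 and ratio / 10": (ORIG_LR / 10, ORIG_RATIO / 10),
}

for name, (lr, ratio) in settings.items():
    final_lr = cosine_lr(lr, ratio, progress=1.0)
    print(f"{name:>24}: start={lr:.1e}  final={final_lr:.1e}")
```

Under this assumption, scaling only the optimizer lr already scales the whole schedule, including its floor, by the same factor.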

zyqww commented 4 months ago

@gerardmartin2 Hello, have you successfully reproduced the results with the above method? Looking forward to your reply!

gerardmartin2 commented 4 months ago

Hello @zyqww, sorry for the delay. I had something like a 1-2% performance degradation with respect to the validation results (training for only 4 epochs since, as mentioned, the training did not converge in the 5th). I am not sure that this lack of convergence is entirely due to these modifications, since I have modified some parts of the mmcv and mmdet libraries too. So maybe you can reach 6-8 epochs with these training modifications and get better results than mine.

zhijian-liu commented 4 months ago

Thank you for your interest in our project. This repository is no longer actively maintained, so we will be closing this issue. Please refer to the amazing implementation at MMDetection3D. Thank you again!