jxbbb / TOD3Cap

[ECCV 2024] TOD3Cap: Towards 3D Dense Captioning in Outdoor Scenes

Code incomplete, no LiDAR branch!? #11

Open chreisinger opened 1 month ago

chreisinger commented 1 month ago

First of all, thank you for sharing the dataset and code. However, I noticed that the provided code does not reflect the architecture published in the ECCV 2024 paper. After fixing some minor issues, tools/dist_train.sh starts the training, but the config projects/configs/bevformer/bevformer_tiny.py (as used in the start script) does not contain a LiDAR branch, only the multi-view image branch, i.e., the plain BEVFormer model. Am I overlooking something here, or is the code incomplete? If it is incomplete, could you please provide the correct config?
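For reference, a LiDAR (points) branch in an mmdet3d-style config usually looks roughly like the sketch below. This is purely a hypothetical illustration of what such a branch could look like next to the existing image branch; the module types exist in mmdet3d, but the channel sizes, voxel settings, and their combination are my assumptions, not the TOD3Cap configuration.

```python
# Hypothetical sketch of a LiDAR (points) branch in mmdet3d config style.
# Illustrative assumptions only; not taken from the paper or this repository.
point_cloud_range = [-51.2, -51.2, -5.0, 51.2, 51.2, 3.0]

model = dict(
    # ... the existing image branch (img_backbone, img_neck, pts_bbox_head)
    #     from projects/configs/bevformer/bevformer_tiny.py would stay here ...
    pts_voxel_layer=dict(
        max_num_points=10,
        voxel_size=[0.1, 0.1, 0.2],
        point_cloud_range=point_cloud_range,
        max_voxels=(90000, 120000)),
    pts_voxel_encoder=dict(type='HardSimpleVFE', num_features=5),
    pts_middle_encoder=dict(
        type='SparseEncoder',
        in_channels=5,
        sparse_shape=[41, 1024, 1024],
        output_channels=128),
    pts_backbone=dict(
        type='SECOND',
        in_channels=256,
        out_channels=[128, 256],
        layer_nums=[5, 5],
        layer_strides=[1, 2]),
    pts_neck=dict(
        type='SECONDFPN',
        in_channels=[128, 256],
        out_channels=[256, 256],
        upsample_strides=[1, 2]),
)
```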

Best regards, Christian

SixCorePeach commented 1 month ago

While trying to get the released code running, we also ran into many errors. In particular, the file 'data/nuscenes/bevcap-bevformer-trainval_infos_temporal_train.pkl' is not included. If you have found a workaround or can upload the data, that would be good news.

chreisinger commented 1 month ago

'data/nuscenes/bevcap-bevformer-trainval_infos_temporal_train.pkl' is not included

This is not an issue; just use the mmdet3d tooling to generate these files: python tools/create_data.py nuscenes --root-path ./data/nuscenes --out-dir ./data/nuscenes --extra-tag bevcap-bevformer-trainval_infos_temporal
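As a quick sanity check after generating the files, the info pickle can be inspected with a few lines of Python. This assumes the standard mmdet3d nuScenes info format (a dict with 'infos' and 'metadata' keys); the exact file name depends on the extra-tag used above.

```python
# Quick sanity check of a generated nuScenes info file.
# Assumes the standard mmdet3d format: a dict with 'infos' and 'metadata'.
import pickle

path = 'data/nuscenes/bevcap-bevformer-trainval_infos_temporal_train.pkl'
with open(path, 'rb') as f:
    data = pickle.load(f)

print(list(data.keys()))                      # expect ['infos', 'metadata']
print('num samples:', len(data['infos']))     # number of annotated key frames
print('sample keys:', sorted(data['infos'][0].keys()))  # e.g. 'lidar_path', 'cams', 'sweeps', ...
```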

hhhharold commented 1 month ago

First of all, thank you for sharing the dataset and code. However, I noticed that the provided code does not reflect the architecture published in the ECCV 2024 paper. After fixing some minor issues, tools/dist_train.sh starts the training, but the config projects/configs/bevformer/bevformer_tiny.py (as used in the start script) does not contain a LiDAR branch, only the multi-view image branch, i.e., the plain BEVFormer model. Am I overlooking something here, or is the code incomplete? If it is incomplete, could you please provide the correct config?

Best regards, Christian

I gave up on this incomplete code. I tried to replicate the paper's results using the complete BEVFusion implementation from the MMDetection3D repository combined with Llama, but the training results were terrible, and replacing Llama with Qwen did not improve anything either. A similar combination on other image-only datasets, using Llama to describe object attributes, worked well. This repository definitely deserves to be on the 'Papers Without Code' list!

chreisinger commented 3 weeks ago

@hhhharold If you are interested in exchanging thoughts and ideas on these issues, please get in touch with me. I would be interested in your BEVFusion implementation. Have you been able to account for the history features from BEVFormer (not available in BEVFusion)? They might have a significant influence on questions with temporal context.
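For context on those history features: BEVFormer caches the previous frame's BEV features and feeds them into the temporal self-attention of the current frame (after aligning them with the ego motion). A minimal sketch of that bookkeeping is given below; the class and function names (BEVHistory, encode_bev) are my own placeholders and do not correspond to code in this repository.

```python
# Minimal sketch of carrying BEV history features across frames, in the spirit
# of BEVFormer's temporal self-attention. Illustrative only.
import torch

class BEVHistory:
    """Keeps the previous frame's BEV features as a prior for the current frame."""

    def __init__(self):
        self.prev_bev = None  # (bev_h * bev_w, C) features from the previous frame, or None

    def step(self, encode_bev, imgs):
        # BEVFormer additionally warps prev_bev by the ego motion between frames
        # before using it; that alignment step is omitted here for brevity.
        bev = encode_bev(imgs, prev_bev=self.prev_bev)
        self.prev_bev = bev.detach()  # keep history without backpropagating through time
        return bev


if __name__ == '__main__':
    # Dummy encoder to show the call pattern: ignores the images and prior,
    # returns random BEV features of a fixed size.
    def dummy_encode_bev(imgs, prev_bev=None):
        return torch.randn(50 * 50, 256)

    history = BEVHistory()
    for _ in range(3):
        bev = history.step(dummy_encode_bev, imgs=None)
    print(bev.shape)  # torch.Size([2500, 256])
```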