snap-research / Panda-70M

[CVPR 2024] Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
https://snap-research.github.io/Panda-70M/

The performance for video caption seems poor #50

Open Hyu-Zhang opened 2 months ago

Hyu-Zhang commented 2 months ago

Hello, I used the code and weights you provided to run inference.py, but the results are very different from what is shown. Do you know what might be causing this?


tsaishien-chen commented 2 months ago

Hi @Hyu-Zhang, thanks for your interest in our captioning algorithm, and sorry for the inconvenience. This issue seems to be a duplicate of https://github.com/snap-research/Panda-70M/issues/12. I hypothesize that this happens because you are using a different tokenizer or LLM. Did you follow this guideline to prepare the vicuna-7b-v0 weights? Basically, you need to first download the original weights and then apply the delta weights. Could you please check that? You can also check some issues (like this one) in the FastChat repo for reference!
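For readers unsure what "apply the delta weights" means, here is a minimal conceptual sketch. It assumes the delta is a plain element-wise difference from the original weights, so the usable weights are recovered by adding the two together. The `apply_delta` function and the toy parameter names below are hypothetical and only illustrate the idea; the real conversion should be done with the tooling described in the guideline, not with this snippet.

```python
# Illustrative only: toy "checkpoints" stored as dicts of name -> list of floats.
# Real checkpoints hold large tensors; the names here are made up.

def apply_delta(base, delta):
    """Return target weights where target[name] = base[name] + delta[name]."""
    if base.keys() != delta.keys():
        raise ValueError("base and delta must contain the same parameter names")
    target = {}
    for name, base_values in base.items():
        delta_values = delta[name]
        if len(base_values) != len(delta_values):
            raise ValueError(f"shape mismatch for {name}")
        # Element-wise addition recovers the fine-tuned weights.
        target[name] = [b + d for b, d in zip(base_values, delta_values)]
    return target


# Hypothetical original weights and the released delta.
base = {"layer.weight": [1.0, 2.0], "layer.bias": [0.5]}
delta = {"layer.weight": [0.25, -0.5], "layer.bias": [0.0]}

target = apply_delta(base, delta)
print(target)
```

Under this assumption, loading either the original weights or the delta file alone gives the captioning model an LLM it was not trained with, which would be consistent with the poor captions reported above.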