imagegridworth / IG-VLM

Unable to Reproduce the reported numbers #3

Closed · klauscc closed this issue 7 months ago

klauscc commented 7 months ago

Hi authors,

Thanks for the great work! However, I cannot reproduce the numbers reported in the paper using your code. I am using the LLaVA-1.6-vicuna-7B model:

| Open-ended QA | MSVD-QA | MSRVTT-QA | ActivityNet-QA | TGIF-QA |
| --- | --- | --- | --- | --- |
| Reported | 78.8 | 63.7 | 54.3 | 73.0 |
| Reproduced | 74.0 | 60.1 | 48.5 | 68.5 |

| Multiple-choice QA | NExT-QA | Intent-QA | EgoSchema |
| --- | --- | --- | --- |
| Reported | 63.1 | 60.3 | 35.8 |
| Reproduced | 49.2 | 45.4 | 24.2 |

The multiple-choice QA results are not evaluated using ChatGPT.

Could you re-run your experiments and see whether the reported numbers are reproducible? Thanks!

imagegridworth commented 7 months ago

Thank you for your interest.

First of all, we've run extensive validation on our side, and the results are reproducible in our setup. However, we have seen discrepancies like yours on specific GPU machines. We suspect this is due to GPU computation characteristics or computation-order settings. It's possible that the image-benchmark evaluation of LLaVA v1.6 itself is not fully reproducible (we haven't confirmed this, so please take it with a grain of salt), in which case IG-VLM would also be affected.

Here are some suggestions we can propose:

  1. Try using a different GPU, ensure it meets the requirements we provided, and rerun the experiments.
  2. The model available at https://huggingface.co/llava-hf/llava-v1.6-vicuna-7b-hf seems to be more robust to the GPU computation issues mentioned above (see the loading sketch after this list).
  3. The results of the GPT-4V experiment are likely to be reproducible.
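
For reference, a minimal sketch of loading that checkpoint with Hugging Face `transformers` (the `LlavaNext` classes exist in recent `transformers` releases; the prompt format and the grid-image file name are illustrative assumptions, not our exact pipeline):

```python
# Minimal sketch: load llava-hf/llava-v1.6-vicuna-7b-hf via transformers.
# Assumes a recent transformers release that ships the LlavaNext classes.
import torch
from PIL import Image
from transformers import LlavaNextForConditionalGeneration, LlavaNextProcessor

model_id = "llava-hf/llava-v1.6-vicuna-7b-hf"
processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# IG-VLM composes sampled video frames into a single image grid;
# "grid.png" is a hypothetical stand-in for that composed grid.
image = Image.open("grid.png")
prompt = "USER: <image>\nWhat is happening in the video? ASSISTANT:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```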

For now, multiple-choice evaluation doesn't require GPT-3-based evaluation in our work.
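
Concretely, multiple-choice accuracy can be computed by matching the predicted option letter directly, without an LLM judge. A minimal sketch, assuming predictions are stored as JSON with hypothetical `pred`/`answer` fields:

```python
# Minimal sketch: multiple-choice accuracy by direct option matching,
# with no GPT-based judge. File name and field names are hypothetical.
import json
import re

def first_option_letter(text: str) -> str:
    """Extract a leading option letter from a response like '(B) ...' or 'B.'."""
    m = re.match(r"\(?([A-E])\b", text.strip().upper())
    return m.group(1) if m else ""

with open("predictions.json") as f:
    records = json.load(f)  # e.g. [{"pred": "(B) a person ...", "answer": "B"}, ...]

correct = sum(first_option_letter(r["pred"]) == r["answer"].strip().upper()
              for r in records)
print(f"accuracy: {correct / len(records):.3f}")
```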

Thank you.

klauscc commented 7 months ago

Thanks for your quick responses!

  1. I tried other GPUs and still got similar results; all of them are substantially lower than the reported numbers.
  2. Inference hangs if I load llava-hf/llava-v1.6-vicuna-7b-hf instead of liuhaotian/llava-v1.6-vicuna-7B, the model hardcoded in your code.
  3. I ran GPT-4V on the EgoSchema subset (500 questions) and only got 55.4% accuracy, which is 4.4% lower than your reported number (59.8%).

For the GPT-4V inference, the only possible difference is the videos. Can you confirm that we are using the same videos? I downloaded them from the Google Drive shared by EgoSchema.

For your reference, I also computed the MD5 checksums of a few files with `md5sum`:

```
$ md5sum 00aa7247-9a4b-4667-8588-37df29d40fe8.mp4
d9c3f9b28ac8381c9caf16ea2008060b  00aa7247-9a4b-4667-8588-37df29d40fe8.mp4
$ md5sum 1459809c-033a-4657-86a1-d106b6718d5f.mp4
2c38a72789f504eb49ea75f9ce6affcf  1459809c-033a-4657-86a1-d106b6718d5f.mp4
```
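
If it helps, here is a small sketch that hashes every video in a directory, so we can diff the full checksum lists (the directory path is hypothetical):

```python
# Minimal sketch: compute the md5 of every .mp4 in a directory so that
# full checksum lists can be compared. The directory path is hypothetical.
import hashlib
from pathlib import Path

video_dir = Path("egoschema_videos")
for path in sorted(video_dir.glob("*.mp4")):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    print(h.hexdigest(), path.name)
```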

Could you double check if your reported results are reproducible? Thanks!

imagegridworth commented 7 months ago

Thank you for your check and interest.

We attempted the reproduction again and updated the code to ensure reproducibility. We have double-checked with the updated code on a couple of machines; if you rerun with it, you should be able to reproduce the results.

Regarding GPT-4V, you might encounter errors from the GPT-4V API; we did not hit such errors in our reproductions. Please check for them, and if a request does fail, it should simply be retried.

Thank you!

klauscc commented 7 months ago

Hi authors,

Thanks for your quick response! I re-evaluated GPT-4V on EgoSchema twice using the newly updated code and obtained accuracies of 57.2% and 59.0%. The second run is close to your reported number (59.8%), but the variance seems large. It would be great if you could evaluate your method multiple times and report the mean and standard deviation.
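
For example, a minimal sketch of the aggregation I have in mind, seeded with the two rerun accuracies above:

```python
# Minimal sketch: report mean and sample standard deviation over repeated
# evaluation runs. The two values are the reruns above; more would be appended.
from statistics import mean, stdev

accuracies = [57.2, 59.0]
print(f"mean = {mean(accuracies):.1f}, std = {stdev(accuracies):.1f}")
```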

I will evaluate LLaVA later.

Thanks again!

imagegridworth commented 7 months ago

Thank you for the quick check and feedback!

We're getting an accuracy of 59.8% in our experiments, which is solid. There is some variance, of course, but not as large as you describe; in fact, the model sometimes performs even better. I'll double-check this aspect again.

As I mentioned, you might get responses like "I'm sorry, but the image provided does not contain enough information to ..." from the GPT-4V API (your prediction file might include them), and they can lower your reproduced accuracy. It's fine to request those samples again; in our experiments we did not receive such responses. Please check for this.
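
For instance, a minimal sketch that flags such refusals in a prediction file so the affected samples can be re-requested (file name, field names, and the marker list are hypothetical):

```python
# Minimal sketch: flag GPT-4V refusal responses so the affected samples can
# be re-requested. File name, field names, and markers are hypothetical.
import json

REFUSAL_MARKERS = (
    "i'm sorry, but the image provided does not contain enough information",
    "i cannot determine",
)

with open("gpt4v_predictions.json") as f:
    predictions = json.load(f)  # e.g. [{"id": "...", "response": "..."}, ...]

retry_ids = [p["id"] for p in predictions
             if any(m in p["response"].lower() for m in REFUSAL_MARKERS)]
print(f"{len(retry_ids)} samples to re-request:", retry_ids)
```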

Thank you!