Closed: klauscc closed this issue 7 months ago.
Thank you for your interest.
First of all, we've conducted extensive validation, and the results appear to be reproducible on our side. However, we have seen cases where such discrepancies occur on specific GPU machines. We suspect this is due to GPU computation characteristics or computation-order settings. It is possible that the image benchmark evaluation of LLaVA v1.6 itself is not fully reproducible (we have not confirmed this, so please take it into consideration), in which case IG-VLM may also be affected.
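For reference, here is a minimal sketch of the generic PyTorch settings we would check first when chasing GPU-dependent nondeterminism. These are standard PyTorch knobs, not code from our repository, and the exact behavior can still vary by CUDA/cuDNN version.

```python
# Minimal sketch: common settings to reduce run-to-run GPU nondeterminism in PyTorch.
# Generic PyTorch knobs, not code from the IG-VLM repository.
import random

import numpy as np
import torch


def seed_everything(seed: int = 42) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Prefer deterministic cuDNN kernels (may slow inference down).
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


seed_everything()
```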
Here are some suggestions we can propose:
For now, multiple-choice evaluation in our work does not require GPT-3-based evaluation; accuracy can be computed by direct option matching, as in the sketch below.
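A minimal sketch of such direct option matching follows. The file format and field names (`pred`, `answer`, `predictions.json`) are hypothetical, not the exact layout used in our repository.

```python
# Minimal sketch: score multiple-choice predictions by direct option matching,
# without any GPT-based judging. Field names and file layout are hypothetical.
import json


def multiple_choice_accuracy(prediction_file: str) -> float:
    with open(prediction_file) as f:
        records = json.load(f)  # e.g. [{"pred": "C", "answer": "C"}, ...]
    correct = sum(
        1 for r in records
        if r["pred"].strip().upper() == r["answer"].strip().upper()
    )
    return correct / len(records)


print(f"accuracy: {multiple_choice_accuracy('predictions.json'):.1%}")
```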
Thank you.
Thanks for your quick responses!
For LLaVA, I used `llava-hf/llava-v1.6-vicuna-7b-hf` instead of the one hardcoded in your code, `liuhaotian/llava-v1.6-vicuna-7B`. For the GPT-4V inference, the only difference is the videos. Can you confirm that we are using the same videos? I downloaded the videos from the Google Drive shared by EgoSchema.
For your reference, I also calculated the MD5 checksums of a few files using md5sum:
```
$ md5sum 00aa7247-9a4b-4667-8588-37df29d40fe8.mp4
d9c3f9b28ac8381c9caf16ea2008060b  00aa7247-9a4b-4667-8588-37df29d40fe8.mp4
$ md5sum 1459809c-033a-4657-86a1-d106b6718d5f.mp4
2c38a72789f504eb49ea75f9ce6affcf  1459809c-033a-4657-86a1-d106b6718d5f.mp4
```
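In case it is useful, here is a minimal sketch (my own helper, not from your repository) of the equivalent check in Python, which I can run over the whole download directory; the two file names are the EgoSchema clips listed above.

```python
# Minimal sketch: compute MD5 checksums of downloaded EgoSchema clips
# to check that we are evaluating the same video files.
import hashlib
from pathlib import Path


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    digest = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


for name in ["00aa7247-9a4b-4667-8588-37df29d40fe8.mp4",
             "1459809c-033a-4657-86a1-d106b6718d5f.mp4"]:
    print(md5_of(Path(name)), name)
```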
Could you double check if your reported results are reproducible? Thanks!
Thank you for your check and interest.
We attempted the reproduction again and updated the code to ensure reproducibility. We have double-checked with the updated code on a couple of machines, so if you try again with it, you should be able to reproduce the results.
Regarding GPT-4V, you might encounter errors while using the GPT-4V API. We did not receive such errors in our reproductions. Please check this point, and if you do encounter an error, please re-request that portion.
Thank you!
Hi authors, thanks for your quick response! I re-evaluated GPT-4V on EgoSchema twice using the newly updated code and obtained accuracies of 57.2% and 59.0%. The second run is close to your reported number (59.8%), but the variance seems large. It would be great if you could evaluate your method multiple times and report the mean and standard deviation, e.g. as in the sketch below.
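A minimal sketch of what I have in mind, using only the two run accuracies I obtained above as example inputs:

```python
# Minimal sketch: report mean and standard deviation over repeated evaluation runs.
import statistics

accuracies = [57.2, 59.0]  # EgoSchema GPT-4V accuracies from my two runs above
mean = statistics.mean(accuracies)
std = statistics.stdev(accuracies)  # sample std; needs at least two runs
print(f"accuracy: {mean:.1f} ± {std:.1f}")
```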
I will evaluate LLaVA later.
Thanks again!
Thank you for the quick check and feedback!
We obtain 59.8% accuracy in our experiments. There is some variance, of course, but it is not as large as you mention; in fact, it sometimes performs even better. I'll double-check this aspect again.
As I mentioned, you might get responses like "I'm sorry, but the image provided does not contain enough information to ..." from the GPT-4V API (your prediction file might include them), which can reduce your reproduction accuracy. It is fine to request those samples again. We did not get such responses in our experiments, so please check for them.
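As an illustration, a minimal retry sketch along these lines is shown below; `call_gpt4v` stands in for whatever function issues the actual API request, and the refusal markers are just example strings, not our exact code.

```python
# Minimal sketch: re-request GPT-4V samples whose responses are refusals such as
# "I'm sorry, but the image provided does not contain enough information to ...".
# `call_gpt4v` is a placeholder for the function that issues the real API request.
import time

REFUSAL_MARKERS = ["I'm sorry", "does not contain enough information"]


def request_with_retry(call_gpt4v, prompt, frames, max_retries: int = 3) -> str:
    answer = call_gpt4v(prompt, frames)
    for _ in range(max_retries):
        if not any(marker in answer for marker in REFUSAL_MARKERS):
            break
        time.sleep(1)  # brief pause before re-requesting the same sample
        answer = call_gpt4v(prompt, frames)
    return answer
```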
Thank you!
Hi authors,
Thanks for the great work! However, I cannot reproduce the numbers reported in the paper using your code. I used the LLaVA-1.6-vicuna-7B model.
The multiple-choice QA is not evaluated using ChatGPT.
Could you re-run your experiments and see whether the reported numbers are reproducible? Thanks!