shirley-wu / vdebugger

[EMNLP2024 Findings] VDebugger: Harnessing Execution Feedback for Debugging Visual Programs
https://shirley-wu.github.io/vdebugger/
Apache License 2.0

Inquiry Regarding Discrepancy in Final Accuracy Results #1

Closed aidialogue closed 1 month ago

aidialogue commented 1 month ago

Hello! While executing the commands:

    CONFIG_NAMES=execute/gqa python main_batch_execute.py
    CONFIG_NAMES=execute/gqa python main_batch_trace.py A_RANDOM_STAMP

I noticed that the final accuracy obtained from the second command is consistently 0.05 lower than that from the first command. Could you please help me understand why there might be a difference in the results between these two commands? Thank you for your time and assistance.

shirley-wu commented 1 month ago

Hello, yes, that is expected. In main_batch_execute.py we follow the original ViperGPT to compute accuracy, including a few techniques that boost the accuracy a bit. Mainly, if the generated code raises an error, we try again with a fixed fallback code (https://github.com/shirley-wu/vdebugger/blob/main/viper/main_batch_exec.py#L63), which simply invokes the QA model itself (https://github.com/shirley-wu/vdebugger/blob/main/viper/prompts/fixed_code/blip2.prompt). However, in such cases the execution traces would not make sense, so we do not include these techniques in main_batch_trace.py.

In principle, you should use main_batch_trace.py only to obtain traces, and always use main_batch_execute.py to compute accuracy, so that you get a comparable number.
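To illustrate why the two scripts can report different accuracy, here is a minimal toy sketch of the idea described above. It is not the repository's code: the function names (`run_with_fallback`, `run_without_fallback`, `accuracy`), the stub programs, and the stub fallback are all hypothetical, and the real fallback calls a QA model rather than returning a constant.

```python
from typing import Callable, List, Optional


def run_with_fallback(program: Callable[[], str], fallback: Callable[[], str]) -> str:
    """Execute-style run (hypothetical): if the program raises, use the fallback answer."""
    try:
        return program()
    except Exception:
        # Stand-in for "retry with a fixed code that just asks the QA model".
        return fallback()


def run_without_fallback(program: Callable[[], str]) -> Optional[str]:
    """Trace-style run (hypothetical): a crashing program yields no answer."""
    try:
        return program()
    except Exception:
        return None


def accuracy(predictions: List[Optional[str]], labels: List[str]) -> float:
    """Fraction of predictions that exactly match their label."""
    return sum(p == l for p, l in zip(predictions, labels)) / len(labels)


def crashing_program() -> str:
    raise RuntimeError("generated code failed")


# Toy data (assumed): four generated programs, one of which crashes.
programs = [lambda: "yes", lambda: "red", crashing_program, lambda: "2"]
fallbacks = [lambda: "yes", lambda: "blue", lambda: "dog", lambda: "2"]
labels = ["yes", "blue", "dog", "2"]

acc_execute = accuracy(
    [run_with_fallback(p, f) for p, f in zip(programs, fallbacks)], labels
)
acc_trace = accuracy([run_without_fallback(p) for p in programs], labels)

print(f"with fallback:    {acc_execute:.2f}")
print(f"without fallback: {acc_trace:.2f}")
```

In this toy setup the crashing program is rescued by the fallback in the first mode and counted as wrong in the second, so the first accuracy is higher even though the generated programs are identical.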