Inquiry Regarding Discrepancy in Final Accuracy Results

shirley-wu / vdebugger

[EMNLP2024 Findings] VDebugger: Harnessing Execution Feedback for Debugging Visual Programs

Apache License 2.0

Hello, yes, that is expected. In main_batch_execute.py we follow the original ViperGPT to compute accuracy, including a few techniques that boost the accuracy slightly. Mainly, if the generated code raises an error, we try again with a fixed fallback code (https://github.com/shirley-wu/vdebugger/blob/main/viper/main_batch_exec.py#L63), which simply invokes the QA model itself (https://github.com/shirley-wu/vdebugger/blob/main/viper/prompts/fixed_code/blip2.prompt). In such cases, however, the execution traces would not make sense, so we do not include these techniques in main_batch_trace.py.

In principle, you should use main_batch_trace.py only to obtain traces, and always use main_batch_execute.py to compute accuracy, so that your numbers are comparable.
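
To illustrate why the two scripts can report different numbers, here is a minimal sketch of the retry-on-error idea described above. It is not the repository's code: `run_strict`, `run_with_fallback`, `accuracy`, and the toy programs are hypothetical names made up for this example, and it assumes that a crashed program without a fallback is simply scored as wrong.

```python
from typing import Any, Callable, List, Optional, Tuple

# Hypothetical signature for a generated visual program: (image, query) -> answer.
Program = Callable[[Any, str], str]


def run_strict(program: Program, image: Any, query: str) -> Optional[str]:
    """Run without any fallback; a crash yields no answer (assumed scored as wrong)."""
    try:
        return program(image, query)
    except Exception:
        return None


def run_with_fallback(
    program: Program, fallback: Program, image: Any, query: str
) -> Tuple[str, bool]:
    """Run the program; if it raises, answer with the fallback instead.

    Returns (answer, used_fallback). When used_fallback is True, the answer did
    not come from the generated program, so a trace of that program would not
    explain it.
    """
    try:
        return program(image, query), False
    except Exception:
        return fallback(image, query), True


def accuracy(preds: List[Optional[str]], labels: List[str]) -> float:
    """Fraction of predictions that exactly match the labels."""
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)


# Toy stand-ins (assumptions for illustration only).
def good_program(image: Any, query: str) -> str:
    return "cat"


def broken_program(image: Any, query: str) -> str:
    raise RuntimeError("generated code crashed")


def toy_qa_model(image: Any, query: str) -> str:
    # Stands in for directly asking the QA model the original question.
    return "dog"


programs = [good_program, broken_program]
labels = ["cat", "dog"]

strict_preds = [run_strict(p, None, "what animal?") for p in programs]
fallback_preds = [
    run_with_fallback(p, toy_qa_model, None, "what animal?")[0] for p in programs
]

strict_acc = accuracy(strict_preds, labels)
fallback_acc = accuracy(fallback_preds, labels)
print(strict_acc, fallback_acc)
```

In this toy setup the fallback rescues the crashed sample, so the accuracy with fallback is higher than the accuracy without it, which is the kind of gap you would see between the two scripts.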

Inquiry Regarding Discrepancy in Final Accuracy Results #1