The gsm-hard dataset contains some issues (negative number targets)

reasoning-machines / pal

PaL: Program-Aided Language Models (ICML 2023)

Apache License 2.0

Hi @Madd0g, thank you for your interest in our work!

Your observation is correct. Because the GSM-Hard benchmark was created automatically, it may contain negative target values or "unnatural" positive values. Unfortunately, we do not have the resources to manually annotate all examples, so we assume that every model and prompting approach evaluated on this benchmark incurs a 5%-10% drop in performance. Since this penalty is similar across approaches, we believe that the relative comparison between approaches is the right thing to measure.

Best, Uri
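
For readers who want to gauge how many examples are affected, here is a minimal sketch of separating out negative targets before scoring. It assumes each record is a dict with a numeric `target` field; the field name, the helper `split_by_target_sign`, and the toy records are illustrative assumptions, not part of the PaL codebase or the actual GSM-Hard data.

```python
def split_by_target_sign(examples, target_key="target"):
    """Split records into (kept, flagged), flagging negative targets.

    `target_key` is an assumed field name; adjust it to the actual schema.
    """
    kept, flagged = [], []
    for example in examples:
        # Treat the target as a number so string-encoded values also work.
        if float(example[target_key]) < 0:
            flagged.append(example)
        else:
            kept.append(example)
    return kept, flagged


# Toy records for illustration only; these are NOT real GSM-Hard examples.
examples = [
    {"input": "toy question A", "target": 12.0},
    {"input": "toy question B", "target": -3507.5},
    {"input": "toy question C", "target": 8040216.0},
    {"input": "toy question D", "target": "-1"},
]

kept, flagged = split_by_target_sign(examples)
flagged_share = len(flagged) / len(examples)
print(f"kept={len(kept)} flagged={len(flagged)} share={flagged_share:.0%}")
```

This only catches negative targets. The "unnatural" positive values mentioned above have no simple automatic test, so scores on the filtered subset would still carry some of the penalty.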

The gsm-hard dataset contains some issues (negative number targets) #12