Great job.
I have some questions for the authors.
I ran the code with GPT-4 as the program generator (N=1, gold), using the same parameter settings, but the macro-F1 on FEVEROUS is lower than the text-davinci-003 result presented in the GitHub repository:
FEVEROUS with GPT-4: 91.05
FEVEROUS with text-davinci-003: 92.32 (presented in the GitHub repository)
This result is confusing, since I expected GPT-4 to score at least as well.
Were the results reported in the paper and in the GitHub repository computed on the full dataset or on a partially sampled subset?