Results with GPT-4 are lower than the results presented in the paper?

Great job. I have some questions for the authors.

I ran the code with GPT-4 as the program generator (N=1, gold) using the same parameter settings, but the macro-F1 on FEVEROUS is lower than the text-davinci-003 result reported on GitHub: FEVEROUS with GPT-4 gives 91.05, versus 92.32 for FEVEROUS with text-davinci-003 (as reported on GitHub). I find this result confusing.
I would like to know whether the results reported in the paper and on GitHub were computed on the full dataset or on a partially sampled subset.

teacherpeterpan / ProgramFC