GSM8K performance difference issue

madaan / self-refine

LLMs can generate feedback on their work, use it to improve the output, and repeat this process iteratively.

https://selfrefine.info

Apache License 2.0

GSM8K performance difference issue #11

Closed. allanj closed this issue 1 year ago.

allanj commented 1 year ago

In the appendix, the original PAL with ChatGPT is around 74%.

But why is the initial accuracy only 71% in self-refine? I was expecting the initial accuracy to be the same.

madaan commented 1 year ago

Thanks for pointing this out. The results in Figure 14 use code-davinci-002 (Codex), and they match the numbers reported in PaL (72%). We will clarify this in the next update.