madaan / self-refine

LLMs can generate feedback on their work, use it to improve the output, and repeat this process iteratively.
https://selfrefine.info
Apache License 2.0

GSM8K performance difference issue #11

Closed allanj closed 1 year ago

allanj commented 1 year ago

In the appendix, the accuracy of the original PaL with ChatGPT is around 74%.


But why is the initial accuracy only 71% in Self-Refine? I expected the initial accuracy to be the same.

madaan commented 1 year ago

Thanks for pointing this out. The results in Figure 14 use code-davinci-002 (Codex), and they match the number reported in PaL (72%). We will clarify this in the next update.