james016 opened 1 year ago
Hello Su,
Thank you for your interest in our work and for highlighting the testing issue we overlooked.
Your analysis is indeed valid and actually seems closely related to our observations! We noticed similar improvements when using the `$` symbol as a separator (denoted `plain2` in the repo), which is discussed in Appendix Figure 24 of our paper.
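For context for other readers: the `plain` format writes each training example as a bare equation like `128+367=495`, while the separator variant wraps it in `$` characters, e.g. `$128+367=495$` (the exact delimiter placement here is my assumption based on the paper's description).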
We appreciate you bringing this to our attention. If you want to implement that as "plain3" (or something else), feel free to submit a PR. We will merge it and include the results in a revision.
Best regards, Nayoung
Hello Nayoung,
Thank you for your kind and encouraging response. I'm glad my observations align with your own findings. I'll make the necessary code modifications and submit a PR soon for what could be termed "plain3".
Looking forward to further collaboration.
Best regards, Su
Hello there,
First off, thank you for the amazing work on "Teaching Arithmetic to Small Transformers" and for sharing the code. I'm a researcher who's been exploring your methods, and I find them quite enlightening.
However, while running your baseline tests and doing some bad-case analysis, I believe I've come across a potential bug that might affect the results presented in your paper.
The Main Issue:
My primary observation is that the lower accuracy of the plain baseline can be attributed to the prompt formatting used during the testing phase. After a small adjustment, the accuracy jumps from 87.27% to 95.58% without retraining the model. I suspect that if the model were retrained and the best-performing checkpoint on the validation set were selected, the accuracy could go up to around 97%, which is comparable to the `plain2` model's results.
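For concreteness, here is a minimal sketch of the adjustment, assuming a nanoGPT-style evaluation loop; `model`, `encode`, and `decode` are illustrative stand-ins rather than the repo's actual names, and the only substantive change is the `\n` prefix on the prompt:

```python
import torch

def check_addition(model, encode, decode, a, b, device="cpu"):
    # Prepend "\n" so the prompt can only match the *start* of a training
    # equation, never the tail of a longer one such as "21+234=255".
    prompt = f"\n{a}+{b}="
    idx = torch.tensor(encode(prompt), dtype=torch.long, device=device)[None, :]
    out = model.generate(idx, max_new_tokens=5)
    # Keep the generated digits up to the next newline.
    completion = decode(out[0].tolist())[len(prompt):].split("\n")[0]
    return completion == str(a + b)
```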
Explanation:
The change is quite simple: just prepend a newline character (`\n`) to the existing prompt during testing. The discovery came after noticing that, even on the training dataset, the plain method could only achieve around 90% accuracy, which seemed odd to me.

Upon further analysis, I found that the issue mostly occurs with arithmetic tasks of the form `A2A1+C3C2C1=` or `A1+C3C2C1=`, where GPT, being a next-token predictor, can sometimes match the input to an incorrect but similar-looking equation from the training data. For example, if the test prompt is `1+234=` (whose correct completion is `235`) and the training dataset contains `\n21+234=255`, the model may incorrectly produce `1+234=255`.

Adding a newline character at the beginning, as in `\n1+234=`, prevents this issue: the model cannot match `\n1+234=` against `\n21+234=255`, thereby substantially improving accuracy. I hope this observation is useful, and I would love to know your thoughts on it.
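As a toy illustration of the ambiguity (the data and helper name below are made up for this sketch, not taken from the repo), one can count how many bare test prompts also occur as the tail of a longer training equation:

```python
import re

def count_suffix_collisions(train_text, test_prompts):
    """Count prompts such as '1+234=' that appear in the training text
    immediately preceded by another digit (e.g. inside '21+234=255').
    Such prompts are ambiguous to a next-token predictor; prepending
    '\\n' to the test prompt rules these matches out."""
    collisions = 0
    for prompt in test_prompts:
        # A preceding digit means the prompt is the suffix of a longer operand.
        if re.search(r"\d" + re.escape(prompt), train_text):
            collisions += 1
    return collisions

# Made-up training text: equations separated by newlines.
train_text = "\n21+234=255\n7+8=15\n"
print(count_suffix_collisions(train_text, ["1+234=", "7+8="]))  # -> 1
```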
Best regards, Su