lee-ny / teaching_arithmetic

MIT License
71 stars 19 forks

Potential Bug: Improvement in 3-Digit Addition Baseline by Adjusting Prompt Formatting #1

Open james016 opened 11 months ago

james016 commented 11 months ago

Hello there,

First off, thank you for the amazing work on "Teaching Arithmetic to Small Transformers" and for sharing the code. I'm a researcher who's been exploring your methods, and I find them quite enlightening.

However, while trying out your baseline tests and doing some failure-case analysis, I believe I've come across a potential bug that might affect the results presented in your paper.

The Main Issue:

My primary observation is that the lower accuracy rate for the plain baseline could be attributed to the prompt formatting used during the testing phase. After a small adjustment, the accuracy jumps from 87.27% to 95.58% without retraining the model. I suspect that if the model is retrained and the best-performing model on the validation set is selected, the accuracy could go up to around 97%, which is comparable to the 'plain2' model's results.

Explanation:

The change is quite simple: just prepend a newline (\n) character to the existing prompt during testing. The discovery came after noticing that even on the training dataset, the plain method could only achieve around 90% accuracy, which seemed odd to me.

Upon further analysis, I found that the issue mostly occurs with arithmetic tasks of the form A2A1+C3C2C1= or A1+C3C2C1=, where GPT, being a next-token predictor, can sometimes match the input to an incorrect but similar-looking arithmetic equation from the training data. For example, if the test prompt is 1+234= (whose correct answer is 235) and the training dataset contains \n21+234=255, the model may incorrectly produce 1+234=255.
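The collision is easy to see with a plain substring check (a minimal illustration, not the repo's actual evaluation code):

```python
# Without a leading newline, the test prompt is a literal suffix of a
# different training equation, so the model can latch onto the wrong one.
train_line = "\n21+234=255"
test_prompt = "1+234="

print(test_prompt in train_line)           # True: the bare prompt collides
print(("\n" + test_prompt) in train_line)  # False: the newline disambiguates
```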

Adding a newline character at the beginning, as in \n1+234=, prevents this issue. The model cannot match \n1+234= with \n21+234=255, thereby substantially improving accuracy.
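Concretely, the adjustment can be sketched as follows (the function name and signature are illustrative, not the repo's actual API):

```python
def make_test_prompt(a: int, b: int, prepend_newline: bool = True) -> str:
    """Build a plain-format addition prompt, e.g. '\n1+234='.

    Prepending '\n' marks the start of an example, so the model cannot
    treat the prompt as the tail of a longer training equation.
    """
    prompt = f"{a}+{b}="
    return ("\n" + prompt) if prepend_newline else prompt


# The old behavior is recovered with prepend_newline=False.
print(repr(make_test_prompt(1, 234)))        # '\n1+234='
print(repr(make_test_prompt(1, 234, False))) # '1+234='
```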

I hope this observation is useful and I would love to know your thoughts on it.

Best regards, Su

lee-ny commented 11 months ago

Hello Su,

Thank you for your interest in our work and for highlighting the testing issue we overlooked. Your analysis is indeed valid and actually seems closely related to our observations! We noticed similar improvements using the $ symbol as a separator (denoted as plain2 in the repo), which is discussed in Appendix Figure 24 of our paper. We appreciate you bringing this to our attention. If you want to implement that as "plain3" (or something else), feel free to submit a PR. We will merge it and include the results in a revision.

Best regards, Nayoung

james016 commented 11 months ago

Hello Nayoung,

Thank you for your kind and encouraging response. I'm glad to hear that my observations align with your own findings. I'll make the necessary code modifications and submit a PR soon for what could be termed "plain3."

Looking forward to further collaboration.

Best regards, Su