Dear authors, thank you for the amazing work and sharing your code and data!
I wanted to ask about your evaluation code: currently, if the model outputs an answer with a decimal point, the answer is automatically rounded to the nearest integer.
As a result, a wrong answer (e.g. 8.5) can be counted as correct (e.g. after being rounded to 9) despite a calculation error, which does occur often in some model generations.
Given this, I believe stricter evaluation code may be needed.
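
To illustrate what I mean, here is a minimal sketch of a stricter check. It is not based on your actual evaluation code: the function names `parse_number` and `is_correct_strict` are hypothetical, and I am assuming that the predicted and reference answers are available as strings. The idea is simply to compare the parsed values exactly instead of rounding the prediction first.

```python
from decimal import Decimal, InvalidOperation


def parse_number(text):
    """Parse an answer string exactly; return None if it is not a finite number."""
    # Assumption: commas are thousands separators, e.g. "1,000".
    cleaned = text.strip().replace(",", "")
    try:
        value = Decimal(cleaned)
    except InvalidOperation:
        return None
    # Reject NaN and infinities, which Decimal would otherwise accept.
    return value if value.is_finite() else None


def is_correct_strict(prediction, reference):
    """Return True only if both answers parse and are numerically equal.

    No rounding is applied, so "8.5" does not match "9",
    while "9.0" still matches "9".
    """
    pred = parse_number(prediction)
    ref = parse_number(reference)
    if pred is None or ref is None:
        return False
    return pred == ref
```

With a check like this, `is_correct_strict("8.5", "9")` is false, while `is_correct_strict("9.0", "9")` is still true, so integer answers written with a trailing `.0` are not penalized.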