In the Minerva and LLeMMa papers, sympy is used to check whether predicted and gold answers are equivalent, e.g. so that $1/\sqrt{3}$ and $\sqrt{3}/3$ are treated as the same answer. From the Minerva paper:
After applying this normalization function, we checked whether the formatted target and prediction strings are SymPy-equivalent. SymPy equivalence is determined by parsing the answers via sympy.parsing.latex.parse_latex and then checking whether subtracting the two resulting SymPy objects and applying sympy.simplify gives zero. We set a timeout of 5s when calling sympy.simplify, and labeled strings as nonequivalent if this timeout was exceeded.
For MATH problems, SymPy equivalence improved overall accuracy by around 1%. See Table 6 for the accuracies in MATH with only exact string match vs. SymPy equivalence.
Although the gain from SymPy equivalence was only around 1% for the Minerva & OpenAI models, would it make sense to add sympy to the MATH metric, both for correctness and to potentially uncover larger variation among open models?
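
For concreteness, here is a rough sketch of the check described in the Minerva quote above (parse, subtract, simplify, 5s timeout). This is not the Minerva or LLeMMa code. The helper names `time_limit`, `exprs_equiv` and `latex_equiv` are made up for illustration, the `SIGALRM`-based timeout is just one way to do it (Unix, main thread only), and treating parse failures as non-equivalent is my assumption. It also skips the normalization step the paper applies first. Note that `parse_latex` needs the `antlr4-python3-runtime` package installed.

```python
import signal
from contextlib import contextmanager

import sympy


@contextmanager
def time_limit(seconds):
    """Raise TimeoutError if the body runs longer than `seconds`.

    Assumption: Unix only (uses SIGALRM). Outside the main thread the
    signal cannot be installed, so the body runs without a limit.
    """
    def _handler(signum, frame):
        raise TimeoutError

    try:
        old = signal.signal(signal.SIGALRM, _handler)
    except ValueError:  # not in the main thread: no timeout available
        yield
        return
    signal.alarm(seconds)
    try:
        yield
    finally:
        signal.alarm(0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, old)


def exprs_equiv(pred, gold, timeout_s=5):
    """True if simplify(pred - gold) is zero; False on timeout or error."""
    try:
        with time_limit(timeout_s):
            return sympy.simplify(pred - gold) == 0
    except Exception:
        return False


def latex_equiv(pred_tex, gold_tex, timeout_s=5):
    """Parse two LaTeX answer strings and compare them with exprs_equiv."""
    # Imported lazily so the rest of the sketch works without antlr4.
    from sympy.parsing.latex import parse_latex

    try:
        pred, gold = parse_latex(pred_tex), parse_latex(gold_tex)
    except ImportError:
        raise  # missing antlr4 runtime is a setup problem, not a wrong answer
    except Exception:
        return False  # assumption: unparseable answers count as non-equivalent
    return exprs_equiv(pred, gold, timeout_s)
```

With something like this, the metric could fall back to `latex_equiv(pred, gold)` only when exact string match fails, so answers that already match keep their current score and the extra cost is limited to mismatches.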