huggingface / lighteval

Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends
MIT License
687 stars 78 forks source link

Add Sympy equivalence for MATH / GSM8K? #170

Open lewtun opened 5 months ago

lewtun commented 5 months ago

In the Minerva and LLeMMa papers, sympy is used to ensure equivalence of predicted / gold answers, e.g. ensuring $1/ \sqrt{3}$ and $\sqrt{3}/3$ are treated the same. From the Minerva paper:

After applying this normalization function, we checked whether the formatted target and prediction strings are SymPy-equivalent. SymPy equivalence is determined by parsing the answers via sympy.parsing.latex.parse_latex and then checking whether substracting the two resulting SymPy objects and applying sympy.simplify gives zero. We set a timeout of 5s when calling sympy.simplify, and labeled strings as nonequivalent if this timeout was exceeded. For MATH problems, SymPy equivalence improved overall accuracy by around 1%. See Table 6 for the accuracies in MATH with only exact string match vs. SymPy equivalence.

Although the difference between Minerva & OpenAI models was only 1%, would it make sense to add sympy to the MATH metric for both correctness and potentially uncovering larger variation among open models?

clefourrier commented 4 months ago

Hi! Yes, I think that's a very good idea.