meta-llama / llama-models

Utilities intended for use with Llama models.

Can you provide more details for MATH evaluation? #51

Open Kipok opened 1 month ago

Kipok commented 1 month ago

Is it possible to provide more details about the MATH benchmark evaluation https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/eval_details.md#math?

E.g., it would be great to know exactly how SymPy is used to compare answers. Also, how is the SymPy score combined with the LLM-as-a-judge score? How are "complex expressions" defined? And what do you do when the SymPy and judge scores disagree?

nectariferous commented 1 month ago

SymPy is usually used to compare mathematical expressions symbolically. This means it can determine if two expressions are equivalent even if they're written differently. For example, SymPy would recognize that "x^2 + 2x + 1" is equivalent to "(x+1)^2".
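For instance, a minimal check might parse both answers and test whether their difference simplifies to zero. This is just a sketch of the general idea, not the exact logic used in the Llama evals:

```python
# Minimal sketch of a symbolic equivalence check with SymPy
# (illustration only; not the actual eval implementation).
from sympy import simplify
from sympy.parsing.sympy_parser import parse_expr

def symbolically_equal(ans_a: str, ans_b: str) -> bool:
    """Return True if the two expression strings simplify to the same thing."""
    try:
        expr_a = parse_expr(ans_a)
        expr_b = parse_expr(ans_b)
        # If the difference simplifies to zero, the expressions are equivalent.
        return simplify(expr_a - expr_b) == 0
    except Exception:
        # Parsing failures fall through to other checks (e.g. string match).
        return False

print(symbolically_equal("x**2 + 2*x + 1", "(x + 1)**2"))  # True
```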

Combining SymPy and LLM-as-judge scores: Typically, these scores are combined using a weighted average or some form of ensemble method. The exact weights or method would depend on the specific evaluation setup. Sometimes, one method might be used as a primary score and the other as a tiebreaker.
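As an illustration only (the actual weights and method used in the Llama evals are not documented here), a weighted combination of the two binary verdicts might look like:

```python
# Hypothetical combination of a SymPy verdict with an LLM-judge verdict.
# The weight value is an assumption, not a documented setting.
def combined_score(sympy_match: bool, judge_match: bool,
                   sympy_weight: float = 0.7) -> float:
    # Weighted average of two binary verdicts (True -> 1.0, False -> 0.0).
    return sympy_weight * float(sympy_match) + (1.0 - sympy_weight) * float(judge_match)
```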

Definition of "complex expressions": This would likely refer to mathematical expressions that involve multiple operations, variables, or functions. Things like integrals, limits, or expressions with nested fractions might be considered complex. The exact definition would be specified in the evaluation guidelines.
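As a purely assumed heuristic (the eval guidelines may define this differently), one could flag calculus constructs or operation-heavy expressions as "complex":

```python
# One plausible heuristic for flagging an expression as "complex";
# the threshold and criteria are assumptions for illustration.
from sympy import Integral, Limit, count_ops
from sympy.parsing.sympy_parser import parse_expr

def is_complex(expr_str: str, op_threshold: int = 5) -> bool:
    expr = parse_expr(expr_str)
    # Calculus constructs or many operations count as "complex".
    return expr.has(Integral, Limit) or count_ops(expr) > op_threshold

print(is_complex("x + 1"))                      # False
print(is_complex("Integral(1/(1 + x**2), x)"))  # True
```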

Handling disagreements between SymPy and judge scores: In cases of disagreement, there might be a defined protocol. Possibilities include treating one method as authoritative, using the other as a tiebreaker, or flagging the example for manual review.
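A rough sketch of one such protocol, under the assumption (not documented behavior) that SymPy is treated as primary whenever it can decide:

```python
# Assumed disagreement protocol: trust SymPy when it could decide,
# fall back to the judge otherwise, and flag disagreements for review.
from typing import Optional, Tuple

def resolve(sympy_result: Optional[bool], judge_match: bool) -> Tuple[bool, bool]:
    if sympy_result is None:            # SymPy could not parse/decide
        return judge_match, False
    needs_review = sympy_result != judge_match
    return sympy_result, needs_review   # (final verdict, flag for manual review)
```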

sriramsowmithri9807 commented 1 month ago

@nectariferous The approach you posted is good for comparison, but you will run into some errors with it in practice.

sriramsowmithri9807 commented 1 month ago

@Kipok The MATH benchmark uses SymPy to verify the correctness of model-generated answers by simplifying expressions and checking equality. Scores from SymPy and the LLM-as-a-judge are combined, with SymPy likely prioritized for accuracy. Complex expressions are defined by their multiple operations, and discrepancies between scores are handled by favoring the SymPy evaluation.

sriramsowmithri9807 commented 1 month ago

@Kipok We could also look into using statsmodels to improve some of the math handling for LLMs.