meta-llama / llama-models

Utilities intended for use with Llama models.

Can you provide more details for MATH evaluation? #51

Open Kipok opened 1 month ago

Kipok commented 1 month ago

Is it possible to provide more details about the MATH benchmark evaluation https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/eval_details.md#math?

E.g., it would be great to know exactly how SymPy is used to compare answers. Also, how is the SymPy score combined with the LLM-as-a-judge score? How are "complex expressions" defined? And what do you do when the SymPy and judge scores disagree?

nectariferous commented 1 month ago

SymPy is usually used to compare mathematical expressions symbolically. This means it can determine if two expressions are equivalent even if they're written differently. For example, SymPy would recognize that "x^2 + 2x + 1" is equivalent to "(x+1)^2".
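For instance, a minimal check might parse both answers and test whether their difference simplifies to zero. This is just a sketch of the general idea, not the exact logic used in the Llama evals:

```python
# Minimal sketch of a symbolic equivalence check with SymPy
# (illustration only; not the actual eval implementation).
from sympy import simplify
from sympy.parsing.sympy_parser import parse_expr

def symbolically_equal(ans_a: str, ans_b: str) -> bool:
    """Return True if the two expression strings simplify to the same thing."""
    try:
        expr_a = parse_expr(ans_a)
        expr_b = parse_expr(ans_b)
        # If the difference simplifies to zero, the expressions are equivalent.
        return simplify(expr_a - expr_b) == 0
    except Exception:
        # Parsing failures fall through to other checks (e.g. string match).
        return False

print(symbolically_equal("x**2 + 2*x + 1", "(x + 1)**2"))  # True
```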

Combining SymPy and LLM-as-judge scores: Typically, these scores are combined using a weighted average or some form of ensemble method. The exact weights or method would depend on the specific evaluation setup. Sometimes, one method might be used as a primary score and the other as a tiebreaker.
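As an illustration only (the actual weights and method used in the Llama evals are not documented here), a weighted combination of the two binary verdicts might look like:

```python
# Hypothetical combination of a SymPy verdict with an LLM-judge verdict.
# The weight value is an assumption, not a documented setting.
def combined_score(sympy_match: bool, judge_match: bool,
                   sympy_weight: float = 0.7) -> float:
    # Weighted average of two binary verdicts (True -> 1.0, False -> 0.0).
    return sympy_weight * float(sympy_match) + (1.0 - sympy_weight) * float(judge_match)
```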

Definition of "complex expressions": This would likely refer to mathematical expressions that involve multiple operations, variables, or functions. Things like integrals, limits, or expressions with nested fractions might be considered complex. The exact definition would be specified in the evaluation guidelines.
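As a purely assumed heuristic (the eval guidelines may define this differently), one could flag calculus constructs or operation-heavy expressions as "complex":

```python
# One plausible heuristic for flagging an expression as "complex";
# the threshold and criteria are assumptions for illustration.
from sympy import Integral, Limit, count_ops
from sympy.parsing.sympy_parser import parse_expr

def is_complex(expr_str: str, op_threshold: int = 5) -> bool:
    expr = parse_expr(expr_str)
    # Calculus constructs or many operations count as "complex".
    return expr.has(Integral, Limit) or count_ops(expr) > op_threshold

print(is_complex("x + 1"))                      # False
print(is_complex("Integral(1/(1 + x**2), x)"))  # True
```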

Handling disagreements between SymPy and judge scores: In cases of disagreement, there might be a defined protocol. Possibilities include treating one method as authoritative, using the other as a tiebreaker, or flagging the example for manual review.
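A rough sketch of one such protocol, under the assumption (not documented behavior) that SymPy is treated as primary whenever it can decide:

```python
# Assumed disagreement protocol: trust SymPy when it could decide,
# fall back to the judge otherwise, and flag disagreements for review.
from typing import Optional, Tuple

def resolve(sympy_result: Optional[bool], judge_match: bool) -> Tuple[bool, bool]:
    if sympy_result is None:            # SymPy could not parse/decide
        return judge_match, False
    needs_review = sympy_result != judge_match
    return sympy_result, needs_review   # (final verdict, flag for manual review)
```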

sriramsowmithri9807 commented 1 month ago

@nectariferous The approach you posted is good for comparison, but you will run into some errors with it in practice.

sriramsowmithri9807 commented 1 month ago

@Kipok The MATH benchmark uses SymPy to verify the correctness of model-generated answers by simplifying expressions and checking equality. Scores from SymPy and the LLM-as-a-judge are combined, with SymPy likely prioritized for accuracy. Complex expressions are defined by their multiple operations, and discrepancies between scores are handled by favoring the SymPy evaluation.

sriramsowmithri9807 commented 1 month ago

@Kipok We could also look into using statsmodels to improve some of the math handling for LLMs.