huggingface / lighteval

Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends
MIT License

[BUG] Zero accuracy in Hellaswag for Llama-2-7b (using 8bit quantization) #275

Open rankofootball opened 2 months ago

rankofootball commented 2 months ago

command:

accelerate launch run_evals_accelerate.py --model_args="Llama-2-7b-chat-hf-8bit,quantization_config="load_in_8bit=True"" --tasks "helm|hellaswag|1|0" -- --output_dir ./evalscratch
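For reference, a minimal sketch of roughly what that quantization argument corresponds to when loading the model directly with transformers; the local path Llama-2-7b-chat-hf-8bit is taken from the command above, and the device_map choice is an added assumption rather than what lighteval does internally:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_path = "Llama-2-7b-chat-hf-8bit"  # local path taken from the command above

# load_in_8bit=True is handed to bitsandbytes via the quantization config
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",  # assumption: let accelerate place the layers
)
tokenizer = AutoTokenizer.from_pretrained(model_path)
```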

Result is 0% correct. Llama-3 works fine, as does MMLU for Llama-2.

Is there any way to log the individual outputs?

rankofootball commented 2 months ago

I found by inspecting the parquet output that gold and prediction differ by a leading space: [' A'] ['C'] [' B'] ['C'] [' C'] ['C'] [' A'] ['A'] [' B'] ['B'] [' B'] ['B'] [' A'] ['B'] [' D'] ['D'] [' A'] ['B'] [' B'] ['C']

Is the leading space in the gold normal?
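A minimal sketch of this kind of inspection; the file path and the column names ("gold", "predictions") are assumptions and should be adjusted to whatever the details file under your --output_dir actually contains:

```python
import pandas as pd

# Hypothetical file name; point this at the details parquet in ./evalscratch
details_path = "evalscratch/details/details_helm_hellaswag_1.parquet"
df = pd.read_parquet(details_path)

# repr() makes the leading space visible, e.g. ' A' (gold) vs. 'C' (prediction)
for gold, pred in zip(df["gold"], df["predictions"]):
    print(repr(gold), repr(pred))
```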

clefourrier commented 2 months ago

Hi! This is due to tokenization issues iirc. A simple fix would be to change the task so the prompt ends with "Answer: " (trailing space) and the gold is "A", instead of the prompt ending with "Answer:" and the gold being " A". @NathanHB wdyt?
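An illustrative sketch of the idea only, not lighteval's actual task definition: move the space that currently sits at the start of the gold/choice strings into the prompt itself, so gold and prediction compare as "A" vs "A" rather than " A" vs "A". The helper name and prompt layout below are hypothetical.

```python
def build_hellaswag_prompt(question: str, options: list[str]) -> tuple[str, list[str]]:
    """Hypothetical helper: keep the space in the prompt, not in the gold/choices."""
    lettered = "\n".join(f"{letter}. {opt}" for letter, opt in zip("ABCD", options))
    # Trailing space after "Answer:" so the choices (and the gold) can be bare letters.
    query = f"{question}\n{lettered}\nAnswer: "
    choices = list("ABCD")[: len(options)]
    return query, choices
```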