rankofootball opened this issue 2 months ago
I found by inspecting the parquet output that gold and prediction differ by a leading space (gold on the left, prediction on the right):

```
' A'  'C'
' B'  'C'
' C'  'C'
' A'  'A'
' B'  'B'
' B'  'B'
' A'  'B'
' D'  'D'
' A'  'B'
' B'  'C'
```
Is the space in the gold answers normal?
Hi!
This is due to tokenization issues iirc.
A simple fix would be to change the task so that the prompt ends with `Answer: ` (trailing space) and the gold is `A`, instead of the prompt ending with `Answer:` and the gold being ` A`. See the sketch below for why the placement of the space matters.
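A hypothetical minimal sketch (not the lighteval code path) of the effect: it tokenizes the two placements of the space with the Llama-2 tokenizer mentioned in this issue. The prompt/gold strings are just the two formats discussed above.

```python
# Hypothetical illustration: compares the two placements of the space between
# "Answer:" and the gold letter, and how each tokenizes with Llama-2's tokenizer.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")  # gated repo, assumes access

prompt_now, gold_now = "Answer:", " A"   # current format: the space lives in the gold
prompt_fix, gold_fix = "Answer: ", "A"   # proposed format: the space lives in the prompt

# The gold tokenized on its own vs. as part of the full sequence can split differently,
# which is where leading-space mismatches between gold and prediction strings come from.
print(tok.tokenize(gold_now), "vs", tok.tokenize(gold_fix))
print(tok.tokenize(prompt_now + gold_now), "vs", tok.tokenize(prompt_fix + gold_fix))
```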
@NathanHB wdyt?
Command:

```
accelerate launch run_evals_accelerate.py --model_args="Llama-2-7b-chat-hf-8bit,quantization_config="load_in_8bit=True"" --tasks "helm|hellaswag|1|0" -- --output_dir ./evalscratch
```
The result is 0% correct. Llama-3 works fine, as does MMLU for Llama-2.
Is there any way to log the individual outputs?
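For reference, one rough way to eyeball the per-sample outputs is to read the details parquet mentioned in the first comment directly. The path and the column names below are assumptions, so adjust them to whatever the run actually wrote:

```python
# Rough sketch for inspecting per-sample gold/prediction pairs from the details
# parquet. The path and the column names "gold"/"predictions" are assumptions.
import pandas as pd

df = pd.read_parquet("evalscratch/details.parquet")  # placeholder path

def norm(x):
    # Values in the dump above look like single-element lists (e.g. [' A']);
    # unwrap those, then strip leading/trailing whitespace before comparing.
    if not isinstance(x, str) and hasattr(x, "__len__") and len(x) == 1:
        x = x[0]
    return str(x).strip()

for gold, pred in zip(df["gold"], df["predictions"]):
    print(repr(gold), repr(pred), "OK" if norm(gold) == norm(pred) else "MISMATCH")
```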