[BUG] Question on batch preparation in MMLU evaluation

JefferyChen453 commented 2 months ago

The bug I encountered is similar to #203. I'm trying to reproduce the evaluation results of the ablation model trained on FineWeb, using LightEval at commit_id=a98210fd3a2d1e8bface1c32b72ebd5017173a4c.

The MMLU results for step-5000/10000/15000/19000/24000 (that is, 5 checkpoints from the first 50B consumed tokens) are shown below: (figure: MMLU results for the five checkpoints)

I don't know what causes the gap between my results and the official ones. While debugging, I discovered that the last token of the prepared_batch is missing. Does this mean the evaluation results in the FineWeb blog post are inaccurate?

But when I delete [:-1] in https://github.com/huggingface/lighteval/blob/aaa8bbf705b6f090fb07ad36503f39b5e922a6df/src/lighteval/models/base_model.py#L851, the evaluation results become random guessing for all checkpoints. I suppose there are more lines to modify, or something else is causing the gap in my reproduction results.
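For context, here is a minimal sketch of why loglikelihood-style scoring with a causal LM commonly feeds the model everything except the last token, and why removing only the [:-1] can collapse the scores. It is a toy illustration under stated assumptions, not LightEval's actual code: toy_logits, continuation_logprob, and the drop_last flag are hypothetical names, and the "model" simply prefers (token + 1) % VOCAB as the next token.

```python
import math

VOCAB = 10  # hypothetical toy vocabulary size


def toy_logits(tokens):
    """Hypothetical causal LM: the row at position i scores the NEXT token,
    and it favors (tokens[i] + 1) % VOCAB."""
    rows = []
    for t in tokens:
        row = [0.0] * VOCAB
        row[(t + 1) % VOCAB] = 5.0
        rows.append(row)
    return rows


def log_softmax(row):
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def continuation_logprob(context, continuation, drop_last=True):
    """Sum of log-probabilities assigned to the continuation tokens.

    With drop_last=True the input is (context + continuation)[:-1], so the
    last len(continuation) logit rows line up with the continuation tokens.
    With drop_last=False the same slice is shifted by one position.
    """
    full = context + continuation
    inputs = full[:-1] if drop_last else full
    cont_rows = toy_logits(inputs)[-len(continuation):]
    return sum(log_softmax(row)[tok] for row, tok in zip(cont_rows, continuation))


context = [1, 2, 3]
good = [4, 5]  # the continuation the toy model prefers
bad = [7, 9]   # a continuation the toy model does not prefer

aligned_good = continuation_logprob(context, good, drop_last=True)
aligned_bad = continuation_logprob(context, bad, drop_last=True)
shifted_good = continuation_logprob(context, good, drop_last=False)
shifted_bad = continuation_logprob(context, bad, drop_last=False)

print("aligned:", aligned_good, aligned_bad)
print("shifted:", shifted_good, shifted_bad)
```

In this toy setup the aligned call ranks good above bad, while the shifted call gives both continuations the same score, which is what random guessing looks like. If LightEval uses the same alignment, the missing last token would be intentional, and removing the slice without also adjusting the logit selection would misalign every score. That is an assumption to verify against the code at the linked line.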

JefferyChen453 commented 2 months ago

I've tried adding the parameter add_special_tokens=True in the config file, but the last token is still missing.

JefferyChen453 commented 2 months ago

Using the latest repo (commit_id = 7261d80d5679cd91c5c20cf2a7823f092ff66251), I evaluated the same 5 checkpoints again (red line in the figure). The results are still below the official results. (figure: plot_mmlu_acc_norm)

When examining the prepared_batch, the last token still appears to be missing.

My command:

accelerate launch --num_processes=1 -m \
    lighteval accelerate \
    --model_args="pretrained=/mnt/data/user/tc_agi/caijie/fineweb_models/ablation-model-fineweb-v1_5000,trust_remote_code=True" \
    --override_batch_size 128 \
    --custom_tasks "/data/fineweb-pipeline/lighteval-main/lighteval_tasks.py" \
    --output_dir "/data/fineweb-pipeline/lighteval-main/evals/" \
    --tasks "custom|mmlu:abstract_algebra|0|1"

clefourrier commented 2 months ago

Thanks for the report, we'll investigate! cc @hynky1999 and @guipenedo for the fineweb aspect

huggingface/lighteval #288