openlm-research / open_llama

OpenLLaMA, a permissively licensed open source reproduction of Meta AI’s LLaMA 7B trained on the RedPajama dataset
Apache License 2.0

Could not reproduce the evaluation results #44

guanqun-yang opened this issue 1 year ago

guanqun-yang commented 1 year ago

Hi,

I am trying to reproduce your reported numbers using the commands provided by the LM Evaluation Harness. One of the commands looks like the following:

python main.py \
--model hf-causal-experimental \
--model_args pretrained=openlm-research/open_llama_3b,use_accelerate=True,dtype=half \
--tasks  arc_challenge \
--batch_size 8 \
--num_fewshots 25 \
--write_out

which gave me a directory of .json files that looks like the following:

├── arc_challenge_write_out_info.json
├── hellaswag_write_out_info.json
└── truthfulqa_mc_write_out_info.json

I tried to compute the final results using the script below but found that the numbers I obtained were quite different from what you reported. I don't know which part went wrong.

| model | arc_challenge | hellaswag | truthfulqa_mc |
|---|---|---|---|
| openlm-research/open_llama_3b | 0.260239 | 0.25941 | 0.487843 |
| openlm-research/open_llama_7b | 0.261092 | 0.262298 | 0.483711 |

Here is the script I used to create the table:

import pandas as pd
from tqdm import tqdm

# Task name, few-shot count, and the metric reported for each task.
task_dict = {
    "arc_challenge": {"metric": "acc_norm", "shot": 25, "task_name": "arc_challenge"},
    "hellaswag": {"metric": "acc_norm", "shot": 10, "task_name": "hellaswag"},
    "truthfulqa_mc": {"metric": "mc2", "shot": 0, "task_name": "truthfulqa_mc"},
}

models = [
    "openlm-research/open_llama_3b",
    "openlm-research/open_llama_7b",
]

records = list()
for model in tqdm(models):
    for task, d in task_dict.items():
        task_name = d["task_name"]
        metric_name = d["metric"]

        # Each write-out file holds per-example results; average the metric column.
        df = pd.read_json(f"results/{model}/{task_name}.json")
        records.append(
            {
                "model": model,
                "task": task,
                "metric_name": metric_name,
                "metric": df[metric_name].mean(),
            }
        )

stat_df = pd.DataFrame(records)
stat_df = pd.pivot_table(stat_df, index="model", columns="task", values="metric")

print(stat_df.to_markdown())

young-geng commented 1 year ago

We believe that this is related to the HF fast tokenizer problem mentioned in the readme here. You'll need to avoid using the auto-converted fast tokenizer to get correct tokenization.
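
For reference, a minimal sketch of loading the tokenizer without the auto-converted fast version, assuming the 3B checkpoint from the commands above (the example prompt is arbitrary):

from transformers import AutoTokenizer

model_id = "openlm-research/open_llama_3b"

# use_fast=False skips the auto-converted fast tokenizer, which the readme
# warns can tokenize incorrectly for OpenLLaMA checkpoints.
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False)

print(tokenizer("Hello, OpenLLaMA!").input_ids)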

That being said, our evaluation is run in JAX rather than PyTorch. You can follow the evaluation doc of our framework to reproduce our evaluation.

guanqun-yang commented 1 year ago

Thank you for your prompt response @young-geng! However, after correcting that mistake by passing the expected use_fast=False and rerunning the entire evaluation, I got the same near-random results (around 25%), still quite different from what you reported.

I am unsure whether your process of converting the model from the JAX format to the PyTorch format is airtight.

chi2liu commented 1 year ago

@guanqun-yang @young-geng Did you use zero few-shot examples? I used the default 0-shot setting, and my results are almost the same.

young-geng commented 1 year ago

@guanqun-yang I just ran 25-shot arc_challenge with use_fast=False, and here's my result:

| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| arc_challenge | 0 | acc | 0.4369 | ± 0.0145 |
| | | acc_norm | 0.4735 | ± 0.0146 |

Here's my result for 10-shot hellaswag:

| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| hellaswag | 0 | acc | 0.5358 | ± 0.0050 |
| | | acc_norm | 0.7205 | ± 0.0045 |

These results do match the evaluation we did in JAX. I've also numerically compared the JAX model with the PyTorch model, and the output logits match quite well (around 1e-8 error on CPU; the error is higher on GPU, depending on the precision used).
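
For illustration, a minimal sketch of the kind of logit comparison described above, assuming the JAX model's logits for the same token ids were exported beforehand (the prompt and the jax_logits.npy file are hypothetical; float32 on CPU is assumed for the tightest match):

import numpy as np
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openlm-research/open_llama_3b"

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32)
model.eval()

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    torch_logits = model(**inputs).logits.squeeze(0).numpy()

# Hypothetical export of the JAX model's logits for the exact same token ids.
jax_logits = np.load("jax_logits.npy")

print("max abs logit difference:", np.abs(torch_logits - jax_logits).max())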

guanqun-yang commented 1 year ago

@chi2liu I am trying to reproduce the numbers on the Open LLM Leaderboard, which specifies the number of shots for each task.

guanqun-yang commented 1 year ago

@young-geng Thank you for reproducing the results! Did you try to obtain the write-out .json files? I computed the metrics based on those files.

young-geng commented 1 year ago

@guanqun-yang I did not save those json files, but I did use the same 25 shots for arc_challenge and 10 shots for hellaswag, matching the Open LLM Leaderboard. I just realized that you are evaluating the 3b model while I was evaluating the 7b model. Let me also try the 3b model.

young-geng commented 1 year ago

Here are the results for the 3b model:

| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| arc_challenge | 0 | acc | 0.3686 | ± 0.0141 |
| | | acc_norm | 0.4096 | ± 0.0144 |

| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| hellaswag | 0 | acc | 0.4956 | ± 0.0050 |
| | | acc_norm | 0.6681 | ± 0.0047 |

guanqun-yang commented 1 year ago

@young-geng Thank you for reproducing the evaluations! It could be something subtle that is causing the issue. Let me double-check and report back here. Also, are you using the same command as I did, or something different?

guanqun-yang commented 1 year ago

@young-geng It seems that I have located the issue. I am able to reproduce the reported numbers using the command:

python ../lm-evaluation-harness/main.py \
--model hf-causal-experimental \
--model_args pretrained=openlm-research/open_llama_7b,use_accelerate=True,dtype=half \
--batch_size 16 \
--tasks arc_challenge \
--num_fewshot 25 \
--write_out \
--output_base_path <path>

Here is what I believe caused the difference: my original command used --num_fewshots, while the working command above uses --num_fewshot, though such a small difference may sound unlikely to be the cause.

currents-abhishek commented 1 year ago

:detective: @guanqun-yang Looks like the original command you posted has num_fewshots, while the one above has num_fewshot.