openlm-research / open_llama

OpenLLaMA, a permissively licensed open source reproduction of Meta AI’s LLaMA 7B trained on the RedPajama dataset
Apache License 2.0

Could not reproduce the evaluation results #44

guanqun-yang opened this issue 1 year ago

guanqun-yang commented 1 year ago

Hi,

I am trying to reproduce your reported numbers using the commands provided by the LM Evaluation Harness. One of the commands looks like the following:

python main.py \
--model hf-causal-experimental \
--model_args pretrained=openlm-research/open_llama_3b,use_accelerate=True,dtype=half \
--tasks  arc_challenge \
--batch_size 8 \
--num_fewshots 25 \
--write_out

which gave me a directory of .json files that looks like the following:

├── arc_challenge_write_out_info.json
├── hellaswag_write_out_info.json
└── truthfulqa_mc_write_out_info.json

I tried to compute the final results using the script below but found that the numbers I obtained were quite different from what you reported. I don't know which part went wrong.

| model | arc_challenge | hellaswag | truthfulqa_mc |
|---|---|---|---|
| openlm-research/open_llama_3b | 0.260239 | 0.25941 | 0.487843 |
| openlm-research/open_llama_7b | 0.261092 | 0.262298 | 0.483711 |

Here is the script I used to create the table:

import pandas as pd
from tqdm import tqdm

# Task name, few-shot count, and the metric reported for each task.
task_dict = {
    "arc_challenge": {"metric": "acc_norm", "shot": 25, "task_name": "arc_challenge"},
    "hellaswag": {"metric": "acc_norm", "shot": 10, "task_name": "hellaswag"},
    "truthfulqa_mc": {"metric": "mc2", "shot": 0, "task_name": "truthfulqa_mc"},
}

models = [
    "openlm-research/open_llama_3b",
    "openlm-research/open_llama_7b",
]

records = list()
for model in tqdm(models):
    for task, d in task_dict.items():
        task_name = d["task_name"]
        metric_name = d["metric"]

        # Each write-out file holds per-example results; average the metric column.
        df = pd.read_json(f"results/{model}/{task_name}.json")
        records.append(
            {
                "model": model,
                "task": task,
                "metric_name": metric_name,
                "metric": df[metric_name].mean(),
            }
        )

stat_df = pd.DataFrame(records)
stat_df = pd.pivot_table(stat_df, index="model", columns="task", values="metric")

print(stat_df.to_markdown())

young-geng commented 1 year ago

We believe that this is related to the HF fast tokenizer problem mentioned in the readme here. You'll need to avoid using the auto-converted fast tokenizer to get correct tokenization.
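
For reference, a minimal sketch of loading the tokenizer without the auto-converted fast version, assuming the 3B checkpoint from the commands above (the example prompt is arbitrary):

from transformers import AutoTokenizer

model_id = "openlm-research/open_llama_3b"

# use_fast=False skips the auto-converted fast tokenizer, which the readme
# warns can tokenize incorrectly for OpenLLaMA checkpoints.
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False)

print(tokenizer("Hello, OpenLLaMA!").input_ids)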

That being said, our evaluation is run in JAX rather than PyTorch. You can follow the evaluation doc of our framework to reproduce our evaluation.

guanqun-yang commented 1 year ago

Thank you for your prompt response @young-geng! However, after correcting that mistake by passing the expected use_fast=False and rerunning the entire evaluation, I got the same near-random results (around 25%), still quite different from what you reported.

I am unsure whether your process of converting the model from the JAX format to the PyTorch format is airtight.

chi2liu commented 1 year ago

@guanqun-yang @young-geng Did you use zero few-shot examples? I used the default 0-shot setting, and my results are almost the same.

young-geng commented 1 year ago

@guanqun-yang I just ran 25-shot arc_challenge with use_fast=False, and here's my result:

| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| arc_challenge | 0 | acc | 0.4369 | ± 0.0145 |
| | | acc_norm | 0.4735 | ± 0.0146 |

Here's my result for 10-shot hellaswag:

| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| hellaswag | 0 | acc | 0.5358 | ± 0.0050 |
| | | acc_norm | 0.7205 | ± 0.0045 |

These results do match the evaluation we did in JAX. I've also numerically compared the JAX model with the PyTorch model, and the output logits match quite well (around 1e-8 error on CPU; the error is higher on GPU, depending on the precision used).
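
For illustration, a minimal sketch of the kind of logit comparison described above, assuming the JAX model's logits for the same token ids were exported beforehand (the prompt and the jax_logits.npy file are hypothetical; float32 on CPU is assumed for the tightest match):

import numpy as np
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openlm-research/open_llama_3b"

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32)
model.eval()

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    torch_logits = model(**inputs).logits.squeeze(0).numpy()

# Hypothetical export of the JAX model's logits for the exact same token ids.
jax_logits = np.load("jax_logits.npy")

print("max abs logit difference:", np.abs(torch_logits - jax_logits).max())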

guanqun-yang commented 1 year ago

@chi2liu I am trying to reproduce the numbers on the Open LLM Leaderboard, which specifies the number of shots for each task.

guanqun-yang commented 1 year ago

@young-geng Thank you for reproducing the results! Did you try to obtain the write-out .json files? I computed the metrics based on those files.

young-geng commented 1 year ago

@guanqun-yang I did not save those json files, but I did use the same 25 shots for arc_challenge and 10 shots for hellaswag, matching the Open LLM Leaderboard. I just realized that you are evaluating the 3b model while I was evaluating the 7b model. Let me also try the 3b model.

young-geng commented 1 year ago

Here are the results for the 3b model:

| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| arc_challenge | 0 | acc | 0.3686 | ± 0.0141 |
| | | acc_norm | 0.4096 | ± 0.0144 |

| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| hellaswag | 0 | acc | 0.4956 | ± 0.0050 |
| | | acc_norm | 0.6681 | ± 0.0047 |

guanqun-yang commented 1 year ago

@young-geng Thank you for reproducing the evaluations! It could be something subtle that is causing the issue. Let me double-check and report back here. Also, are you using the same command as I did, or something different?

guanqun-yang commented 1 year ago

@young-geng It seems that I have located the issue. I am able to reproduce the reported numbers using the command:

python ../lm-evaluation-harness/main.py \
--model hf-causal-experimental \
--model_args pretrained=openlm-research/open_llama_7b,use_accelerate=True,dtype=half \
--batch_size 16 \
--tasks arc_challenge \
--num_fewshot 25 \
--write_out \
--output_base_path <path>

Here is what I believe caused the difference: my original command used --num_fewshots, while the working command above uses --num_fewshot, though such a small difference may sound unlikely to be the cause.

currents-abhishek commented 1 year ago

:detective: @guanqun-yang Looks like the original command you posted has num_fewshots, while the one above has num_fewshot.