guanqun-yang opened 1 year ago
We believe that this is related to the HF fast tokenizer problem mentioned in the readme here. You'll need to avoid using the auto-converted fast tokenizer to get correct tokenization.
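For illustration, here is a minimal sketch of loading the slow (non-fast) tokenizer via `transformers`; the 7b checkpoint name below is just an example:

```python
from transformers import AutoTokenizer

# Load the slow SentencePiece-based tokenizer instead of the auto-converted
# fast tokenizer, which can produce incorrect tokenization for these checkpoints.
tokenizer = AutoTokenizer.from_pretrained(
    "openlm-research/open_llama_7b", use_fast=False
)

print(tokenizer.tokenize("The quick brown fox jumps over the lazy dog"))
```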
That being said, our evaluation is run in JAX instead of PyTorch. You can follow the evaluation doc of our framework to reproduce our evaluation.
Thank you for your prompt response @young-geng! But correcting the said mistake to the expected `use_fast=False` and rerunning the entire evaluation gave me the same near-random results (around 25%), still quite different from what you reported.
I am unsure whether your process of converting from the JAX format to the `torch` format is airtight.
@guanqun-yang @young-geng did you use zero-shot? I used the default 0 few-shot setting, and the results are almost the same.
@guanqun-yang I just ran 25-shot arc_challenge with `use_fast=False`, and here's my result:
| Task | Version | Metric | Value |  | Stderr |
|---|---|---|---|---|---|
| arc_challenge | 0 | acc | 0.4369 | ± | 0.0145 |
|  |  | acc_norm | 0.4735 | ± | 0.0146 |
Here's my result for 10-shot hellaswag:
| Task | Version | Metric | Value |  | Stderr |
|---|---|---|---|---|---|
| hellaswag | 0 | acc | 0.5358 | ± | 0.0050 |
|  |  | acc_norm | 0.7205 | ± | 0.0045 |
These results do match the evaluation we did in JAX. I've also numerically compared the JAX model with the PyTorch model, and the logits match pretty well (around 1e-8 error on CPU; the error is higher on GPU, depending on the precision used).
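For reference, a rough sketch of that kind of logit comparison on the PyTorch side, assuming the JAX-side logits for the same input have already been saved to a `.npy` file (the file name below is a placeholder):

```python
import numpy as np
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "openlm-research/open_llama_7b"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32)
model.eval()

# Run the PyTorch model on a fixed prompt and collect its logits.
inputs = tokenizer("The quick brown fox jumps over the lazy dog", return_tensors="pt")
with torch.no_grad():
    torch_logits = model(**inputs).logits.squeeze(0).numpy()

# Placeholder path: logits exported from the JAX model on the same tokenized input.
jax_logits = np.load("jax_logits.npy")
print("max abs difference:", np.abs(torch_logits - jax_logits).max())
```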
@chi2liu I am trying to reproduce the numbers from the Open LLM Benchmark, which specifies the number of shots for each task.
@young-geng Thank you for reproducing the results! Did you try to obtain the write-out `.json` files? I computed the metrics based on those files.
@guanqun-yang I did not save those json files, but I did use the same 25 shots for arc_challenge and 10 shots for hellaswag, which is the same as the Open LLM Leaderboard. I just realized that you are evaluating the 3b model and I was evaluating the 7b model. Let me also try the 3b model.
Here are the results for the 3b model:

| Task | Version | Metric | Value |  | Stderr |
|---|---|---|---|---|---|
| arc_challenge | 0 | acc | 0.3686 | ± | 0.0141 |
|  |  | acc_norm | 0.4096 | ± | 0.0144 |
| Task | Version | Metric | Value |  | Stderr |
|---|---|---|---|---|---|
| hellaswag | 0 | acc | 0.4956 | ± | 0.0050 |
|  |  | acc_norm | 0.6681 | ± | 0.0047 |
@young-geng Thank you for reproducing the evaluations! It could be something subtle that is causing the issue. Let me double check and report back here. Also, are you using the same command as I did, or something different?
@young-geng It seems that I have located the issue. I am able to reproduce the reported number using the command:
```bash
python ../lm-evaluation-harness/main.py \
    --model hf-causal-experimental \
    --model_args pretrained=openlm-research/open_llama_7b,use_accelerate=True,dtype=half \
    --batch_size 16 \
    --tasks arc_challenge \
    --num_fewshot 25 \
    --write_out \
    --output_base_path <path>
```
Here is what I believe caused the difference: using `transformers` to handle the downloading and loading the model from `$HF_HOME`, though this difference may sound unlikely to be the cause.
:detective: @guanqun-yang -- Looks like the original command you posted has `num_fewshots` and the above one has `num_fewshot`.
Hi,
I am trying to reproduce your reported numbers using the command provided by LM Evaluation Harness. One of the commands looks like the following:
which gave me a directory of `.json` files that look like the following:

I tried to compute the final results using the script below but found that the numbers I obtained were quite different from what you reported. I don't know which part went wrong.
Here is the script I used to create the table:
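(The original script is not included above; as a rough illustration only, a minimal sketch of aggregating the `--write_out` files might look like the following, assuming each record carries per-example `acc`/`acc_norm` fields and the files live under the `--output_base_path` directory.)

```python
import json
import pathlib

# Placeholder: the directory passed to --output_base_path.
output_dir = pathlib.Path("output")

# Average the per-example metrics recorded in each write-out file.
for path in sorted(output_dir.glob("*.json")):
    records = json.loads(path.read_text())
    for metric in ("acc", "acc_norm"):
        values = [float(r[metric]) for r in records if metric in r]
        if values:
            print(f"{path.name}: {metric} = {sum(values) / len(values):.4f}")
```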