huggingface / lighteval

LightEval is a lightweight LLM evaluation suite that Hugging Face has been using internally with the recently released LLM data processing library datatrove and LLM training library nanotron.
MIT License

issues when testing wikitext #57

Open yangfuwei opened 6 months ago

yangfuwei commented 6 months ago

Hi, I'm using lighteval to test several benchmarks, but I ran into issues with the following two:

  1. lighteval|wikitext|0|0
  2. helm|wikitext:103|0|0

When testing wikitext, I got this error:

```
Traceback (most recent call last):
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval_v100/lighteval/run_evals_accelerate.py", line 97, in <module>
    main(args)
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval_v100/lighteval/src/lighteval/logging/hierarchical_logger.py", line 144, in wrapper
    return fn(*args, **kwargs)
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval_v100/lighteval/src/lighteval/main_accelerate.py", line 71, in main
    requests, docs = create_requests_from_tasks(
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval_v100/lighteval/src/lighteval/tasks/lighteval_task.py", line 615, in create_requests_from_tasks
    reqs = task.construct_requests(doc, ctx, doc_id_seed, cur_task_name)
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval_v100/lighteval/src/lighteval/tasks/lighteval_task.py", line 380, in construct_requests
    LoglikelihoodRollingRequest(task_name=current_task_name, doc_id=document_id_seed, ctx=context)  #LoglikelihoodRollingRequest(task_name=current_task_name, example_index=document_id_seed, request_index=0, context=context)
TypeError: LoglikelihoodRollingRequest.__init__() got an unexpected keyword argument 'doc_id'
```

I modified https://github.com/huggingface/lighteval/blob/main/src/lighteval/tasks/lighteval_task.py#L380 to `LoglikelihoodRollingRequest(task_name=current_task_name, example_index=document_id_seed, request_index=0, context=context)`, but that ends up with NaN perplexity.

When running wikitext:103, I got an error like:

```
Traceback (most recent call last):
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval_v100/lighteval/run_evals_accelerate.py", line 97, in <module>
    main(args)
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval_v100/lighteval/src/lighteval/logging/hierarchical_logger.py", line 144, in wrapper
    return fn(*args, **kwargs)
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval_v100/lighteval/src/lighteval/main_accelerate.py", line 71, in main
    requests, docs = create_requests_from_tasks(
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval_v100/lighteval/src/lighteval/tasks/lighteval_task.py", line 563, in create_requests_from_tasks
    task_dict_items = [(name, task) for name, task in task_dict.items() if len(task.eval_docs()) > 0]
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval_v100/lighteval/src/lighteval/tasks/lighteval_task.py", line 563, in <listcomp>
    task_dict_items = [(name, task) for name, task in task_dict.items() if len(task.eval_docs()) > 0]
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval_v100/lighteval/src/lighteval/tasks/lighteval_task.py", line 296, in eval_docs
    self._docs = self._get_docs_from_split(self.evaluation_split)
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval_v100/lighteval/src/lighteval/tasks/lighteval_task.py", line 266, in _get_docs_from_split
    docs.extend(as_list(self.formatter(item, self.name)))
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval_v100/lighteval/src/lighteval/tasks/tasks_prompt_formatting.py", line 2068, in wikitext_103
    return Doc(task_name=task_name, query=line["text"])
TypeError: Doc.__init__() missing 2 required positional arguments: 'choices' and 'gold_index'
```
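For reference, a formatter along these lines should at least satisfy the `Doc` constructor; the placeholder `choices`/`gold_index` values are only a guess from the traceback, not necessarily the proper fix:

```python
# Sketch of a local workaround in src/lighteval/tasks/tasks_prompt_formatting.py.
# Rolling-perplexity tasks have no real choices or gold target, so placeholder
# values are passed purely to satisfy Doc's required positional arguments.
def wikitext_103(line, task_name: str = None):
    return Doc(
        task_name=task_name,
        query=line["text"],
        choices=[""],  # assumption: unused by rolling-loglikelihood metrics
        gold_index=0,  # assumption: unused by rolling-loglikelihood metrics
    )
```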

The commands I used are:

```
python run_evals_accelerate.py --model_args="pretrained=gpt2" --tasks "lighteval|wikitext|0|0" --override_batch_size 1 --save_details --output_dir="tmp/"
python run_evals_accelerate.py --model_args="pretrained=gpt2" --tasks "helm|wikitext:103|0|0" --override_batch_size 1 --save_details --output_dir="tmp/"
```

Any advice on the issues? Thanks.

clefourrier commented 6 months ago

I'm investigating - it's likely that we missed changing the perplexity evals in our last refactor (before the release), so I'll add a patch for that - thanks a lot for the report!

yangfuwei commented 6 months ago
Hi @clefourrier, I saw the PR and tried the new code; however, I'm confused about the test results. For example, for GPT-2 the result is 19.1188, which does not match the result in the GPT-2 paper (29.41):

| Task | Version | Metric | Value | Stderr |
|------|---------|--------|-------|--------|
| lighteval:wikitext:0 | 0 | word_perplexity | 19.1188 | ± 0.2441 |
| | | byte_perplexity | 1.7364 | ± 0.0034 |
| | | bits_per_byte | 0.7961 | ± 0.0028 |

I also tested google/gemma-2b, and it seems unreasonable that the perplexity is so bad:

| Task | Version | Metric | Value | Stderr |
|------|---------|--------|-------|--------|
| lighteval:wikitext:0 | 0 | word_perplexity | 574.7692 | ± 17.9129 |
| | | byte_perplexity | 3.2812 | ± 0.0145 |
| | | bits_per_byte | 1.7142 | ± 0.0064 |

Do you have the corresponding test results? Thanks.
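For context on how the three reported metrics relate: they are typically derived from the same summed log-likelihood, normalised by different unit counts. Below is a minimal sketch of the Harness-style definitions; treat the exact formulas lighteval uses as an assumption:

```python
import math

def perplexity_metrics(total_loglikelihood: float, word_count: int, byte_count: int):
    """Derive the three reported metrics from one summed log-likelihood (in nats).

    total_loglikelihood: sum of log p(token | prefix) over the whole corpus.
    word_count / byte_count: size of the reference text in words / UTF-8 bytes.
    """
    return {
        # Perplexity normalised per word (the unit GPT-2-paper-style numbers use).
        "word_perplexity": math.exp(-total_loglikelihood / word_count),
        # Perplexity normalised per byte (tokeniser-independent).
        "byte_perplexity": math.exp(-total_loglikelihood / byte_count),
        # The same quantity expressed in bits per byte instead of nats.
        "bits_per_byte": -total_loglikelihood / (byte_count * math.log(2)),
    }

# Example: a corpus of 1,000 words / 5,000 bytes with total log-likelihood -3000 nats.
print(perplexity_metrics(-3000.0, word_count=1000, byte_count=5000))
```

With definitions like these, word_perplexity is very sensitive to how the reference text is split, preprocessed, and counted, which matters when comparing against paper numbers.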

clefourrier commented 6 months ago

Hi! It's very hard to reproduce this paper, as they don't describe precisely which subset of Wikitext they use (the raw form? the filtered form?), nor do they say (unless I missed it) whether the perplexity they report is per word, per byte, or some other kind of average. Do you have another model with more information that I could try to test this with?

clefourrier commented 6 months ago

However, I can confirm the mismatch with gemma-2b; investigating.

clefourrier commented 6 months ago

I added more choices to make more explicit what each call is doing.

The Helm and Harness tasks use an aggregated version of the text at the document level, with the Harness additionally applying a normalisation preprocessing step. The lighteval task uses the documents as they are split in the dataset, so it looks at the paragraph level rather than the document level, with no preprocessing.
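Concretely, the two granularities can be sketched as below, assuming the wikitext-2-raw-v1 layout on the Hub, where each row is one line of text and top-level " = Title = " lines start a new article; this is only an illustration, not the exact preprocessing either task applies:

```python
import re
from datasets import load_dataset

rows = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")

# Paragraph level (roughly the lighteval setup): score every non-empty row on its own.
paragraphs = [r["text"] for r in rows if r["text"].strip()]

# Document level (roughly the Helm/Harness setup): concatenate rows between
# top-level " = Title = " headings into one string per article.
documents, current = [], []
for r in rows:
    if re.match(r"^ = [^=].* = \s*$", r["text"]):  # start of a new article
        if "".join(current).strip():
            documents.append("".join(current))
        current = []
    current.append(r["text"])
if "".join(current).strip():
    documents.append("".join(current))

print(len(paragraphs), "paragraphs vs", len(documents), "documents")
```

Scoring paragraphs independently also discards cross-paragraph context, so the two setups can legitimately yield different perplexities on the same corpus.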

I merged the above patch to make sure the evals are running, but I'm OK with taking more time to investigate the results if you feel there is a need.

yangfuwei commented 6 months ago

Hi @clefourrier, thanks for your reply. I don't have other models to recommend, because it seems that recent models don't report wikitext results. Could you please help with the Gemma model? Its results on wikitext are abnormal, thank you.

clefourrier commented 6 months ago

Thanks a lot for your message! It seems to come from situations where the perplexity is supposed to be computed on text longer than the context length; I'm investigating.
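For anyone following along, the usual way to handle text longer than the context window is to split the token ids into context-sized windows and sum the per-token log-likelihoods across windows, so that every token is scored exactly once. A minimal sketch with transformers (illustrative only, not lighteval's actual implementation):

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small model with a 1024-token context, for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()
max_len = model.config.n_positions

def rolling_loglikelihood(text: str) -> tuple[float, int]:
    """Sum log p(token | prefix) over `text`, chunking into windows of at most
    `max_len` tokens; consecutive windows overlap by one token so that every
    token after the first is scored exactly once."""
    ids = tok(text, return_tensors="pt").input_ids[0]
    total, scored = 0.0, 0
    for start in range(0, len(ids), max_len - 1):
        window = ids[start : start + max_len].unsqueeze(0)
        with torch.no_grad():
            logits = model(window).logits
        logprobs = torch.log_softmax(logits[0, :-1], dim=-1)  # predictions for tokens 1..n-1
        targets = window[0, 1:]
        total += logprobs.gather(1, targets.unsqueeze(1)).sum().item()
        scored += targets.numel()
    return total, scored

ll, n_tokens = rolling_loglikelihood("some very long document " * 2000)
print("token-level perplexity:", math.exp(-ll / n_tokens))
```

Mishandling windows that exceed the context length (for example truncating or double-counting tokens) is one way to end up with extreme perplexities like the ones reported above.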

yangfuwei commented 6 months ago

Hi @clefourrier, is the code merged into the main branch? I used the latest version to test google/gemma-2b and gemma-7b on lighteval|wikitext:2|0|0, but I still get weird results. For gemma-2b, word_perplexity is 65.32650153145742, while for gemma-7b, word_perplexity is 359323042.28705126.

clefourrier commented 6 months ago

Hi @yangfuwei, yes, it was.

Interesting bug on gemma-7b; I'll investigate again to see if I can reproduce it, thanks!