Open yangfuwei opened 6 months ago
I'm investigating - it's likely that we missed changing the perplexity evals in our last refactor (before the release), so I'll add a patch for that - thanks a lot for the report!
Hi @clefourrier , I saw the PR and tried the new code. However, I'm confused about the results: for GPT2, word perplexity comes out at 19.1188, which does not match the result in the GPT2 paper (29.41).

| Task | Version | Metric | Value | | Stderr |
|---|---|---|---|---|---|
| lighteval:wikitext:0 | 0 | word_perplexity | 19.1188 | ± | 0.2441 |
| | | byte_perplexity | 1.7364 | ± | 0.0034 |
| | | bits_per_byte | 0.7961 | ± | 0.0028 |
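For what it's worth, the three reported metrics are at least internally consistent, assuming the usual definitions (`bits_per_byte = log2(byte_perplexity)`, and `word_perplexity = byte_perplexity ** (bytes / words)`); this is my own sanity check, not lighteval's code:

```python
import math

# Reported wikitext results for gpt2 (from the table above)
word_perplexity = 19.1188
byte_perplexity = 1.7364
bits_per_byte = 0.7961

# bits_per_byte should be log2 of byte_perplexity under the usual definitions
assert math.isclose(math.log2(byte_perplexity), bits_per_byte, abs_tol=1e-3)

# The bytes-per-word ratio implied by word_ppl = byte_ppl ** (bytes / words)
bytes_per_word = math.log(word_perplexity) / math.log(byte_perplexity)
print(f"implied bytes per word: {bytes_per_word:.2f}")  # roughly 5.3
```

So whatever the discrepancy with the paper is, it is not an inconsistency between the three metrics themselves.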
I also tested google/gemma-2b; the results are as follows, and the perplexity is unreasonably bad.

| Task | Version | Metric | Value | | Stderr |
|---|---|---|---|---|---|
| lighteval:wikitext:0 | 0 | word_perplexity | 574.7692 | ± | 17.9129 |
| | | byte_perplexity | 3.2812 | ± | 0.0145 |
| | | bits_per_byte | 1.7142 | ± | 0.0064 |
Did you have the corresponding test results? Thanks.
Hi! It's very hard to reproduce this paper, as they don't describe precisely which subset of Wikitext they use (the raw form? the filtered form?), nor do they say (unless I missed it) whether the perplexity they report is per word, per byte, or some other average. Do you have another model with more information that I could try/test this with?
However, I can confirm the mismatch with gemma2, investigating
I added more choices to make more explicit what each call is doing.
The Helm and Harness tasks use an aggregated version of the text at the document level, with the harness also applying a normalisation preprocessing step. The lighteval task uses the documents as they are split in the dataset, so it looks at the paragraph level rather than the document level, with no preprocessing.
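To make the difference concrete, here is a rough sketch (not lighteval's actual code; the row contents and the heading heuristic are illustrative assumptions) of how the two aggregation levels produce different numbers of perplexity requests:

```python
# Illustrative wikitext rows: article titles are " = Title = ",
# section headings are " = = Section = = " (an assumption about the raw format).
rows = [
    " = Valkyria Chronicles III = ",
    " Senjou no Valkyria 3 is a tactical role-playing game ... ",
    " = = Gameplay = = ",
    " As with previous games in the series ... ",
    " = Robert Boulter = ",
    " Robert Boulter is an English film actor ... ",
]

def paragraph_level(rows):
    # One rolling-loglikelihood request per dataset row (lighteval-style).
    return [r for r in rows if r.strip()]

def document_level(rows):
    # Merge rows into one text per article (harness/helm-style, roughly):
    # a new article starts at a top-level " = Title = " heading.
    docs = []
    for r in rows:
        s = r.strip()
        is_title = s.startswith("= ") and s.endswith(" =") and not s.startswith("= =")
        if is_title or not docs:
            docs.append(r)
        else:
            docs[-1] += r
    return docs

print(len(paragraph_level(rows)))  # 6 requests, one per paragraph
print(len(document_level(rows)))   # 2 requests, one per article
```

Since perplexity is a length-weighted average of per-token log-likelihoods, the split level changes how much context each token sees, which is enough to move the final number.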
I merged the above patch to make sure the evals are running, but I'm OK with taking more time to investigate the results if you feel like there is a need
Hi @clefourrier , thanks for your reply. I don't have other models to recommend, since recent models don't seem to report wikitext results. Could you please look into the Gemma model? Its results on wikitext are abnormal, thank you.
Thanks a lot for your message! It seems to come from situations when the perplexity is supposed to be computed on something longer than the context length, I'm investigating.
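To illustrate the suspected failure mode: when a document is longer than the context window, the standard approach is to score it in consecutive windows and sum the log-likelihoods. Below is a minimal sketch with a toy uniform "model" (not lighteval's code; non-overlapping windows are a simplification, as real implementations often use overlapping windows with a stride so tokens early in each window still get context):

```python
import math

def rolling_logprob(token_ids, score_window, max_length):
    """Sum log-probs of a sequence longer than the context by scoring it
    in consecutive windows. `score_window` stands in for a model call that
    returns per-token log-probs for a window of at most `max_length` ids.
    A bug in this loop (e.g. silently truncating instead of iterating)
    would distort perplexity in exactly the way reported here."""
    total = 0.0
    for start in range(0, len(token_ids), max_length):
        window = token_ids[start : start + max_length]
        total += sum(score_window(window))
    return total

# Toy "model": every token has probability 1/16; context limit of 4 tokens.
uniform = lambda window: [math.log(1 / 16)] * len(window)
ids = list(range(10))  # 10 tokens -> 3 windows of at most 4

logprob = rolling_logprob(ids, uniform, max_length=4)
perplexity = math.exp(-logprob / len(ids))
print(perplexity)  # 16.0 for the uniform toy model, regardless of chunking
```

If truncation dropped tokens from the sum while the token count in the denominator stayed the same (or vice versa), the perplexity would blow up or collapse, which matches the gemma symptoms.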
Hi @clefourrier , is the code merged into the main branch? I used the latest version to test google/gemma-2b and gemma-7b on lighteval|wikitext:2|0|0 but still get weird results: for gemma-2b, word_perplexity is 65.32650153145742, while for gemma-7b it is 359323042.28705126.
Hi @yangfuwei , yes, it was.
Interesting bug on gemma-7b, I'll investigate again to see if I can reproduce, thanks!
Hi, I'm using lighteval to test several benchmarks, but I ran into issues with the following two.

When testing wikitext, I got this error:
```
Traceback (most recent call last):
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval_v100/lighteval/run_evals_accelerate.py", line 97, in <module>
    main(args)
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval_v100/lighteval/src/lighteval/logging/hierarchical_logger.py", line 144, in wrapper
    return fn(*args, **kwargs)
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval_v100/lighteval/src/lighteval/main_accelerate.py", line 71, in main
    requests, docs = create_requests_from_tasks(
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval_v100/lighteval/src/lighteval/tasks/lighteval_task.py", line 615, in create_requests_from_tasks
    reqs = task.construct_requests(doc, ctx, doc_id_seed, cur_task_name)
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval_v100/lighteval/src/lighteval/tasks/lighteval_task.py", line 380, in construct_requests
    LoglikelihoodRollingRequest(task_name=current_task_name, doc_id=document_id_seed, ctx=context)  # LoglikelihoodRollingRequest(task_name=current_task_name, example_index=document_id_seed, request_index=0, context=context)
TypeError: LoglikelihoodRollingRequest.__init__() got an unexpected keyword argument 'doc_id'
```
I modified https://github.com/huggingface/lighteval/blob/main/src/lighteval/tasks/lighteval_task.py#L380 to
```python
LoglikelihoodRollingRequest(task_name=current_task_name, example_index=document_id_seed, request_index=0, context=context)
```
but this ends up with NaN perplexity.

When running wikitext:103, I got an error like:
```
Traceback (most recent call last):
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval_v100/lighteval/run_evals_accelerate.py", line 97, in <module>
    main(args)
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval_v100/lighteval/src/lighteval/logging/hierarchical_logger.py", line 144, in wrapper
    return fn(*args, **kwargs)
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval_v100/lighteval/src/lighteval/main_accelerate.py", line 71, in main
    requests, docs = create_requests_from_tasks(
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval_v100/lighteval/src/lighteval/tasks/lighteval_task.py", line 563, in create_requests_from_tasks
    task_dict_items = [(name, task) for name, task in task_dict.items() if len(task.eval_docs()) > 0]
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval_v100/lighteval/src/lighteval/tasks/lighteval_task.py", line 563, in <listcomp>
    task_dict_items = [(name, task) for name, task in task_dict.items() if len(task.eval_docs()) > 0]
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval_v100/lighteval/src/lighteval/tasks/lighteval_task.py", line 296, in eval_docs
    self._docs = self._get_docs_from_split(self.evaluation_split)
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval_v100/lighteval/src/lighteval/tasks/lighteval_task.py", line 266, in _get_docs_from_split
    docs.extend(as_list(self.formatter(item, self.name)))
  File "/group/ossdphi_algo_scratch_04/fuweiy/LLM/eval_v100/lighteval/src/lighteval/tasks/tasks_prompt_formatting.py", line 2068, in wikitext_103
    return Doc(task_name=task_name, query=line["text"])
TypeError: Doc.__init__() missing 2 required positional arguments: 'choices' and 'gold_index'
```
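It looks like the `wikitext_103` formatter predates a `Doc` signature change. As a guess at the shape of a fix (using a minimal stand-in for `Doc` that mirrors only the signature implied by the traceback; `choices=[""]` and `gold_index=0` are placeholder assumptions for a perplexity task, not a confirmed patch):

```python
from dataclasses import dataclass

@dataclass
class Doc:
    # Minimal stand-in mirroring only what the traceback implies:
    # 'choices' and 'gold_index' are now required alongside the query.
    task_name: str
    query: str
    choices: list
    gold_index: int

def wikitext_103(line, task_name):
    # Perplexity tasks have no multiple-choice targets, so pass
    # placeholders (an assumption; the real fix in lighteval may differ).
    return Doc(task_name=task_name, query=line["text"], choices=[""], gold_index=0)

doc = wikitext_103({"text": "Some wikitext paragraph."}, "helm|wikitext:103")
print(doc.choices, doc.gold_index)  # [''] 0
```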
Commands I used are
```shell
python run_evals_accelerate.py --model_args="pretrained=gpt2" --tasks "lighteval|wikitext|0|0" --override_batch_size 1 --save_details --output_dir="tmp/"
```

and

```shell
python run_evals_accelerate.py --model_args="pretrained=gpt2" --tasks "helm|wikitext:103|0|0" --override_batch_size 1 --save_details --output_dir="tmp/"
```
Any advice on the issues? Thanks.