stanford-crfm / helm

Holistic Evaluation of Language Models (HELM), a framework to increase the transparency of language models (https://arxiv.org/abs/2211.09110). This framework is also used to evaluate text-to-image models in HEIM (https://arxiv.org/abs/2311.04287) and vision-language models in VHELM (https://arxiv.org/abs/2410.07112).
https://crfm.stanford.edu/helm
Apache License 2.0

`write_run_display_json` failed with KeyError: 'mean' #1263

Closed teetone closed 1 year ago

teetone commented 1 year ago

Traceback (most recent call last):
  File "/u/nlp/anaconda/main/anaconda3/envs/crfm_benchmarking/bin/helm-summarize", line 8, in <module>
    sys.exit(main())
  File "/juice2/scr2/nlp/crfm/benchmarking/benchmarking/src/helm/common/hierarchical_logger.py", line 104, in wrapper
    return fn(*args, **kwargs)
  File "/juice2/scr2/nlp/crfm/benchmarking/benchmarking/src/helm/benchmark/presentation/summarize.py", line 973, in main
    summarizer.write_run_display_json()
  File "/juice2/scr2/nlp/crfm/benchmarking/benchmarking/src/helm/benchmark/presentation/summarize.py", line 932, in write_run_display_json
    parallel_map(process, self.runs, parallelism=self.num_threads)
  File "/juice2/scr2/nlp/crfm/benchmarking/benchmarking/src/helm/common/general.py", line 203, in parallel_map
    results = list(tqdm(executor.map(process, items), total=len(items)))
  File "/u/nlp/anaconda/main/anaconda3/envs/crfm_benchmarking/lib/python3.8/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/u/nlp/anaconda/main/anaconda3/envs/crfm_benchmarking/lib/python3.8/concurrent/futures/_base.py", line 619, in result_iterator
    yield fs.pop().result()
  File "/u/nlp/anaconda/main/anaconda3/envs/crfm_benchmarking/lib/python3.8/concurrent/futures/_base.py", line 437, in result
    return self.__get_result()
  File "/u/nlp/anaconda/main/anaconda3/envs/crfm_benchmarking/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
  File "/u/nlp/anaconda/main/anaconda3/envs/crfm_benchmarking/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/juice2/scr2/nlp/crfm/benchmarking/benchmarking/src/helm/benchmark/presentation/summarize.py", line 930, in process
    write_run_display_json(run.run_path, run.run_spec, self.schema)
  File "/juice2/scr2/nlp/crfm/benchmarking/benchmarking/src/helm/common/hierarchical_logger.py", line 104, in wrapper
    return fn(*args, **kwargs)
  File "/juice2/scr2/nlp/crfm/benchmarking/benchmarking/src/helm/benchmark/presentation/run_display.py", line 209, in write_run_display_json
    stats_dict: Dict[str, float] = {
  File "/juice2/scr2/nlp/crfm/benchmarking/benchmarking/src/helm/benchmark/presentation/run_display.py", line 210, in <dictcomp>
    original_stat["name"]["name"]: cast(float, original_stat["mean"])
KeyError: 'mean'
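
The traceback ends in a dict comprehension that indexes `original_stat["mean"]` for every stat. `typing.cast` does nothing at runtime, so the `KeyError` comes from the subscript itself. Below is a minimal sketch of a more defensive version, assuming the failing entries simply have no `"mean"` key (for example, because no values were recorded for that stat). The helper name `build_stats_dict` and the sample data are hypothetical; this is not necessarily how the issue was fixed in HELM.

```python
from typing import Any, Dict, List


def build_stats_dict(original_stats: List[Dict[str, Any]]) -> Dict[str, float]:
    """Map each stat's name to its mean, skipping stats with no "mean" key.

    Assumption: each stat is a dict shaped like {"name": {"name": ...}, "mean": ...},
    mirroring the subscripts shown in the traceback.
    """
    return {
        stat["name"]["name"]: float(stat["mean"])
        for stat in original_stats
        if "mean" in stat  # skip instead of raising KeyError
    }


# Hypothetical input: the second stat has no "mean" entry.
example_stats = [
    {"name": {"name": "exact_match"}, "count": 1, "mean": 0.5},
    {"name": {"name": "inference_runtime"}, "count": 0},
]
stats_dict = build_stats_dict(example_stats)
print(stats_dict)
```

Skipping silently hides the missing values, so logging the skipped stat names may be preferable if they are expected to be present.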

teetone commented 1 year ago

I only see it fail for synthetic_efficiency:

                  write_run_display_json failed for synthetic_efficiency:random=None,model=anthropic_stanford-online-all-v4-s3,tokenizer=huggingface_gpt2,num_prompt_tokens=1,num_output_tokens=2: 'mean'
                  write_run_display_json {
Processing synthetic_efficiency:random=None,model=anthropic_stanford-online-all-v4-s3,tokenizer=huggingface_gpt2,num_prompt_tokens=256,num_output_tokens=2
                  } [0.054s]
                  write_run_display_json failed for synthetic_efficiency:random=None,model=anthropic_stanford-online-all-v4-s3,tokenizer=huggingface_gpt2,num_prompt_tokens=1,num_output_tokens=64: 'mean'
                  write_run_display_json {
Processing synthetic_efficiency:random=None,model=anthropic_stanford-online-all-v4-s3,tokenizer=huggingface_gpt2,num_prompt_tokens=256,num_output_tokens=4
                  } [0.018s]
                  write_run_display_json failed for synthetic_efficiency:random=None,model=ai21_j1-large,tokenizer=ai21_j1,num_prompt_tokens=1536,num_output_tokens=64: 'mean'
                  write_run_display_json {
Processing synthetic_efficiency:random=None,model=anthropic_stanford-online-all-v4-s3,tokenizer=huggingface_gpt2,num_prompt_tokens=256,num_output_tokens=8
                  } [0.02s]
                  write_run_display_json failed for synthetic_efficiency:random=None,model=anthropic_stanford-online-all-v4-s3,tokenizer=huggingface_gpt2,num_prompt_tokens=256,num_output_tokens=1: 'mean'
                  write_run_display_json {
Processing synthetic_efficiency:random=None,model=anthropic_stanford-online-all-v4-s3,tokenizer=huggingface_gpt2,num_prompt_tokens=256,num_output_tokens=16
                  } [0.004s]
                  write_run_display_json failed for synthetic_efficiency:random=None,model=anthropic_stanford-online-all-v4-s3,tokenizer=huggingface_gpt2,num_prompt_tokens=1,num_output_tokens=4: 'mean'
                  write_run_display_json {
 63%|██████▎   | 3466/5517 [3:33:01<01:16, 26.74it/s]
Processing synthetic_efficiency:random=None,model=anthropic_stanford-online-all-v4-s3,tokenizer=huggingface_gpt2,num_prompt_tokens=256,num_output_tokens=32
                  } [0.034s]
                } [0.254s]
                write_run_display_json failed for synthetic_efficiency:random=None,model=anthropic_stanford-online-all-v4-s3,tokenizer=huggingface_gpt2,num_prompt_tokens=1,num_output_tokens=16: 'mean'
                write_run_display_json failed for synthetic_efficiency:random=None,model=anthropic_stanford-online-all-v4-s3,tokenizer=huggingface_gpt2,num_prompt_tokens=1,num_output_tokens=8: 'mean'
                write_run_display_json {
              write_run_display_json {
              } [0.303s]
Processing synthetic_efficiency:random=None,model=anthropic_stanford-online-all-v4-s3,tokenizer=huggingface_gpt2,num_prompt_tokens=256,num_output_tokens=64
              } [0.01s]
              } [0.019s]
              write_run_display_json failed for synthetic_efficiency:random=None,model=anthropic_stanford-online-all-v4-s3,tokenizer=huggingface_gpt2,num_prompt_tokens=256,num_output_tokens=4: 'mean'
              write_run_display_json failed for synthetic_efficiency:random=None,model=anthropic_stanford-online-all-v4-s3,tokenizer=huggingface_gpt2,num_prompt_tokens=256,num_output_tokens=2: 'mean'
              write_run_display_json {
Processing synthetic_efficiency:random=None,model=anthropic_stanford-online-all-v4-s3,tokenizer=huggingface_gpt2,num_prompt_tokens=512,num_output_tokens=1
                write_run_display_json failed for synthetic_efficiency:random=None,model=anthropic_stanford-online-all-v4-s3,tokenizer=huggingface_gpt2,num_prompt_tokens=1,num_output_tokens=32: 'mean'
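
To check the claim that only `synthetic_efficiency` runs fail, the failure lines in a log like the one above can be tallied by scenario. This is a rough sketch: the regular expression is inferred from the line format shown in this issue, and the function name `count_failures_by_scenario` is made up for illustration.

```python
import re
from collections import Counter
from typing import Iterable

# Matches lines such as:
#   write_run_display_json failed for <run_name>: 'mean'
FAILED_RE = re.compile(r"write_run_display_json failed for (?P<run>\S+): ")


def count_failures_by_scenario(lines: Iterable[str]) -> Counter:
    """Count failure lines per scenario (the part of the run name before the first ':')."""
    counts: Counter = Counter()
    for line in lines:
        match = FAILED_RE.search(line)
        if match:
            scenario = match.group("run").split(":", 1)[0]
            counts[scenario] += 1
    return counts


# Sample lines copied from the log in this issue.
sample_lines = [
    "  write_run_display_json failed for synthetic_efficiency:random=None,model=ai21_j1-large,tokenizer=ai21_j1,num_prompt_tokens=1536,num_output_tokens=64: 'mean'",
    "  write_run_display_json {",
    "Processing synthetic_efficiency:random=None,model=anthropic_stanford-online-all-v4-s3,tokenizer=huggingface_gpt2,num_prompt_tokens=256,num_output_tokens=8",
    "  write_run_display_json failed for synthetic_efficiency:random=None,model=anthropic_stanford-online-all-v4-s3,tokenizer=huggingface_gpt2,num_prompt_tokens=1,num_output_tokens=4: 'mean'",
]
failure_counts = count_failures_by_scenario(sample_lines)
print(failure_counts)
```

Running this over the full log file (for example, by passing an open file handle as `lines`) would show whether any scenario other than `synthetic_efficiency` appears.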