stanford-crfm / helm

Holistic Evaluation of Language Models (HELM), a framework to increase the transparency of language models (https://arxiv.org/abs/2211.09110). This framework is also used to evaluate text-to-image models in Holistic Evaluation of Text-to-Image Models (HEIM) (https://arxiv.org/abs/2311.04287).
https://crfm.stanford.edu/helm
Apache License 2.0

Unexpected behaviors of export_scenario_text.py #1662

Closed YianZhang closed 11 months ago

YianZhang commented 1 year ago

Apart from the newly introduced instruction-following scenarios (which are expected), the light_scenarios exported by the new export_scenario_text.py script differ from our old version in several other ways.

The differences include:

  1. ICE is missing
  2. MultiLexSum is missing
  3. NewsQA is missing
  4. Efficiency and robustness scenarios are not removed.
  5. There may be others; please check.

See the diff below:

< {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.anthropic_hh_rlhf_scenario.AnthropicHHRLHFScenario', 'args': {'subset': 'hh'}}, 'split': 'test'}}
< {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.anthropic_hh_rlhf_scenario.AnthropicHHRLHFScenario', 'args': {'subset': 'hh'}}, 'split': 'train'}}
< {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.anthropic_hh_rlhf_scenario.AnthropicHHRLHFScenario', 'args': {'subset': 'red_team'}}, 'split': 'test'}}
89,90d85
< {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.boolq_scenario.BoolQScenario', 'args': {'only_contrast': 'True'}}, 'split': 'train'}}
< {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.boolq_scenario.BoolQScenario', 'args': {'only_contrast': 'True'}}, 'split': 'valid'}}
160d154
< {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.grammar_scenario.GrammarScenario', 'args': {'path': 'src/helm/benchmark/scenarios/best_chatgpt_prompts.yaml', 'tags': ''}}, 'split': 'test'}}
163,164c157,198
< {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.imdb_scenario.IMDBScenario', 'args': {'only_contrast': 'True'}}, 'split': 'train'}}
< {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.imdb_scenario.IMDBScenario', 'args': {'only_contrast': 'True'}}, 'split': 'valid'}}
---
> {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.ice_scenario.ICEScenario', 'args': {'category': 'S'}}, 'split': 'test'}}
> {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.ice_scenario.ICEScenario', 'args': {'category': 'W'}}, 'split': 'test'}}
> {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.ice_scenario.ICEScenario', 'args': {'gender': 'female'}}, 'split': 'test'}}
> {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.ice_scenario.ICEScenario', 'args': {'gender': 'male'}}, 'split': 'test'}}
> {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.ice_scenario.ICEScenario', 'args': {'subset': 'can', 'category': 'S1'}}, 'split': 'test'}}
> {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.ice_scenario.ICEScenario', 'args': {'subset': 'can', 'category': 'S2'}}, 'split': 'test'}}
> {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.ice_scenario.ICEScenario', 'args': {'subset': 'can', 'category': 'W1'}}, 'split': 'test'}}
> {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.ice_scenario.ICEScenario', 'args': {'subset': 'can', 'category': 'W2'}}, 'split': 'test'}}
> {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.ice_scenario.ICEScenario', 'args': {'subset': 'can'}}, 'split': 'test'}}
> {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.ice_scenario.ICEScenario', 'args': {'subset': 'ea', 'category': 'S1'}}, 'split': 'test'}}
> {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.ice_scenario.ICEScenario', 'args': {'subset': 'ea', 'category': 'S2'}}, 'split': 'test'}}
> {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.ice_scenario.ICEScenario', 'args': {'subset': 'ea', 'category': 'W1'}}, 'split': 'test'}}
> {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.ice_scenario.ICEScenario', 'args': {'subset': 'ea', 'category': 'W2'}}, 'split': 'test'}}
> {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.ice_scenario.ICEScenario', 'args': {'subset': 'ea'}}, 'split': 'test'}}
> {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.ice_scenario.ICEScenario', 'args': {'subset': 'hk', 'category': 'S1'}}, 'split': 'test'}}
> {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.ice_scenario.ICEScenario', 'args': {'subset': 'hk', 'category': 'S2'}}, 'split': 'test'}}
> {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.ice_scenario.ICEScenario', 'args': {'subset': 'hk', 'category': 'W1'}}, 'split': 'test'}}
> {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.ice_scenario.ICEScenario', 'args': {'subset': 'hk', 'category': 'W2'}}, 'split': 'test'}}
> {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.ice_scenario.ICEScenario', 'args': {'subset': 'hk'}}, 'split': 'test'}}
> {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.ice_scenario.ICEScenario', 'args': {'subset': 'ind', 'category': 'S1'}}, 'split': 'test'}}
> {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.ice_scenario.ICEScenario', 'args': {'subset': 'ind', 'category': 'S2'}}, 'split': 'test'}}
> {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.ice_scenario.ICEScenario', 'args': {'subset': 'ind', 'category': 'W1'}}, 'split': 'test'}}
> {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.ice_scenario.ICEScenario', 'args': {'subset': 'ind', 'category': 'W2'}}, 'split': 'test'}}
> {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.ice_scenario.ICEScenario', 'args': {'subset': 'ind'}}, 'split': 'test'}}
> {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.ice_scenario.ICEScenario', 'args': {'subset': 'ja', 'category': 'S1'}}, 'split': 'test'}}
> {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.ice_scenario.ICEScenario', 'args': {'subset': 'ja', 'category': 'S2'}}, 'split': 'test'}}
> {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.ice_scenario.ICEScenario', 'args': {'subset': 'ja', 'category': 'W1'}}, 'split': 'test'}}
> {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.ice_scenario.ICEScenario', 'args': {'subset': 'ja', 'category': 'W2'}}, 'split': 'test'}}
> {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.ice_scenario.ICEScenario', 'args': {'subset': 'ja'}}, 'split': 'test'}}
> {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.ice_scenario.ICEScenario', 'args': {'subset': 'phi', 'category': 'S1'}}, 'split': 'test'}}
> {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.ice_scenario.ICEScenario', 'args': {'subset': 'phi', 'category': 'S2'}}, 'split': 'test'}}
> {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.ice_scenario.ICEScenario', 'args': {'subset': 'phi', 'category': 'W1'}}, 'split': 'test'}}
> {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.ice_scenario.ICEScenario', 'args': {'subset': 'phi', 'category': 'W2'}}, 'split': 'test'}}
> {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.ice_scenario.ICEScenario', 'args': {'subset': 'phi'}}, 'split': 'test'}}
> {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.ice_scenario.ICEScenario', 'args': {'subset': 'sin', 'category': 'S1'}}, 'split': 'test'}}
> {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.ice_scenario.ICEScenario', 'args': {'subset': 'sin', 'category': 'S2'}}, 'split': 'test'}}
> {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.ice_scenario.ICEScenario', 'args': {'subset': 'sin', 'category': 'W1'}}, 'split': 'test'}}
> {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.ice_scenario.ICEScenario', 'args': {'subset': 'sin', 'category': 'W2'}}, 'split': 'test'}}
> {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.ice_scenario.ICEScenario', 'args': {'subset': 'sin'}}, 'split': 'test'}}
> {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.ice_scenario.ICEScenario', 'args': {'subset': 'usa', 'category': 'W1'}}, 'split': 'test'}}
> {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.ice_scenario.ICEScenario', 'args': {'subset': 'usa', 'category': 'W2'}}, 'split': 'test'}}
> {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.ice_scenario.ICEScenario', 'args': {'subset': 'usa'}}, 'split': 'test'}}
167d200
< {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.koala_scenario.KoalaScenario', 'args': {}}, 'split': 'test'}}
172a206,208
> {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.legal_summarization_scenario.LegalSummarizationScenario', 'args': {'dataset_name': 'MultiLexSum', 'sampling_min_length': 100, 'sampling_max_length': 400, 'doc_max_length': 1024}}, 'split': 'test'}}
> {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.legal_summarization_scenario.LegalSummarizationScenario', 'args': {'dataset_name': 'MultiLexSum', 'sampling_min_length': 100, 'sampling_max_length': 400, 'doc_max_length': 1024}}, 'split': 'train'}}
> {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.legal_summarization_scenario.LegalSummarizationScenario', 'args': {'dataset_name': 'MultiLexSum', 'sampling_min_length': 100, 'sampling_max_length': 400, 'doc_max_length': 1024}}, 'split': 'valid'}}
536,537c572,573
< {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.open_assistant_scenario.OpenAssistantScenario', 'args': {'language': 'en'}}, 'split': 'train'}}
< {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.open_assistant_scenario.OpenAssistantScenario', 'args': {'language': 'en'}}, 'split': 'valid'}}
---
> {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.newsqa_scenario.NewsQAScenario', 'args': {}}, 'split': 'train'}}
> {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.newsqa_scenario.NewsQAScenario', 'args': {}}, 'split': 'valid'}}
563d598
< {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.self_instruct_scenario.SelfInstructScenario', 'args': {}}, 'split': 'test'}}
570,629d604
< {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.synthetic_efficiency_scenario.SyntheticEfficiencyScenario', 'args': {'num_prompt_tokens': 1, 'num_instances': 10, 'tokenizer': 'ai21/j1'}}, 'split': 'test'}}
< {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.synthetic_efficiency_scenario.SyntheticEfficiencyScenario', 'args': {'num_prompt_tokens': 1, 'num_instances': 10, 'tokenizer': 'bigscience/bloom'}}, 'split': 'test'}}
< {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.synthetic_efficiency_scenario.SyntheticEfficiencyScenario', 'args': {'num_prompt_tokens': 1, 'num_instances': 10, 'tokenizer': 'bigscience/t0pp'}}, 'split': 'test'}}
< {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.synthetic_efficiency_scenario.SyntheticEfficiencyScenario', 'args': {'num_prompt_tokens': 1, 'num_instances': 10, 'tokenizer': 'cohere/cohere'}}, 'split': 'test'}}
< {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.synthetic_efficiency_scenario.SyntheticEfficiencyScenario', 'args': {'num_prompt_tokens': 1, 'num_instances': 10, 'tokenizer': 'eleutherai/gptj'}}, 'split': 'test'}}
< {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.synthetic_efficiency_scenario.SyntheticEfficiencyScenario', 'args': {'num_prompt_tokens': 1, 'num_instances': 10, 'tokenizer': 'eleutherai/gptneox'}}, 'split': 'test'}}
< {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.synthetic_efficiency_scenario.SyntheticEfficiencyScenario', 'args': {'num_prompt_tokens': 1, 'num_instances': 10, 'tokenizer': 'google/t5'}}, 'split': 'test'}}
< {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.synthetic_efficiency_scenario.SyntheticEfficiencyScenario', 'args': {'num_prompt_tokens': 1, 'num_instances': 10, 'tokenizer': 'google/ul2'}}, 'split': 'test'}}
< {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.synthetic_efficiency_scenario.SyntheticEfficiencyScenario', 'args': {'num_prompt_tokens': 1, 'num_instances': 10, 'tokenizer': 'huggingface/gpt2'}}, 'split': 'test'}}
< {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.synthetic_efficiency_scenario.SyntheticEfficiencyScenario', 'args': {'num_prompt_tokens': 1, 'num_instances': 10, 'tokenizer': 'meta/opt'}}, 'split': 'test'}}
< {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.synthetic_efficiency_scenario.SyntheticEfficiencyScenario', 'args': {'num_prompt_tokens': 1, 'num_instances': 10, 'tokenizer': 'tsinghua/glm'}}, 'split': 'test'}}
< {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.synthetic_efficiency_scenario.SyntheticEfficiencyScenario', 'args': {'num_prompt_tokens': 1, 'num_instances': 10, 'tokenizer': 'yandex/yalm'}}, 'split': 'test'}}
< {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.synthetic_efficiency_scenario.SyntheticEfficiencyScenario', 'args': {'num_prompt_tokens': 1024, 'num_instances': 10, 'tokenizer': 'ai21/j1'}}, 'split': 'test'}}
< {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.synthetic_efficiency_scenario.SyntheticEfficiencyScenario', 'args': {'num_prompt_tokens': 1024, 'num_instances': 10, 'tokenizer': 'bigscience/bloom'}}, 'split': 'test'}}
< {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.synthetic_efficiency_scenario.SyntheticEfficiencyScenario', 'args': {'num_prompt_tokens': 1024, 'num_instances': 10, 'tokenizer': 'bigscience/t0pp'}}, 'split': 'test'}}
< {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.synthetic_efficiency_scenario.SyntheticEfficiencyScenario', 'args': {'num_prompt_tokens': 1024, 'num_instances': 10, 'tokenizer': 'cohere/cohere'}}, 'split': 'test'}}
< {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.synthetic_efficiency_scenario.SyntheticEfficiencyScenario', 'args': {'num_prompt_tokens': 1024, 'num_instances': 10, 'tokenizer': 'eleutherai/gptj'}}, 'split': 'test'}}
< {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.synthetic_efficiency_scenario.SyntheticEfficiencyScenario', 'args': {'num_prompt_tokens': 1024, 'num_instances': 10, 'tokenizer': 'eleutherai/gptneox'}}, 'split': 'test'}}
< {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.synthetic_efficiency_scenario.SyntheticEfficiencyScenario', 'args': {'num_prompt_tokens': 1024, 'num_instances': 10, 'tokenizer': 'google/t5'}}, 'split': 'test'}}
< {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.synthetic_efficiency_scenario.SyntheticEfficiencyScenario', 'args': {'num_prompt_tokens': 1024, 'num_instances': 10, 'tokenizer': 'google/ul2'}}, 'split': 'test'}}
< {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.synthetic_efficiency_scenario.SyntheticEfficiencyScenario', 'args': {'num_prompt_tokens': 1024, 'num_instances': 10, 'tokenizer': 'huggingface/gpt2'}}, 'split': 'test'}}
< {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.synthetic_efficiency_scenario.SyntheticEfficiencyScenario', 'args': {'num_prompt_tokens': 1024, 'num_instances': 10, 'tokenizer': 'meta/opt'}}, 'split': 'test'}}
< {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.synthetic_efficiency_scenario.SyntheticEfficiencyScenario', 'args': {'num_prompt_tokens': 1024, 'num_instances': 10, 'tokenizer': 'tsinghua/glm'}}, 'split': 'test'}}
< {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.synthetic_efficiency_scenario.SyntheticEfficiencyScenario', 'args': {'num_prompt_tokens': 1024, 'num_instances': 10, 'tokenizer': 'yandex/yalm'}}, 'split': 'test'}}
< {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.synthetic_efficiency_scenario.SyntheticEfficiencyScenario', 'args': {'num_prompt_tokens': 1536, 'num_instances': 10, 'tokenizer': 'ai21/j1'}}, 'split': 'test'}}
< {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.synthetic_efficiency_scenario.SyntheticEfficiencyScenario', 'args': {'num_prompt_tokens': 1536, 'num_instances': 10, 'tokenizer': 'bigscience/bloom'}}, 'split': 'test'}}
< {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.synthetic_efficiency_scenario.SyntheticEfficiencyScenario', 'args': {'num_prompt_tokens': 1536, 'num_instances': 10, 'tokenizer': 'bigscience/t0pp'}}, 'split': 'test'}}
< {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.synthetic_efficiency_scenario.SyntheticEfficiencyScenario', 'args': {'num_prompt_tokens': 1536, 'num_instances': 10, 'tokenizer': 'cohere/cohere'}}, 'split': 'test'}}
< {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.synthetic_efficiency_scenario.SyntheticEfficiencyScenario', 'args': {'num_prompt_tokens': 1536, 'num_instances': 10, 'tokenizer': 'eleutherai/gptj'}}, 'split': 'test'}}
< {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.synthetic_efficiency_scenario.SyntheticEfficiencyScenario', 'args': {'num_prompt_tokens': 1536, 'num_instances': 10, 'tokenizer': 'eleutherai/gptneox'}}, 'split': 'test'}}
< {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.synthetic_efficiency_scenario.SyntheticEfficiencyScenario', 'args': {'num_prompt_tokens': 1536, 'num_instances': 10, 'tokenizer': 'google/t5'}}, 'split': 'test'}}
< {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.synthetic_efficiency_scenario.SyntheticEfficiencyScenario', 'args': {'num_prompt_tokens': 1536, 'num_instances': 10, 'tokenizer': 'google/ul2'}}, 'split': 'test'}}
< {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.synthetic_efficiency_scenario.SyntheticEfficiencyScenario', 'args': {'num_prompt_tokens': 1536, 'num_instances': 10, 'tokenizer': 'huggingface/gpt2'}}, 'split': 'test'}}
< {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.synthetic_efficiency_scenario.SyntheticEfficiencyScenario', 'args': {'num_prompt_tokens': 1536, 'num_instances': 10, 'tokenizer': 'meta/opt'}}, 'split': 'test'}}
< {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.synthetic_efficiency_scenario.SyntheticEfficiencyScenario', 'args': {'num_prompt_tokens': 1536, 'num_instances': 10, 'tokenizer': 'tsinghua/glm'}}, 'split': 'test'}}
< {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.synthetic_efficiency_scenario.SyntheticEfficiencyScenario', 'args': {'num_prompt_tokens': 1536, 'num_instances': 10, 'tokenizer': 'yandex/yalm'}}, 'split': 'test'}}
< {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.synthetic_efficiency_scenario.SyntheticEfficiencyScenario', 'args': {'num_prompt_tokens': 256, 'num_instances': 10, 'tokenizer': 'ai21/j1'}}, 'split': 'test'}}
< {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.synthetic_efficiency_scenario.SyntheticEfficiencyScenario', 'args': {'num_prompt_tokens': 256, 'num_instances': 10, 'tokenizer': 'bigscience/bloom'}}, 'split': 'test'}}
< {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.synthetic_efficiency_scenario.SyntheticEfficiencyScenario', 'args': {'num_prompt_tokens': 256, 'num_instances': 10, 'tokenizer': 'bigscience/t0pp'}}, 'split': 'test'}}
< {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.synthetic_efficiency_scenario.SyntheticEfficiencyScenario', 'args': {'num_prompt_tokens': 256, 'num_instances': 10, 'tokenizer': 'cohere/cohere'}}, 'split': 'test'}}
< {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.synthetic_efficiency_scenario.SyntheticEfficiencyScenario', 'args': {'num_prompt_tokens': 256, 'num_instances': 10, 'tokenizer': 'eleutherai/gptj'}}, 'split': 'test'}}
< {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.synthetic_efficiency_scenario.SyntheticEfficiencyScenario', 'args': {'num_prompt_tokens': 256, 'num_instances': 10, 'tokenizer': 'eleutherai/gptneox'}}, 'split': 'test'}}
< {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.synthetic_efficiency_scenario.SyntheticEfficiencyScenario', 'args': {'num_prompt_tokens': 256, 'num_instances': 10, 'tokenizer': 'google/t5'}}, 'split': 'test'}}
< {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.synthetic_efficiency_scenario.SyntheticEfficiencyScenario', 'args': {'num_prompt_tokens': 256, 'num_instances': 10, 'tokenizer': 'google/ul2'}}, 'split': 'test'}}
< {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.synthetic_efficiency_scenario.SyntheticEfficiencyScenario', 'args': {'num_prompt_tokens': 256, 'num_instances': 10, 'tokenizer': 'huggingface/gpt2'}}, 'split': 'test'}}
< {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.synthetic_efficiency_scenario.SyntheticEfficiencyScenario', 'args': {'num_prompt_tokens': 256, 'num_instances': 10, 'tokenizer': 'meta/opt'}}, 'split': 'test'}}
< {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.synthetic_efficiency_scenario.SyntheticEfficiencyScenario', 'args': {'num_prompt_tokens': 256, 'num_instances': 10, 'tokenizer': 'tsinghua/glm'}}, 'split': 'test'}}
< {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.synthetic_efficiency_scenario.SyntheticEfficiencyScenario', 'args': {'num_prompt_tokens': 256, 'num_instances': 10, 'tokenizer': 'yandex/yalm'}}, 'split': 'test'}}
< {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.synthetic_efficiency_scenario.SyntheticEfficiencyScenario', 'args': {'num_prompt_tokens': 512, 'num_instances': 10, 'tokenizer': 'ai21/j1'}}, 'split': 'test'}}
< {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.synthetic_efficiency_scenario.SyntheticEfficiencyScenario', 'args': {'num_prompt_tokens': 512, 'num_instances': 10, 'tokenizer': 'bigscience/bloom'}}, 'split': 'test'}}
< {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.synthetic_efficiency_scenario.SyntheticEfficiencyScenario', 'args': {'num_prompt_tokens': 512, 'num_instances': 10, 'tokenizer': 'bigscience/t0pp'}}, 'split': 'test'}}
< {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.synthetic_efficiency_scenario.SyntheticEfficiencyScenario', 'args': {'num_prompt_tokens': 512, 'num_instances': 10, 'tokenizer': 'cohere/cohere'}}, 'split': 'test'}}
< {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.synthetic_efficiency_scenario.SyntheticEfficiencyScenario', 'args': {'num_prompt_tokens': 512, 'num_instances': 10, 'tokenizer': 'eleutherai/gptj'}}, 'split': 'test'}}
< {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.synthetic_efficiency_scenario.SyntheticEfficiencyScenario', 'args': {'num_prompt_tokens': 512, 'num_instances': 10, 'tokenizer': 'eleutherai/gptneox'}}, 'split': 'test'}}
< {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.synthetic_efficiency_scenario.SyntheticEfficiencyScenario', 'args': {'num_prompt_tokens': 512, 'num_instances': 10, 'tokenizer': 'google/t5'}}, 'split': 'test'}}
< {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.synthetic_efficiency_scenario.SyntheticEfficiencyScenario', 'args': {'num_prompt_tokens': 512, 'num_instances': 10, 'tokenizer': 'google/ul2'}}, 'split': 'test'}}
< {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.synthetic_efficiency_scenario.SyntheticEfficiencyScenario', 'args': {'num_prompt_tokens': 512, 'num_instances': 10, 'tokenizer': 'huggingface/gpt2'}}, 'split': 'test'}}
< {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.synthetic_efficiency_scenario.SyntheticEfficiencyScenario', 'args': {'num_prompt_tokens': 512, 'num_instances': 10, 'tokenizer': 'meta/opt'}}, 'split': 'test'}}
< {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.synthetic_efficiency_scenario.SyntheticEfficiencyScenario', 'args': {'num_prompt_tokens': 512, 'num_instances': 10, 'tokenizer': 'tsinghua/glm'}}, 'split': 'test'}}
< {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.synthetic_efficiency_scenario.SyntheticEfficiencyScenario', 'args': {'num_prompt_tokens': 512, 'num_instances': 10, 'tokenizer': 'yandex/yalm'}}, 'split': 'test'}}
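
A diff of this size is easier to review when it is tallied per scenario class. The following is a minimal sketch, not part of HELM: `summarize_diff` and `sample` are hypothetical names, and it assumes that every `<` or `>` line carries a Python dict literal shaped like the ones above. Judging by the list of differences, `<` lines appear to come from the new export and `>` lines from the old one, but that is an inference from the issue text.

```python
import ast
from collections import Counter
from typing import Dict


def summarize_diff(diff_text: str) -> Dict[str, Counter]:
    """Count diff lines per scenario class, split by side ('<' or '>')."""
    counts: Dict[str, Counter] = {"<": Counter(), ">": Counter()}
    for line in diff_text.splitlines():
        side, _, payload = line.partition(" ")
        if side not in counts:
            continue  # skips hunk headers such as "89,90d85" and "---" separators
        # The payload is a Python dict repr (single quotes), so parse it
        # with ast.literal_eval rather than json.loads.
        record = ast.literal_eval(payload)
        class_name = record["metadata"]["scenario_spec"]["class_name"]
        counts[side][class_name.rsplit(".", 1)[-1]] += 1
    return counts


# Three hunks copied from the diff above.
sample = """\
89,90d85
< {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.boolq_scenario.BoolQScenario', 'args': {'only_contrast': 'True'}}, 'split': 'train'}}
< {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.boolq_scenario.BoolQScenario', 'args': {'only_contrast': 'True'}}, 'split': 'valid'}}
167d200
< {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.koala_scenario.KoalaScenario', 'args': {}}, 'split': 'test'}}
536,537c572,573
< {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.open_assistant_scenario.OpenAssistantScenario', 'args': {'language': 'en'}}, 'split': 'train'}}
---
> {'metadata': {'scenario_spec': {'class_name': 'helm.benchmark.scenarios.newsqa_scenario.NewsQAScenario', 'args': {}}, 'split': 'train'}}
"""

counts = summarize_diff(sample)
for side, label in (("<", "only in first file"), (">", "only in second file")):
    for name, n in sorted(counts[side].items()):
        print(f"{label}: {name} x{n}")
```

Running the same function over the full diff would give one count per scenario class on each side, which makes it quicker to confirm the items listed above and to spot any that were not listed.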

YianZhang commented 1 year ago

@andyzorigin