stanford-crfm / helm

Holistic Evaluation of Language Models (HELM), a framework to increase the transparency of language models (https://arxiv.org/abs/2211.09110). This framework is also used to evaluate text-to-image models in HEIM (https://arxiv.org/abs/2311.04287) and vision-language models in VHELM (https://arxiv.org/abs/2410.07112).
https://crfm.stanford.edu/helm
Apache License 2.0

Unable to run evaluation for med_QA dataset #2598

Closed: richardzhuang0412 closed this issue 3 months ago

richardzhuang0412 commented 6 months ago

I was testing evaluation using this code:

(screenshot of the helm-run command; the full command is reproduced in the log below)

However, this error occurs: "helm.benchmark.runner.RunnerError: Failed runs: ["med_qa:model=NousResearch_Meta-Llama-3-8B"]". I was able to run the evaluation for GSM8K using the same command with "med_qa" replaced by "gsm". Did I do something wrong?

yifanmai commented 6 months ago

Could you provide the complete logs from your run (as a shared file, a file attachment or GitHub Gist)?

richardzhuang0412 commented 6 months ago

Here is the log:

/data/richard/helm (main) » helm-run \
  --run-entries med_qa:model=NousResearch/Meta-Llama-3-8B \
  --enable-huggingface-models NousResearch/Meta-Llama-3-8B \
  --suite v1 \
  --max-eval-instances 10

main {
  Reading tokenizer configs from /data/tianhao/miniconda3/envs/crfm-helm/lib/python3.8/site-packages/helm/config/tokenizer_configs.yaml...
  Reading model deployments from /data/tianhao/miniconda3/envs/crfm-helm/lib/python3.8/site-packages/helm/config/model_deployments.yaml...
  Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
  Registered default metadata for model NousResearch/Meta-Llama-3-8B
  1 entries produced 1 run specs
  run_specs {
    RunSpec(name='med_qa:model=NousResearch_Meta-Llama-3-8B', scenario_spec=ScenarioSpec(class_name='helm.benchmark.scenarios.med_qa_scenario.MedQAScenario', args={}), adapter_spec=AdapterSpec(method='multiple_choice_joint', global_prefix='', global_suffix='', instructions='The following are multiple choice questions (with answers) about medicine.\n', input_prefix='Question: ', input_suffix='\n', reference_prefix='A. ', reference_suffix='\n', output_prefix='Answer: ', output_suffix='\n', instance_prefix='\n', substitutions=[], max_train_instances=5, max_eval_instances=10, num_outputs=5, num_train_trials=1, num_trials=1, sample_train=True, model_deployment='NousResearch/Meta-Llama-3-8B', model='NousResearch/Meta-Llama-3-8B', temperature=0.0, max_tokens=1, stop_sequences=['\n'], random=None, multi_label=False, image_generation_parameters=None, eval_splits=None), metric_specs=[MetricSpec(class_name='helm.benchmark.metrics.basic_metrics.BasicGenerationMetric', args={'names': ['exact_match', 'quasi_exact_match', 'prefix_exact_match', 'quasi_prefix_exact_match']}), MetricSpec(class_name='helm.benchmark.metrics.basic_metrics.BasicReferenceMetric', args={}), MetricSpec(class_name='helm.benchmark.metrics.basic_metrics.InstancesPerSplitMetric', args={})], data_augmenter_spec=DataAugmenterSpec(perturbation_specs=[], should_augment_train_instances=False, should_include_original_train=False, should_skip_unchanged_train=False, should_augment_eval_instances=False, should_include_original_eval=False, should_skip_unchanged_eval=False, seeds_per_instance=1), groups=['med_qa'], annotators=None)
  } [0.0s]
  Running in local mode with base path: prod_env
  Looking in path: prod_env
  AutoTokenizer: cache_backend_config = SqliteCacheBackendConfig(path='prod_env/cache')
  AutoClient: file_storage_path = prod_env/cache
  AutoClient: cache_backend_config = SqliteCacheBackendConfig(path='prod_env/cache')
  AutoTokenizer: cache_backend_config = SqliteCacheBackendConfig(path='prod_env/cache')
  Found 1 account(s).
  Looking in path: prod_env
  AnnotatorFactory: file_storage_path = prod_env/cache
  AnnotatorFactory: cache_backend_config = SqliteCacheBackendConfig(path='prod_env/cache')
  0%| | 0/1 [00:00<?, ?it/s]
  Running med_qa:model=NousResearch_Meta-Llama-3-8B {
    scenario.get_instances {
      ensure_file_downloaded {
      } [0.0s]
    } [0.0s]
  } [0.002s]
  Error when running med_qa:model=NousResearch_Meta-Llama-3-8B:
  Traceback (most recent call last):
    File "/data/tianhao/miniconda3/envs/crfm-helm/lib/python3.8/site-packages/helm/common/general.py", line 89, in ensure_file_downloaded
      import gdown  # noqa
  ModuleNotFoundError: No module named 'gdown'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/data/tianhao/miniconda3/envs/crfm-helm/lib/python3.8/site-packages/helm/benchmark/runner.py", line 216, in run_all
    self.run_one(run_spec)
  File "/data/tianhao/miniconda3/envs/crfm-helm/lib/python3.8/site-packages/helm/benchmark/runner.py", line 255, in run_one
    instances = scenario.get_instances(scenario_output_path)
  File "/data/tianhao/miniconda3/envs/crfm-helm/lib/python3.8/site-packages/helm/benchmark/scenarios/med_qa_scenario.py", line 63, in get_instances
    ensure_file_downloaded(
  File "/data/tianhao/miniconda3/envs/crfm-helm/lib/python3.8/site-packages/helm/common/hierarchical_logger.py", line 104, in wrapper
    return fn(*args, **kwargs)
  File "/data/tianhao/miniconda3/envs/crfm-helm/lib/python3.8/site-packages/helm/common/general.py", line 91, in ensure_file_downloaded
    handle_module_not_found_error(e, ["scenarios"])
  File "/data/tianhao/miniconda3/envs/crfm-helm/lib/python3.8/site-packages/helm/common/optional_dependencies.py", line 14, in handle_module_not_found_error
    raise OptionalDependencyNotInstalled(
helm.common.optional_dependencies.OptionalDependencyNotInstalled: Optional dependency gdown is not installed. Please run pip install crfm-helm[scenarios] or pip install crfm-helm[all] to install it.

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 62.08it/s]
} [7.004s]
Traceback (most recent call last):
  File "/data/tianhao/miniconda3/envs/crfm-helm/bin/helm-run", line 8, in <module>
    sys.exit(main())
  File "/data/tianhao/miniconda3/envs/crfm-helm/lib/python3.8/site-packages/helm/common/hierarchical_logger.py", line 104, in wrapper
    return fn(*args, **kwargs)
  File "/data/tianhao/miniconda3/envs/crfm-helm/lib/python3.8/site-packages/helm/benchmark/run.py", line 321, in main
    run_benchmarking(
  File "/data/tianhao/miniconda3/envs/crfm-helm/lib/python3.8/site-packages/helm/benchmark/run.py", line 125, in run_benchmarking
    runner.run_all(run_specs)
  File "/data/tianhao/miniconda3/envs/crfm-helm/lib/python3.8/site-packages/helm/benchmark/runner.py", line 225, in run_all
    raise RunnerError(f"Failed runs: [{failed_runs_str}]")
helm.benchmark.runner.RunnerError: Failed runs: ["med_qa:model=NousResearch_Meta-Llama-3-8B"]

It seems to be a dependency problem, but I was not able to run either of the suggested commands, pip install crfm-helm[scenarios] or pip install crfm-helm[all].
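
For context, the traceback above shows how HELM guards optional dependencies: ensure_file_downloaded imports gdown lazily, and if the import fails it re-raises the error with a message naming the pip extra that provides the missing package. A minimal sketch of that pattern (the argument names and surrounding logic here are approximations, not the actual helm source):

from helm.common.optional_dependencies import OptionalDependencyNotInstalled

def ensure_file_downloaded(source_url: str, target_path: str) -> None:
    try:
        # gdown is only needed for scenarios whose data lives on Google Drive,
        # so it is imported lazily rather than at module load time.
        import gdown  # noqa
    except ModuleNotFoundError as e:
        # Re-raise with a message that names the pip extra ("scenarios")
        # which would have installed the missing package.
        raise OptionalDependencyNotInstalled(
            f"Optional dependency {e.name} is not installed. "
            "Please run pip install crfm-helm[scenarios] or pip install crfm-helm[all] to install it."
        ) from e
    # ... actual download logic follows ...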

richardzhuang0412 commented 6 months ago

I tried pip install gdown and it seems to work, so I guess the problem is solved. But could you let me know how pip install crfm-helm[scenarios] or pip install crfm-helm[all] works?

yifanmai commented 6 months ago

Could you provide the logs from running pip install crfm-helm[scenarios] in your shell with your conda environment activated? I would expect that command to just work.

richardzhuang0412 commented 6 months ago

(screenshot of the pip install output from a zsh shell)

yifanmai commented 6 months ago

For zsh, could you try instead running:

pip install 'crfm-helm[scenarios]'

(with the single quotes)
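
zsh treats unquoted square brackets as glob characters, so crfm-helm[scenarios] gets expanded as a filename pattern (and fails with a "no matches found" error when nothing matches) before pip ever sees it; quoting the argument passes the extra through to pip literally. The same quoting applies to the all extra:

pip install 'crfm-helm[scenarios]'
pip install 'crfm-helm[all]'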

richardzhuang0412 commented 6 months ago

Oh yes that is working. Thank you so much!

richardzhuang0412 commented 6 months ago

Hi Yifan,

Do you know what I should do if I want to increase evaluation speed by running parallel inference on multiple GPUs?

For example, this is the command I am using right now:

helm-run \
  --run-entries boolq:model=NousResearch/Meta-Llama-3-8B \
  --enable-huggingface-models NousResearch/Meta-Llama-3-8B \
  --suite v1 \
  --max-eval-instances 10

Also, I am unable to run larger models such as Llama-3-70B for now. Even if I specify CUDA_AVAILABLE_DEVICES=0,1,2,3,4,5,6,7, it still gives an out-of-memory error on CUDA device 0.

yifanmai commented 3 months ago

In general, we don't support parallel inference in HELM. Sorry about that.
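
A possible workaround, not something HELM documents or supports: because helm-run takes an arbitrary list of run entries, you could in principle launch independent helm-run processes, each pinned to one GPU via CUDA_VISIBLE_DEVICES and given a disjoint subset of run entries. Whether the shared prod_env caches tolerate concurrent processes is an assumption here, so treat this as a sketch only:

CUDA_VISIBLE_DEVICES=0 helm-run \
  --run-entries boolq:model=NousResearch/Meta-Llama-3-8B \
  --enable-huggingface-models NousResearch/Meta-Llama-3-8B \
  --suite v1 --max-eval-instances 10 &

CUDA_VISIBLE_DEVICES=1 helm-run \
  --run-entries med_qa:model=NousResearch/Meta-Llama-3-8B \
  --enable-huggingface-models NousResearch/Meta-Llama-3-8B \
  --suite v1 --max-eval-instances 10 &

wait

Note that this only parallelizes across run entries; it does not shard a single model across GPUs, so it would not by itself fix the Llama-3-70B out-of-memory error.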