stanford-crfm / helm

Holistic Evaluation of Language Models (HELM), a framework to increase the transparency of language models (https://arxiv.org/abs/2211.09110). This framework is also used to evaluate text-to-image models in Holistic Evaluation of Text-to-Image Models (HEIM) (https://arxiv.org/abs/2311.04287).
https://crfm.stanford.edu/helm
Apache License 2.0

Fix the zero instance problem in DecodingTrust #2734

Closed danielz02 closed 3 weeks ago

danielz02 commented 4 weeks ago

Set `split=TEST_SPLIT` during evaluation instance generation so that all prompts are retained during filtering.
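The pattern behind this fix can be sketched as follows. The `Instance` class and filtering helper below are simplified stand-ins for HELM's actual scenario classes, assumed here for illustration only: instances whose `split` field is left unset do not match the evaluation split, so downstream filtering drops every prompt (the "zero instance" problem).

```python
from dataclasses import dataclass
from typing import List, Optional

# Simplified stand-in for HELM's split constant.
TEST_SPLIT = "test"


@dataclass
class Instance:
    """Minimal stand-in for a HELM scenario instance."""
    text: str
    split: Optional[str] = None  # previously left unset in the scenario


def keep_split(instances: List[Instance], split: str) -> List[Instance]:
    # Stand-in for downstream filtering, which keeps only instances
    # belonging to the requested split.
    return [inst for inst in instances if inst.split == split]


# Without split=TEST_SPLIT every instance is filtered out;
# setting it retains all prompts for evaluation.
broken = [Instance(text="prompt A"), Instance(text="prompt B")]
fixed = [Instance(text="prompt A", split=TEST_SPLIT),
         Instance(text="prompt B", split=TEST_SPLIT)]

print(len(keep_split(broken, TEST_SPLIT)))  # 0
print(len(keep_split(fixed, TEST_SPLIT)))   # 2
```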

yifanmai commented 4 weeks ago

Lint error - could you please fix?

--- /home/runner/work/helm/helm/src/helm/benchmark/scenarios/decodingtrust_stereotype_bias_scenario.py  2024-06-12 02:41:09.858994+00:00
+++ /home/runner/work/helm/helm/src/helm/benchmark/scenarios/decodingtrust_stereotype_bias_scenario.py  2024-06-12 02:46:03.230250+00:00
@@ -59,10 +59,10 @@
                         Reference(
                             Output(text=stereotype_topic_tag + " " + demographic_group_tag + " " + sys_prompt_type_tag),
                             tags=[stereotype_topic_tag, demographic_group_tag, sys_prompt_type_tag],
                         )
                     ],
-                    split=TEST_SPLIT
+                    split=TEST_SPLIT,
                 )
                 instances.append(instance)

         return instances