stanford-crfm / helm

Holistic Evaluation of Language Models (HELM), a framework to increase the transparency of language models (https://arxiv.org/abs/2211.09110). This framework is also used to evaluate text-to-image models in Holistic Evaluation of Text-to-Image Models (HEIM) (https://arxiv.org/abs/2311.04287).
https://crfm.stanford.edu/helm
Apache License 2.0

How can I get efficiency metrics for my local Hugging Face model? (Llama 2) #2756

Closed · Mr-lonely0 closed this issue 2 weeks ago

Mr-lonely0 commented 2 weeks ago

I have run HELM's standard workflow with the original Llama-2. On the website generated by helm-server, I noticed that no efficiency metrics are recorded in the leaderboard (screenshot of the leaderboard attached).

However, I did find an Observed inference runtime (s) column in the Predictions section for the corresponding dataset (cnn_dailymail, screenshot attached).

If the former is expected, could you please clarify how I can obtain the efficiency metrics? If the latter is what I should rely on, why is there only one count in the Predictions section even though I set --max-eval-instances=100?

yifanmai commented 2 weeks ago

In general, we no longer support computing denoised runtime. You can still use observed inference runtime though.
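
If it helps, here is a minimal sketch for pulling the observed runtime stats out of a run's output directory. It assumes the default output layout (benchmark_output/runs/&lt;suite&gt;/&lt;run&gt;/) and the aggregated stats.json written by helm-run, and that the relevant metric name contains "runtime"; file names and metric names can differ between HELM versions, so treat this as illustrative rather than exact.

```bash
# Illustrative only: assumes the default benchmark_output layout and requires jq.
find benchmark_output/runs -name stats.json | while read -r f; do
  echo "== $f"
  # Print any aggregated stat whose metric name mentions "runtime"
  # (e.g. the observed inference runtime).
  jq '[.[] | select((.name.name // "") | test("runtime"))]' "$f"
done
```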

> why is there only one count in the Predictions section

The count is the number of "train trials", i.e. the number of times the evaluation is run, each with a different selection of in-context learning examples chosen using a different random seed. This can be set using --num-train-trials and defaults to 1. The value you see is the mean over the 100 instances in your single trial; an example invocation is sketched below.
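
For reference, a helm-run invocation with multiple train trials might look like the sketch below. The run entry and model reference are placeholders, and depending on your HELM version the flag may be --run-specs rather than --run-entries; only --max-eval-instances and --num-train-trials are the flags discussed here.

```bash
# Placeholder run entry and model reference; substitute the ones you are evaluating.
helm-run \
  --run-entries "summarization_cnndm:model=meta-llama/Llama-2-7b-hf" \
  --suite my-suite \
  --max-eval-instances 100 \
  --num-train-trials 3
```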

Mr-lonely0 commented 2 weeks ago

Thanks for your helpful reply!