Closed: Mr-lonely0 closed this issue 2 weeks ago
In general, we no longer support computing denoised runtime. You can still use observed inference runtime, though.
Why is there only one count in the Predictions section?
The count comes from the number of "train trials", i.e. the number of times the evaluation is run, each with a different selection of in-context learning examples chosen using a different random seed. This can be set with `--num-train-trials` and defaults to 1. The value you see is the mean over the 100 instances in your single trial.
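For intuition, the aggregation described above can be sketched as follows. This is an illustrative toy, not HELM's actual implementation; the variable names and the uniform runtimes are made up:

```python
import random
import statistics

# Illustrative sketch (NOT HELM's code): with --num-train-trials=1 and
# --max-eval-instances=100, each reported metric is the mean over the
# 100 per-instance measurements from that single trial, so only one
# count/mean appears in the results.
num_train_trials = 1     # HELM's default
max_eval_instances = 100

random.seed(0)
trial_means = []
for trial in range(num_train_trials):
    # Hypothetical per-instance measurements (e.g. observed runtime in seconds).
    per_instance = [random.uniform(0.5, 2.0) for _ in range(max_eval_instances)]
    trial_means.append(statistics.mean(per_instance))

# One trial -> one aggregated value to display.
print(len(trial_means))  # 1
```

Running more trials via `--num-train-trials` would produce one mean per trial, which HELM then aggregates further.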
Thanks for your helpful reply!
I have tested the standard HELM usage workflow with the original Llama-2. On the website generated by helm-server, I noticed that there are no efficiency metrics recorded in the leaderboard:

![image](https://github.com/pan-x-c/EE-LLM/assets/67233215/3d39a00e-dd1f-47f9-93fb-81bec1c23567)

However, I did find `Observed inference runtime (s)` in the Predictions section for the corresponding dataset (cnn_dailymail, shown below):

![image](https://github.com/pan-x-c/EE-LLM/assets/67233215/995716a3-448f-4f35-a76e-588d6653e7f2)

If the former is correct, could you please clarify how I can obtain the efficiency metrics? Alternatively, if the latter is correct, why is there only one count in the Predictions section when I set `--max-eval-instances=100`?