Open Mr-lonely0 opened 3 months ago
Same as #10 and #7. There is currently no way to convert the checkpoint to the Hugging Face format.
Thanks for the information!
I am also curious about how I can reproduce the results demonstrated in your paper and perform the downstream evaluation on the HELM benchmark. Could you please provide more details on this?
We modified the MegatronClient, adding parameters related to EE-LLM. All other parts are directly inherited from HELM.
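For a rough idea of the overall shape, a client of this kind looks roughly like the sketch below. This is only an illustration: the early-exit parameter names (use_early_exit, early_exit_thres), the server URL, and the response format are placeholders rather than the exact EE-LLM interface, and import paths may differ across HELM versions.

    # Minimal sketch of an EE-LLM-aware HELM client, assuming the Megatron
    # text-generation server is reachable over HTTP. Parameter names, URL,
    # and response fields are illustrative placeholders.
    import requests

    from helm.common.request import Request, RequestResult, Sequence


    class MegatronClient:  # a real client would subclass HELM's Client base class
        def __init__(self, base_url: str = "http://localhost:5000/api"):
            self.base_url = base_url

        def make_request(self, request: Request) -> RequestResult:
            payload = {
                "prompts": [request.prompt],
                "tokens_to_generate": request.max_tokens,
                "temperature": request.temperature,
                "top_p": request.top_p,
                # Hypothetical EE-LLM additions: enable early exit and set the
                # confidence threshold used by the early-exit heads.
                "use_early_exit": True,
                "early_exit_thres": 0.8,
            }
            response = requests.post(self.base_url, json=payload).json()
            completions = [Sequence(text=t, logprob=0, tokens=[]) for t in response["text"]]
            return RequestResult(
                success=True,
                cached=False,
                request_time=response.get("request_time"),  # latency, if the server reports one
                completions=completions,
                embedding=[],
            )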
Actually, I'm not familiar with HELM. Could you provide a demo or some guidance on how to use the MegatronClient script?
You can refer to the demo in data-juicer. Note that HELM itself is a heavy evaluation framework, and its installation and usage can be tricky; you may need to go to the official HELM repository for help.
I really appreciate your help! I'll check the demo you mentioned and give it a try. Thanks again for your time!
Hello again!
I have tried the evaluation framework provided in data-juicer and obtained some benchmark results, such as ROUGE-2 on CNN/DM, F1 on NarrativeQA, and EM on MMLU. However, I'm confused about how to obtain efficiency results, such as the inference time over the generation process.
What should I modify in mymodel_example.yaml to parse the corresponding metric from the HELM output?
I would greatly appreciate your help and look forward to your prompt response.
If you use the HELM provided by Data-Juicer, you can modify src/helm/benchmark/static/schema.yaml to adjust the metrics. For example, we modified the efficiency item to:

    - name: efficiency
      display_name: Efficiency
      metrics:
        - name: inference_runtime
          split: ${main_split}

The inference_runtime metric is the one used in our paper.
In addition, you also need to modify your megatron_client.py to return the new metric in your response. For example:

    return RequestResult(
        success=True,
        cached=cached,
        request_time=response['request_time'],
        request_datetime=response['request_datetime'],
        completions=completions,
        embedding=[]
    )
HELM will use the request_time field to calculate the inference_runtime metric.
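If your generation server does not already report this latency, one simple option is to measure it on the client side, e.g. with a helper method along the following lines. This is only a sketch: the _send_request name and self.base_url are placeholders, not the actual EE-LLM code.

    import time

    import requests


    def _send_request(self, payload: dict) -> dict:
        # Time the call to the generation server and store the elapsed seconds
        # under the request_time key that the return statement above reads;
        # HELM turns this field into the inference_runtime metric
        # ("Observed inference runtime (s)" in the web UI).
        start = time.time()
        response = requests.post(self.base_url, json=payload).json()
        response["request_time"] = time.time() - start
        response["request_datetime"] = int(start)
        return response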
Note that the demo script provided by Data-Juicer is not for EE models; it only records some metrics for pretraining. To view the full evaluation results, you should follow the standard usage process of HELM, e.g. running helm-server after helm-summarize.
Thank you!
I have tested the standard usage process of HELM with the original llama-2. On the website generated by helm-server, I noticed that there are no efficiency metrics recorded in the leaderboard presented by HELM.
However, I did find the Observed inference runtime (s) in the Predictions section for the corresponding dataset (cnn_dailymail, as shown below).
Could you please clarify how I can obtain the efficiency metrics if the former is correct? Alternatively, if the latter is correct, why is there only one count in the Predictions section when I set --max-eval-instances=100?
Your client must return those metrics in its response before HELM can summarize them, so you need to modify your client first, as shown in my previous comment.
For example, HELM will use the request_time field in the response to calculate the inference_runtime metric.
As for the instance count: in our paper's experiments, we set --max-eval-instances to 500.
Thanks! I have figured it out. I really appreciate your time!
Marking as stale. No activity in 60 days.
I have fine-tuned the llama-7b model using EE-Tuning, and I now need to convert the checkpoint to the Hugging Face format to proceed with the evaluation process. How should I do this?