pan-x-c / EE-LLM

EE-LLM is a framework for large-scale training and inference of early-exit (EE) large language models (LLMs).

[QUESTION] How can I convert checkpoint tuned by EE-Tuning to Huggingface format? #15

Open Mr-lonely0 opened 3 months ago

Mr-lonely0 commented 3 months ago

I have fine-tuned the llama-7b model using EE-Tuning, and I now need to convert the checkpoint to the Hugging Face format to proceed with the evaluation process. How should I do this?

pan-x-c commented 3 months ago

Same as #10 and #7. There is currently no way to convert the checkpoint to the Hugging Face format.

Mr-lonely0 commented 3 months ago

Thanks for the information!

I am also curious about how I can reproduce the results demonstrated in your paper and perform the downstream evaluation on the HELM benchmark. Could you please provide more details on this?

pan-x-c commented 3 months ago

We modified the MegatronClient to add EE-LLM-related parameters; all other parts are inherited directly from HELM.
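As a very rough sketch of what that kind of client change looks like (the early-exit payload keys below are illustrative placeholders, not the exact EE-LLM API; check megatron_client.py for the real fields):

    # Illustrative sketch only: a HELM-style client forwarding early-exit
    # settings to an inference server over HTTP. The early-exit keys
    # ("use_early_exit", "early_exit_thres") are hypothetical placeholders.
    import requests

    def query_server(url: str, prompt: str, max_new_tokens: int = 128) -> dict:
        payload = {
            "prompts": [prompt],
            "tokens_to_generate": max_new_tokens,
            "use_early_exit": True,    # placeholder: enable early exit
            "early_exit_thres": 0.8,   # placeholder: confidence threshold for exiting
        }
        return requests.post(url, json=payload, timeout=600).json()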

Mr-lonely0 commented 3 months ago

Actually, I'm not familiar with HELM. Could you provide a demo or some guidance on how to use the MegatronClient script?

pan-x-c commented 3 months ago

You can refer to the demo in data-juicer. Note that HELM itself is a heavy evaluation framework, and its installation and usage can be difficult; you may need to consult the official HELM repository for help.

Mr-lonely0 commented 3 months ago

Really appreciate your help! I'll check out the demo you mentioned and give it a try. Thanks again for your time!

Mr-lonely0 commented 3 months ago

Hello again!

I have tried the evaluation framework provided in data-juicer and obtained some benchmark results, such as ROUGE-2 on CNN/DM, F1 on NarrativeQA, and EM on MMLU. However, I'm confused about how to get efficiency results, such as the inference time over the generation process.

What should I modify in mymodel_example.yaml to parse the corresponding metric from the HELM output?

I would greatly appreciate your help and look forward to your prompt response.

pan-x-c commented 3 months ago

If you use the HELM provided by Data-Juicer, you can modify src/helm/benchmark/static/schema.yaml to adjust the metrics. For example, we modified the efficiency item to:

  - name: efficiency
    display_name: Efficiency
    metrics:
    - name: inference_runtime
      split: ${main_split}

inference_runtime is the metric used in our paper.

You also need to modify your megatron_client.py so that the new metric is returned in the response. For example:

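        # request_time in the response below is what HELM aggregates into the inference_runtime metric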
        return RequestResult(
            success=True,
            cached=cached,
            request_time=response['request_time'],
            request_datetime=response['request_datetime'],
            completions=completions,
            embedding=[]
        )

HELM will use the request_time field to calculate the inference_runtime metric.
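As a rough sketch of where request_time can come from (not the exact EE-LLM client code; the function name and HTTP details below are assumptions), the client can simply time the call to the inference server and attach the elapsed seconds to the response before building the RequestResult:

    # Rough sketch, assuming the inference server is queried over HTTP.
    # The measured wall-clock latency is stored as request_time, which HELM
    # aggregates into the inference_runtime metric.
    import time
    import requests

    def timed_request(url: str, payload: dict) -> dict:
        start = time.time()
        response = requests.post(url, json=payload, timeout=600).json()
        response["request_time"] = time.time() - start  # seconds spent on this request
        response["request_datetime"] = int(start)       # timestamp recorded alongside it
        return response

The returned dict then supplies the request_time and request_datetime fields of the RequestResult shown above.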

Note that the demo script provided by Data-Juicer is not intended for EE models; it only records some metrics for pretraining. To view the full evaluation results, follow the standard HELM workflow, e.g. run helm-summarize and then helm-server.

Mr-lonely0 commented 3 months ago

Thank you!

I have tested the standard usage process of HELM with the original llama-2. On the website generated by helm-server, I noticed that no efficiency metrics are recorded in the leaderboard presented by HELM (screenshot attached).

However, I did find Observed inference runtime (s) in the Predictions section for the corresponding dataset (cnn_dailymail; screenshot attached).

If the leaderboard is where these metrics should appear, could you clarify how I can obtain them? Alternatively, if the Predictions section is the right place, why does it show only one count when I set --max-eval-instances=100?

pan-x-c commented 2 months ago

Your client must return those metrics in the response so that HELM can summarize them, so you need to modify your client first, as shown in my previous comment. For example, HELM uses the request_time field in the response to calculate the inference_runtime metric.

In the experiments for our paper, we set --max-eval-instances to 500.

Mr-lonely0 commented 2 months ago

Thanks! I have figured it out. Really appreciate your time!

github-actions[bot] commented 4 weeks ago

Marking as stale. No activity in 60 days.