microsoft / LMOps

General technology for enabling AI capabilities w/ LLMs and MLLMs
https://aka.ms/GeneralAI
MIT License

[AdaptLLM] How to evaluate the models' performance? #223

Closed · yiyiwwang closed this 1 week ago

yiyiwwang commented 1 month ago

Dear authors, I'm reading your paper Adapting Large Language Models via Reading Comprehension, and I have a couple of questions.

Could you please explain how to evaluate your biomedicine/finance/law AdaptLLM models? I understand that PubMedQA can probably be evaluated with lm-evaluation-harness, but how should the other datasets, such as ChemProt and ConvFinQA, be evaluated?

Another question: there seems to be some repetition in the datasets. For example, the first three items in the ChemProt test set (screenshot omitted) look almost the same, although they are not identical. Are these repetitions, and do we need to remove them?

cdxeve commented 1 month ago

Hi,

Regarding the evaluation

You can implement the evaluation code using the lm-eval-harness framework. We have provided pre-templatized input instructions and output completions for each domain-specific task on Hugging Face (see the loading sketch after the links below):

- Biomedicine tasks
- Finance tasks
- Law tasks
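
For illustration, here is a minimal sketch of pulling one of the pre-templatized splits with the `datasets` library. The repo path and config name below (`AdaptLLM/medicine-tasks`, `ChemProt`) are placeholders; substitute the actual names from the task links above, and inspect one record first since the field names may differ.

```python
# Minimal sketch: load one pre-templatized task split from Hugging Face.
# Dataset path, config name, and split are assumptions -- replace them with
# the actual names from the task links above.
from datasets import load_dataset

tasks = load_dataset("AdaptLLM/medicine-tasks", "ChemProt", split="test")

# Each record should carry a templatized input and its gold completion;
# print one example to check the exact field names before writing metrics.
print(tasks[0])
```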

For multiple-choice tasks (including RCT, ChemProt, MQP, USMLE, PubMedQA, Headline, FPB, FiQA_SA, SCOTUS, CaseHold, UnfairToS), you can follow any multiple-choice task (e.g., SIQA) in lm-eval-harness. A helpful guideline is available here.
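
If it helps, here is a rough, lm-eval-harness-style sketch of how such a multiple-choice task is scored: compute the log-likelihood the model assigns to each candidate completion given the templatized prompt and take the argmax. The checkpoint name and the prompt/option handling are placeholders rather than our exact setup.

```python
# Sketch of multiple-choice scoring by option log-likelihood.
# "AdaptLLM/medicine-LLM" is a placeholder checkpoint name.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "AdaptLLM/medicine-LLM"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

@torch.no_grad()
def option_logprob(prompt: str, option: str) -> float:
    """Sum of token log-probs of `option` conditioned on `prompt`.

    Note: tokenizing prompt + option may not split exactly at the prompt
    boundary for every tokenizer; lm-eval-harness handles this more carefully.
    """
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    full_ids = tokenizer(prompt + option, return_tensors="pt").input_ids.to(model.device)
    log_probs = model(full_ids).logits.log_softmax(dim=-1)
    # Option tokens start at the prompt length; each token is predicted by the
    # position before it, hence the -1 shift on the logits slice.
    option_ids = full_ids[0, prompt_ids.shape[1]:]
    shifted = log_probs[0, prompt_ids.shape[1] - 1 : full_ids.shape[1] - 1]
    return shifted.gather(1, option_ids.unsqueeze(-1)).sum().item()

def predict(prompt: str, options: list[str]) -> int:
    """Return the index of the highest-scoring option."""
    scores = [option_logprob(prompt, opt) for opt in options]
    return max(range(len(options)), key=lambda i: scores[i])
```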

For text-completion tasks (including ConvFinQA and NER), you can follow an existing text-completion task such as SQuADv2.
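
Similarly, a hedged sketch for the text-completion setting: greedily generate a continuation from the templatized prompt and compare it to the gold completion. The checkpoint name and the plain exact-match metric below are placeholders; the actual tasks (e.g., NER) may use different metrics.

```python
# Sketch of text-completion evaluation: greedy generation + exact match.
# "AdaptLLM/finance-LLM" and the metric are placeholders, not our exact setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "AdaptLLM/finance-LLM"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

@torch.no_grad()
def generate_answer(prompt: str, max_new_tokens: int = 64) -> str:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Decode only the newly generated continuation, not the prompt tokens.
    return tokenizer.decode(
        output[0, inputs.input_ids.shape[1]:], skip_special_tokens=True
    )

def exact_match(prediction: str, gold: str) -> bool:
    """Whitespace-normalized, case-insensitive exact match."""
    return " ".join(prediction.split()).lower() == " ".join(gold.split()).lower()
```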

Regarding the repetition in the datasets

Thank you for your careful review! The raw ChemProt dataset we used is from the DAPT repository. We had not noticed this issue before, but the repetition is likely acceptable for our experiments. Therefore, you do not need to remove the repetitions.

Amireux0000 commented 2 weeks ago


Hi! Could you please share the full code you used to evaluate this model?

cdxeve commented 2 weeks ago

Hi, we are currently working on the next version of AdaptLLM at Instruction-Pretrain and plan to release the evaluation code with it. Please stay tuned.