Closed yiyiwwang closed 5 months ago
Hi,
Regarding the evaluation
You can implement the evaluation code using the lm-eval-harness framework. We have provided pre-templatized input instructions and output completions for each domain-specific task on Hugging Face:
- Biomedicine tasks
- Finance tasks
- Law tasks
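For illustration, here is a minimal sketch (not the official code) of loading one of these pre-templatized task sets with the `datasets` library. The repository ID, subset name, and split below are assumptions, so please check the linked dataset pages for the exact identifiers.

```python
# Minimal sketch: load a pre-templatized task split from the Hugging Face Hub.
# NOTE: the repository ID ("AdaptLLM/finance-tasks"), subset ("ConvFinQA"), and
# split ("test") are assumptions for illustration; use the exact names from the
# dataset pages linked above.
from datasets import load_dataset

task = load_dataset("AdaptLLM/finance-tasks", "ConvFinQA", split="test")
print(task[0])  # each item holds the templatized input instruction and the expected output completion
```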
For multiple-choice tasks (including RCT, ChemProt, MQP, USMLE, PubMedQA, Headline, FPB, FiQA_SA, SCOTUS, CaseHold, UnfairToS), you can follow any multiple-choice task (e.g., SIQA) in lm-eval-harness. A helpful guideline is available here.
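For reference, below is a minimal sketch (not the official evaluation code) of what a multiple-choice evaluation in lm-eval-harness boils down to: score each candidate answer by the log-likelihood the model assigns to it given the templatized prompt, and predict the highest-scoring option. The model name, prompt, and options are placeholders.

```python
# Minimal sketch of multiple-choice scoring: pick the option with the highest
# summed log-likelihood given the prompt. Model name, prompt, and options are
# placeholders for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "AdaptLLM/medicine-LLM"  # placeholder; substitute the AdaptLLM checkpoint under test
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
model.eval()

def option_loglikelihood(context: str, continuation: str) -> float:
    """Sum of log-probabilities the model assigns to `continuation` given `context`."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids.to(device)
    full_ids = tokenizer(context + continuation, return_tensors="pt").input_ids.to(device)
    cont_len = full_ids.shape[1] - ctx_ids.shape[1]  # approximate continuation length in tokens
    with torch.no_grad():
        logits = model(full_ids).logits  # [1, seq_len, vocab]
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)  # position i predicts token i+1
    targets = full_ids[0, -cont_len:]                     # tokens of the continuation
    return logprobs[-cont_len:].gather(1, targets.unsqueeze(1)).sum().item()

# A templatized multiple-choice instance (hypothetical content)
prompt = "Question: ...\nAnswer:"
options = [" option A", " option B", " option C"]  # a leading space helps tokenization line up
scores = [option_loglikelihood(prompt, opt) for opt in options]
prediction = options[scores.index(max(scores))]
```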
For text completion tasks (including ConvFinQA, NER), you can follow the example of text completion tasks like SQuADv2.
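Similarly, a minimal, self-contained sketch (again, not the official code) for completion-style tasks: generate greedily from the templatized prompt and compare the output against the reference completion, e.g., with exact match. The model name, prompt, and reference answer are placeholders.

```python
# Minimal sketch for completion-style scoring: greedy generation followed by an
# exact-match comparison against the reference. Model name, prompt, and reference
# are placeholders for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "AdaptLLM/finance-LLM"  # placeholder; substitute the AdaptLLM checkpoint under test
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
model.eval()

def generate_completion(prompt: str, max_new_tokens: int = 64) -> str:
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # decode only the newly generated continuation, not the prompt itself
    new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

prediction = generate_completion("Context: ...\nQuestion: ...\nAnswer:")  # placeholder prompt
reference = "reference answer"                                            # placeholder gold completion
exact_match = int(prediction.strip().lower() == reference.strip().lower())
```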
Regarding the repetition in the datasets
Thank you for your careful review! The raw ChemProt dataset we used is from the DAPT repository. We had not noticed this issue before, but the repetition is likely acceptable for our experiments. Therefore, you do not need to remove the repetitions.
Dear authors, I'm reading your paper Adapting Large Language Models via Reading Comprehension, and I have a few questions.
Could you please tell me how to evaluate your Biomedicine/Finance/Law AdaptLLM models? I know I can probably evaluate the PubMedQA benchmark with lm-evaluation-harness, but how do I evaluate the other datasets, such as ChemProt, ConvFinQA, and so on?
Another question: there seems to be some repetition in the datasets. For example, the first three items in the ChemProt test set appear to have nearly identical contents, although they are not exactly the same. Are they repetitions? Do we need to remove them?
Hi! Could you please share the full code you used to evaluate this model?
Hi, we are currently working on the next version of AdaptLLM at Instruction-Pretrain and plan to release the evaluation code with it. Please stay tuned.