Hi @vlievin. Glad that you find our work interesting.
Regarding your question, it depends on the domain of the downstream dataset you plan to do your research on.
For example, our paper evaluates our models on the BioASQ dataset. This dataset is extracted from PubMed abstracts, which contain more medical terms and are more domain-constrained than PubMed full articles. That's why both BioM-ELECTRA and BioM-ALBERT perform very well on BioASQ. On the other hand, BioM-BERT and BioM-ALBERT-PMC show lower performance since these two models were pre-trained on PMC full articles, which may be considered out-of-domain corpora for BioASQ (PubMed abstracts). Although all these models follow the transfer learning concept and should work on different downstream tasks, the out-of-domain issue still exists.
I just had a quick look at both MedQA and MedMCQA, and it seems that MedQA uses clinical records as the source of its dataset. Facebook Bio-LM uses a clinical dataset in one of its variations (see this link https://aclanthology.org/attachments/2020.clinicalnlp-1.17.OptionalSupplementaryMaterial.pdf). I suggest adding this model to your list of evaluations: PM + M3 (MIMIC-III) + Voc, base or large. I am not sure what the size of the MIMIC-III clinical dataset is, but if it's over 15GB, you may want to try pre-training new models on MIMIC-III alone and see how they perform on downstream clinical tasks such as MedQA. If it's below 15GB, it may not be enough to create a language model that performs well on downstream tasks. I think you also need to check another model called ClinicalBERT.
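If it helps, something like the following lets you load these candidates from the Hugging Face hub and compare them quickly. I am not sure of the exact hub identifier for the Bio-LM PM + M3 + Voc checkpoint, so treat that name as a placeholder and substitute whatever the Bio-LM authors actually publish:

# Minimal sketch: loading candidate clinical/biomedical checkpoints for a quick comparison.
# The Bio-LM identifier below is a placeholder; replace it with the real hub name.
from transformers import AutoTokenizer, AutoModel

candidates = [
    "facebook/bio-lm-pm-m3-voc",        # placeholder id for Bio-LM PM + M3 (MIMIC-III) + Voc
    "emilyalsentzer/Bio_ClinicalBERT",  # a commonly used ClinicalBERT checkpoint
]

for name in candidates:
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name)
    print(name, model.config.hidden_size, model.config.num_hidden_layers)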
Regarding your question, I think it would be better to design your experimental setup around BioM-ELECTRA and PM + M3 + Voc. In our latest participation at BioASQ10 2022, BioM-ELECTRA (UDEL-3 and UDEL-lab4) showed better performance. Results are here: http://participants-area.bioasq.org/results/10b/phaseB/. Sort the results by MRR, which is the official metric for Factoid. BioM-ELECTRA also has lower fine-tuning and inference time (0.3x) since it has a smaller hidden size (1024) compared to ALBERT (4096).
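To make the size difference concrete, you can inspect the configurations directly. The hub identifiers below should match our released checkpoints, but verify them on https://huggingface.co/models if anything fails to load:

# Sketch: compare hidden sizes of BioM-ELECTRA-Large and BioM-ALBERT-xxlarge.
from transformers import AutoConfig

electra_cfg = AutoConfig.from_pretrained("sultan/BioM-ELECTRA-Large-Discriminator")
albert_cfg = AutoConfig.from_pretrained("sultan/BioM-ALBERT-xxlarge")

print("BioM-ELECTRA-Large hidden size:", electra_cfg.hidden_size)  # expected 1024
print("BioM-ALBERT-xxlarge hidden size:", albert_cfg.hidden_size)  # expected 4096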
Also, I think you need to add the ELECTRA model, which was pre-trained on general English corpora, to your baselines. These models may perform better than biomedical language models if the datasets used to create MedQA and MedMCQA are closer to the general domain. It's better to think of out-of-domain and in-domain as characteristics of the dataset rather than of which domain (e.g. biomedical, finance) we intuitively think it belongs to. For example, both the ALBERT and T5 papers show that performance on the SQuAD dataset decreased when the XLNet pre-training data was introduced, even though the XLNet data uses formal English (a collection of news articles) and increased the corpus from ~18GB (Wikipedia + Books) to 120GB. This is because the SQuAD dataset comes mainly from Wikipedia, so introducing an out-of-domain dataset negatively impacts performance. See page 8 of the ALBERT paper: https://arxiv.org/pdf/1909.11942.pdf.
Regarding your last point, you can work with PyTorch XLA to fine-tune all models posted here: https://huggingface.co/models. PyTorch XLA uses the exact same torch code that we use to fine-tune our models on GPU and allows us to use it with a TPU. We show an example of how to use PyTorch XLA on a text classification problem; for QA, use this script with PyTorch XLA: https://github.com/huggingface/transformers/blob/main/examples/pytorch/question-answering/run_qa.py. Make sure to take the average of 5 different runs to smooth out the fluctuation between runs; if a dataset is small, take the average of ten different runs. Setting the seed in the torch code may not stabilize the results since it still uses the TPU and XLA graphs. I strongly suggest using the original implementations of ELECTRA and ALBERT if you can, since they have a layer-wise learning rate decay feature that improves results on QA tasks. Make sure to fine-tune any biomedical model on SQuAD first, then on BioASQ, and finally on MedQA and MedMCQA. You may prefer to use the BioASQ10 training set released this year since it is larger. However, don't evaluate a model trained on BioASQ10B against the BioASQ9B test dataset, because I think the BioASQ team adds each year's test set to the next year's training set.
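For the run averaging, a simple loop like this is enough; train_and_evaluate here is a hypothetical wrapper around whatever fine-tuning routine you end up using (e.g. the Trainer inside run_qa.py), returning the metric you care about:

# Sketch of averaging a metric over several seeds to smooth out run-to-run fluctuation.
# `train_and_evaluate` is a hypothetical placeholder for your fine-tuning + evaluation code.
import statistics
from transformers import set_seed

def train_and_evaluate(seed: int) -> float:
    raise NotImplementedError  # plug in your fine-tuning and evaluation here

scores = []
for seed in range(5):   # use 10 seeds for small datasets
    set_seed(seed)      # note: seeding alone may not fully stabilize TPU/XLA runs
    scores.append(train_and_evaluate(seed))

print(f"mean={statistics.mean(scores):.4f}  stdev={statistics.stdev(scores):.4f}")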
You can also get access to free TPUs from Google using this link: https://sites.research.google/trc/about/. These TPUs are TPUv3-8s, which are better than the TPUv2-8 that Google Colab uses, since they have 128GB of memory against 64GB for the TPUv2-8. Also check the NVIDIA hardware grant page here: https://mynvidia.force.com/HardwareGrant/s/Application. They may give you access to free GPUs.
Thank you, Sultan
P.S. I just created a new example that shows how to fine-tune BioM-ELECTRA-Large on the SQuAD dataset with PyTorch XLA code on a TPU: https://github.com/salrowili/BioM-Transformers/blob/main/examples/Fine_Tuning_BioM_Transformers_on_SQuAD_on_TPU_with_PyTorch_XLA.ipynb. This example is much easier to implement than the native TensorFlow example here: https://colab.research.google.com/github/salrowili/BioM-Transformers/blob/main/examples/Example_of_SQuAD2_0_and_BioASQ7B_tasks_with_BioM_ELECTRA_Large_on_TPU.ipynb, since it does not require the Google Cloud Storage bucket part. However, the native TensorFlow example has a layer-wise decay feature which improves the score a little bit.
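Once you have a SQuAD2-fine-tuned checkpoint (or if you load ours from the hub; the identifier below should be correct, but verify it on https://huggingface.co/models), running extractive QA is a one-liner with the pipeline API:

# Sketch: extractive QA with a SQuAD2-fine-tuned BioM-ELECTRA checkpoint.
from transformers import pipeline

qa = pipeline("question-answering", model="sultan/BioM-ELECTRA-Large-SQuAD2")

result = qa(
    question="Which protein does imatinib target?",
    context="Imatinib is a tyrosine kinase inhibitor that targets the BCR-ABL fusion protein.",
)
print(result["answer"], result["score"])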
Hi @salrowili, thank you so much for the detailed answer. This is so valuable.
Thank you for the summary of the different models and the in/out of domain fine-tuning discussion, that makes a lot of sense. You made it clear how important it is to stay in-domain when fine-tuning.
Both MedQA and MedMCQA are built on medical entrance exam questions. Questions are mostly focused on medical cases, and the problem is to find the most relevant diagnosis. In particular in MedQA, the questions are long and include a long description of the case, which might be a good fit for clinical models (ClinicalBERT). However, in other cases, questions are general medical knowledge questions (e.g. ethical questions, methodological questions). For those, I think a general-domain model or a model trained on PubMed abstracts is better. In MedMCQA, the questions are shorter and more factual. I am working with both datasets in an open-book setting using Wikipedia as a knowledge base. In that context, I think a more general-domain model will work better. So you are right that ELECTRA and BioM-ELECTRA seem to be excellent candidates for this task.
Regarding the training details and tricks, thanks a lot. You probably saved me weeks of trial and error on the BioASQ dataset. I should probably apply the same tricks to my current setup. Regarding the TRC program, I have already used it for a previous project, so I will have to make do with the compute I have (which is substantial but not infinite). I am using the Hugging Face version of your models and have my codebase implemented with PyTorch Lightning; so far it has worked well across different types of devices.
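For what it's worth, the device-agnostic part boils down to the standard Lightning pattern; here is a toy sketch rather than my actual module:

# Toy sketch of a device-agnostic PyTorch Lightning setup (stand-in for my actual MedQA module).
import pytorch_lightning as pl
import torch
from torch.utils.data import DataLoader, TensorDataset

class TinyModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(8, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.cross_entropy(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

data = DataLoader(TensorDataset(torch.randn(64, 8), torch.randint(0, 2, (64,))), batch_size=8)
trainer = pl.Trainer(accelerator="auto", devices="auto", max_epochs=1)  # picks GPU, TPU, or CPU
trainer.fit(TinyModule(), train_dataloaders=data)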
Thank you again, Sultan, for the detailed reply, it is super helpful. Let's see what BioM-ELECTRA can do :) All the best, Valentin
Thanks, Valentin, for your kind words, and good luck with your research. Just as an update, we noticed that the new release of Transformers causes the TPU XLA implementation to not work properly. Thus, to fix this issue, use this code:
!pip3 install git+https://github.com/huggingface/transformers.git@v4.19.4
!git clone --depth 1 --branch v4.19.4 https://github.com/huggingface/transformers
instead of
!pip3 install git+https://github.com/huggingface/transformers
!git clone https://github.com/huggingface/transformers
We will update our examples with PyTorch XLA to use a pinned version of Transformers to avoid this issue in the future.
Sultan
Hello, first of all, great repo and great paper!
I am working with medical OpenQA (MedQA and MedMCQA), and so far I have been using PubMedBERT for development (it works fine for its size), and I am now looking to scale up the backbone. Your models look like the answer.
Which model(s) would you advise using first for an OpenQA task? I don't have infinite compute and I might not be able to try them all. Also, do you know how your models would compare with the work from Facebook's bio-LM?
Exciting work, looking forward to reading from you!