This is the code for the paper: UmlsBERT: Augmenting Contextual Embeddings with a Clinical Metathesaurus (NAACL 2021).
In this work, we introduced UmlsBERT, a contextual embedding model that integrates domain knowledge during pre-training. It was trained on biomedical corpora and uses the Unified Medical Language System (UMLS) clinical metathesaurus in two ways: i) by connecting words that have the same underlying concept in UMLS, and ii) by leveraging semantic group knowledge from UMLS to create clinically meaningful input embeddings.
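For intuition, here is a minimal PyTorch sketch of the second idea: augmenting the standard BERT input embeddings with a semantic group embedding term. All names and sizes here are illustrative assumptions, not the exact implementation in this repo.

import torch
import torch.nn as nn

class UmlsInputEmbeddings(nn.Module):
    # Illustrative sketch only: BERT-style input embeddings plus a
    # UMLS semantic-group embedding term. Sizes and names are assumptions.
    def __init__(self, vocab_size=28996, num_semantic_groups=15, hidden=768, max_len=512):
        super().__init__()
        self.word = nn.Embedding(vocab_size, hidden)
        self.position = nn.Embedding(max_len, hidden)
        # index 0 is reserved for tokens with no UMLS semantic group
        self.semantic_group = nn.Embedding(num_semantic_groups + 1, hidden, padding_idx=0)
        self.norm = nn.LayerNorm(hidden)

    def forward(self, input_ids, semantic_group_ids):
        positions = torch.arange(input_ids.size(1), device=input_ids.device)
        x = self.word(input_ids) + self.position(positions) + self.semantic_group(semantic_group_ids)
        return self.norm(x)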
This project was created with Python 3.7 and PyTorch 0.4.1, and it is based on the Transformers GitHub repo of the Hugging Face team.
We recommend installing and running the code from within a virtual environment.
First, download Anaconda from this link.
Second, create a conda environment with Python 3.7:
$ conda create -n umlsbert python=3.7
Upon restarting your terminal session, you can activate the conda environment:
$ conda activate umlsbert
In the project root directory, run the following to install the required packages:
pip3 install -r requirements.txt
If you start from a fresh VM, run the following commands sequentially before installing the required Python packages. The example below is for a vast.ai virtual machine:
apt-get update
apt install git-all
apt install python3-pip
apt-get install jupyter
In order to use the pre-trained UmlsBERT model for the word embeddings (or the semantic embeddings), you need to download it into the examples/checkpoint/ folder from the link:
wget -O umlsbert.tar.xz https://www.dropbox.com/s/kziiuyhv9ile00s/umlsbert.tar.xz?dl=0
and extract it in that folder with the following command:
tar -xvf umlsbert.tar.xz
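Once unpacked, you should be able to load the checkpoint with the Transformers API to extract contextual word embeddings. A minimal sketch, assuming the archive unpacks into examples/checkpoint/umlsbert with the usual config, vocabulary, and weight files:

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("examples/checkpoint/umlsbert")
model = AutoModel.from_pretrained("examples/checkpoint/umlsbert")

inputs = tokenizer("The patient was administered lisinopril.", return_tensors="pt")
with torch.no_grad():
    hidden_states = model(**inputs)[0]  # (1, seq_len, hidden_size)
print(hidden_states.shape)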
UmlsBERT was pretrained on MIMIC-III data. Unfortunately, we cannot provide the text of the MIMIC-III dataset, as completing a training course is mandatory in order to access it.
The MIMIC-III dataset can be downloaded from the following link
Pretraining a UmlsBERT model depends on data from NLTK, so you will have to download it first. Run the Python interpreter (python3) and type the following commands:
>>> import nltk
>>> nltk.download('punkt')
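Equivalently, you can download it non-interactively from the shell:

python3 -m nltk.downloader punkt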
After downloading the NOTEEVENTS table into the examples/language-modeling/ folder, run the following script (provided in the same folder) to create mimic_string.txt:
python3 mimic.py
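For reference, this preprocessing step essentially sentence-tokenizes the clinical notes and writes one sentence per line, which is what run_language_modeling.py consumes with --line_by_line. A rough sketch of the idea (the actual mimic.py may differ; the TEXT column name follows the MIMIC-III NOTEEVENTS schema):

import csv
import nltk

csv.field_size_limit(10_000_000)  # clinical notes can be very long

with open("NOTEEVENTS.csv", newline="") as f_in, open("mimic_string.txt", "w") as f_out:
    for row in csv.DictReader(f_in):
        # one whitespace-normalized sentence per output line
        for sentence in nltk.sent_tokenize(row["TEXT"]):
            f_out.write(" ".join(sentence.split()) + "\n")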
You can then pre-train a UmlsBERT model by running the following command in the examples/language-modeling/ folder.
Example of pretraining on top of Bio_ClinicalBERT:
python3 run_language_modeling.py --output_dir ./models/clinicalBert-v1 --model_name_or_path emilyalsentzer/Bio_ClinicalBERT --mlm --do_train --learning_rate 5e-5 --max_steps 150000 --block_size 128 --save_steps 1000 --per_gpu_train_batch_size 32 --seed 42 --line_by_line --train_data_file mimic_string.txt --umls --config_name config.json --med_document ./voc/vocab_updated.txt
MedNLI task
MedNLI is available through the MIMIC-III derived data repository. Any individual certified to access MIMIC-III can access MedNLI through the following link. After downloading the dataset, run the provided conversion script in the text-classification/ folder:
python3 mednli.py
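For context, MedNLI is distributed as JSON-lines files with sentence1/sentence2/gold_label fields, and the conversion step rewrites them into the tab-separated layout that the MNLI task reader of run_glue.py expects. A rough, hedged sketch (the file name and column layout are assumptions; defer to the provided mednli.py):

import csv
import json

# hypothetical input/output names, for illustration only
with open("mli_train_v1.jsonl") as f_in, open("train.tsv", "w", newline="") as f_out:
    writer = csv.writer(f_out, delimiter="\t")
    writer.writerow(["sentence1", "sentence2", "gold_label"])
    for line in f_in:
        example = json.loads(line)
        writer.writerow([example["sentence1"], example["sentence2"], example["gold_label"]])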
Then you can run UmlsBERT directly in the text-classification/ folder:
python3 run_glue.py --output_dir ./models/medicalBert-v1 --model_name_or_path ../checkpoint/umlsbert --data_dir dataset/mednli/mednli --num_train_epochs 3 --per_device_train_batch_size 32 --learning_rate 1e-4 --do_train --do_eval --do_predict --task_name mnli --umls --med_document ./voc/vocab_updated.txt
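After fine-tuning, a quick sanity check of the classifier on a premise/hypothesis pair might look like this (a sketch; the label order comes from the fine-tuned model's config):

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_dir = "./models/medicalBert-v1"  # output_dir from the command above
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSequenceClassification.from_pretrained(model_dir)

inputs = tokenizer("The patient denies any chest pain.", "The patient has chest pain.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs)[0]
print(logits.softmax(-1))  # probabilities over the entailment/neutral/contradiction labels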
NER task
Due to copyright restrictions on the i2b2 datasets, you need to download them yourself by following the link.
We provide code for converting the i2b2 datasets, with instructions for each dataset below (the expected output format is sketched after the list):
i2b2 2006:
i2b2 2010:
i2b2 2012:
i2b2 2014:
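Whichever year you use, each converter is expected to end up with a label list (e.g. dataset/NER/2006/label.txt) and CoNLL-style train/dev/test files in which each line holds a token and its tag, with blank lines separating sentences, since that is the format run_ner.py reads. An illustrative fragment (the tag set shown here is hypothetical):

Admitted O
for O
coronary B-PROBLEM
artery I-PROBLEM
disease I-PROBLEM
. O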
We provide an example notebook under the experiements/ folder:
or you can run UmlsBERT directly in the token-classification/ folder:
python3 run_ner.py --output_dir ./models/medicalBert-v1 --model_name_or_path ../checkpoint/umlsbert --labels dataset/NER/2006/label.txt --data_dir dataset/NER/2006 --do_train --num_train_epochs 20 --per_device_train_batch_size 32 --learning_rate 1e-4 --do_predict --do_eval --umls --med_document ./voc/vocab_updated.txt
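And a minimal inference sketch for the resulting tagger (assuming the output directory from the command above):

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_dir = "./models/medicalBert-v1"  # output_dir from the command above
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForTokenClassification.from_pretrained(model_dir)

inputs = tokenizer("Patient was started on metformin for diabetes.", return_tensors="pt")
with torch.no_grad():
    predictions = model(**inputs)[0].argmax(-1)[0]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, label_id in zip(tokens, predictions.tolist()):
    print(token, model.config.id2label[label_id])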
If you find our work useful, you can cite our paper using:
@inproceedings{michalopoulos-etal-2021-umlsbert,
title = "{U}mls{BERT}: Clinical Domain Knowledge Augmentation of Contextual Embeddings Using the {U}nified {M}edical {L}anguage {S}ystem {M}etathesaurus",
author = "Michalopoulos, George and
Wang, Yuanxin and
Kaka, Hussam and
Chen, Helen and
Wong, Alexander",
booktitle = "Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
month = jun,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.naacl-main.139",
doi = "10.18653/v1/2021.naacl-main.139",
pages = "1744--1753",
}