The dataset released in this project is called MedRed and can be accessed on FigShare. The resulting model from this project is called MedDL and can be accessed on a specific GitHub repo.
More info on the project page.
numpy
pandas
seaborn
flair
tqdm
spacy
xgboost
sklearn
In this project, we implemented a deep learning method for medical entity extraction from social media text. In natural language processing (NLP) terminology, this is called medical Named Entity Recognition (NER). The method is based on the BiLSTM+CRF deep learning architecture, using RoBERTa contextual embeddings in combination with GloVe word embeddings.
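To make the NER task concrete, here is a minimal sketch of how token-level tags can be grouped back into medical entities. It assumes a BIO tagging scheme and the labels DRUG and SYMPTOM; both are illustrative assumptions, and the function name `decode_bio` is hypothetical rather than part of this repository.

```python
from typing import List, Optional, Tuple


def decode_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_text, label) pairs."""
    entities: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label: Optional[str] = None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity, closing any open one.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_label))
            current_tokens = [token]
            current_label = tag[2:]
        elif tag.startswith("I-") and current_tokens and tag[2:] == current_label:
            # An I- tag with the same label continues the open entity.
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity,
            # closes it (a stray I- token is dropped in this sketch).
            if current_tokens:
                entities.append((" ".join(current_tokens), current_label))
            current_tokens = []
            current_label = None

    if current_tokens:
        entities.append((" ".join(current_tokens), current_label))
    return entities


# Hypothetical example sentence and tags (not taken from any dataset).
tokens = ["I", "took", "ibuprofen", "for", "my", "joint", "pain"]
tags = ["O", "O", "B-DRUG", "O", "O", "B-SYMPTOM", "I-SYMPTOM"]
entities = decode_bio(tokens, tags)
```

With these inputs, `entities` holds one drug mention and one two-token symptom mention.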
We also created a novel labelled dataset for medical entity extraction called MedRed (from Reddit). We then evaluated the method on two existing datasets, CADEC (from AskAPatient) and Micromed (from Twitter), as well as on MedRed.
Finally, to validate the method on a large scale, we applied it to half a million Reddit posts from disease-specific subreddits (such as r/psoriasis and r/bpd). We then showed that the disease topic of each post can be predicted with high accuracy solely from the medical entities extracted by our method.
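One way to picture the disease prediction step is to turn each post's extracted entities into a count vector and use the subreddit as the label. The sketch below shows only that feature construction, under the assumption of a simple bag-of-entities representation; the helper names and the example entities are hypothetical, and the features and classifier actually used in the validation scripts may differ.

```python
from collections import Counter
from typing import Dict, List


def build_vocabulary(posts_entities: List[List[str]]) -> Dict[str, int]:
    """Map each distinct lower-cased entity to a column index (sorted for determinism)."""
    vocab = sorted({e.lower() for ents in posts_entities for e in ents})
    return {entity: i for i, entity in enumerate(vocab)}


def to_count_vector(entities: List[str], vocab: Dict[str, int]) -> List[int]:
    """Count how often each vocabulary entity occurs in one post."""
    counts = Counter(e.lower() for e in entities)
    vector = [0] * len(vocab)
    for entity, n in counts.items():
        if entity in vocab:  # entities unseen at vocabulary-building time are ignored
            vector[vocab[entity]] = n
    return vector


# Hypothetical extracted entities for two posts, labelled by subreddit.
posts_entities = [
    ["methotrexate", "itching", "plaques"],
    ["mood swings", "lamotrigine", "mood swings"],
]
labels = ["psoriasis", "bpd"]

vocab = build_vocabulary(posts_entities)
X = [to_count_vector(ents, vocab) for ents in posts_entities]
```

The rows of `X`, paired with `labels`, could then be passed to any standard classifier, for example one from the xgboost or sklearn packages listed in the requirements.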
code
train
contains the scripts to create the flair corpus from a given dataset and for running the model training.
evaluation
contains the scripts to evaluate the trained models on each of the 3 labelled datasets.
validation
contains the scripts for applying the trained models on other datasets, and for disease prediction on Reddit from the extracted posts.
prep
contains the scripts for pre-processing text into the labelled format suitable for flair.
data
MedRed and Reddit can be downloaded from FigShare data. The others (i.e., CADEC and Micromed) are available from their respective publications.
Reddit
CADEC
Micromed
MedRed
resources
The resulting pretrained models can also be found on FigShare models.
model
results
Running the scripts will save the results in these folders.
NER_res
CADEC
Micromed
MedRed
Reddit
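As an illustration of the kind of labelled format the prep scripts target, the sketch below renders annotated sentences in a CoNLL-style column layout: one token and its tag per line, with a blank line between sentences, which is the general layout flair's column corpus reader consumes. The BIO scheme, the DRUG and SYMPTOM labels, the token-index span representation, and the function name `to_column_format` are assumptions made for this example, not a description of the repository's scripts.

```python
from typing import List, Tuple

# A span is (start_token, end_token_exclusive, label) over token indices.
Span = Tuple[int, int, str]


def to_column_format(tokens: List[str], spans: List[Span]) -> str:
    """Render one sentence as 'token tag' lines using BIO tags."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return "\n".join(f"{tok} {tag}" for tok, tag in zip(tokens, tags))


# Hypothetical annotated sentences (not taken from any dataset).
sentences = [
    (["Humira", "cleared", "my", "scalp", "plaques"],
     [(0, 1, "DRUG"), (3, 5, "SYMPTOM")]),
    (["No", "side", "effects", "yet"], []),
]

# Sentences are separated by a blank line; the file ends with a newline.
corpus_text = "\n\n".join(to_column_format(t, s) for t, s in sentences) + "\n"
```

Writing `corpus_text` to a train, dev, or test file would give a two-column file with the token in the first column and the tag in the second.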
We used and thank the Flair library by Zalando Research.
This project is licensed under the MIT License; see the LICENSE.md file for details.