Extracting Medical Entities from Social Media

Model Diagram

Repository for the paper Extracting Medical Entities from Social Media

The dataset released in this project is called MedRed and can be accessed on FigShare. The resulting model from this project is called MedDL and is available in a separate GitHub repo.

More info on the project page.

Requirements

numpy
pandas
seaborn
flair
tqdm
spacy
xgboost
sklearn

In this project, we implemented a deep learning method for extracting medical entities from social media text. In natural language processing (NLP) terminology, this task is called medical Named Entity Recognition (NER). The method is based on the BiLSTM+CRF deep learning architecture and combines RoBERTa contextual embeddings with GloVe word embeddings.
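To make the NER output concrete, here is a minimal sketch of how per-token tags from a sequence tagger can be grouped into entity mentions. It assumes a BIO tagging scheme and uses the hypothetical labels `DRUG` and `SYMPTOM`; neither the scheme nor the label names are stated in this README, and `bio_to_spans` is an illustrative helper, not part of the released code.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_text, label) pairs.

    Assumption: tags look like "B-LABEL", "I-LABEL" or "O".
    """
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label: Optional[str] = None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts: close the open one, if any.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens = [token]
            current_label = tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # Continuation of the currently open entity.
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity:
            # close the open entity (if any) and ignore this token.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None

    # Close an entity that runs to the end of the sentence.
    if current_tokens:
        spans.append((" ".join(current_tokens), current_label))
    return spans


# Hypothetical example sentence and tags.
example_tokens = ["I", "took", "ibuprofen", "for", "my", "joint", "pain"]
example_tags = ["O", "O", "B-DRUG", "O", "O", "B-SYMPTOM", "I-SYMPTOM"]
example_spans = bio_to_spans(example_tokens, example_tags)
print(example_spans)  # [('ibuprofen', 'DRUG'), ('joint pain', 'SYMPTOM')]
```

In this sketch, an `I-` tag that does not follow a matching `B-` tag is dropped rather than promoted to a new entity; other conventions are possible.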

We also created a novel labelled dataset for medical entity extraction called MedRed (from Reddit). We then evaluated the method on two existing datasets, CADEC (from AskAPatient) and Micromed (from Twitter), as well as on MedRed.

Finally, to validate the method on a large scale, we applied it to half a million Reddit posts from disease-specific subreddits (such as r/psoriasis and r/bpd). We then showed that the disease topic of each post can be predicted with high accuracy solely from the medical entities extracted by our method.
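As an illustration of predicting a post's disease topic from its extracted entities alone, the sketch below turns each post's entity list into a bag-of-entities count vector. This is one plausible feature representation under stated assumptions, not a description of the features used in the paper; the example posts, the entity strings, and the helper names `build_vocabulary` and `vectorize` are all hypothetical.

```python
from collections import Counter
from typing import Dict, List


def build_vocabulary(posts_entities: List[List[str]]) -> Dict[str, int]:
    """Map each distinct (lower-cased) entity to a column index."""
    vocab: Dict[str, int] = {}
    for entities in posts_entities:
        for entity in entities:
            key = entity.lower()
            if key not in vocab:
                vocab[key] = len(vocab)
    return vocab


def vectorize(entities: List[str], vocab: Dict[str, int]) -> List[int]:
    """Count how often each vocabulary entity occurs in one post."""
    counts = Counter(entity.lower() for entity in entities)
    vector = [0] * len(vocab)
    for entity, n in counts.items():
        index = vocab.get(entity)
        if index is not None:  # entities unseen at vocabulary time are skipped
            vector[index] = n
    return vector


# Hypothetical extracted entities per post, with the subreddit as the label.
posts_entities = [
    ["plaques", "itching", "methotrexate"],
    ["mood swings", "therapy"],
    ["itching", "Itching"],
]
labels = ["psoriasis", "bpd", "psoriasis"]

vocab = build_vocabulary(posts_entities)
X = [vectorize(entities, vocab) for entities in posts_entities]
print(X)  # [[1, 1, 1, 0, 0], [0, 0, 0, 1, 1], [0, 2, 0, 0, 0]]
```

Vectors like `X`, paired with `labels`, could then be passed to a standard classifier; xgboost and sklearn are listed in the requirements, but this README does not say which classifier or features produced the reported result.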

Structure

code
- train contains the scripts for creating the flair corpus from a given dataset and for training the models.
- evaluation contains the scripts to evaluate the trained models on each of the 3 labelled datasets.
- validation contains the scripts for applying the trained models to other datasets, and for disease prediction on Reddit from the extracted posts.
- prep contains the scripts for preprocessing text into the labelled format suitable for flair.
data: MedRed and Reddit can be downloaded from FigShare data. The others (i.e., CADEC and Micromed) are available from their respective publications.
- Reddit
- CADEC
- Micromed
- MedRed
resources: the resulting pretrained models can also be found on FigShare models.
- model
results: running the scripts will save the results in these folders.
- NER_res
- CADEC
- Micromed
- MedRed
- Reddit
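The prep step listed above produces labelled text in a format that flair can read. As a rough sketch, flair's column-based corpus loader expects one token per line with its tag in a second whitespace-separated column, and a blank line between sentences. The helper `to_column_format`, the tag names, and the example sentences below are hypothetical and are not taken from the prep scripts.

```python
from typing import List, Tuple


def to_column_format(sentences: List[List[Tuple[str, str]]]) -> str:
    """Serialize tagged sentences as 'token tag' lines.

    Sentences are separated by a blank line. Assumption: tokens contain
    no whitespace, since whitespace separates the columns.
    """
    blocks = []
    for sentence in sentences:
        lines = [f"{token} {tag}" for token, tag in sentence]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + "\n"


# Hypothetical tagged sentences.
tagged = [
    [("I", "O"), ("took", "O"), ("ibuprofen", "B-DRUG")],
    [("Headache", "B-SYMPTOM")],
]
column_text = to_column_format(tagged)
print(column_text)
```

A file written this way can typically be loaded with flair's `ColumnCorpus`, using a column mapping such as `{0: "text", 1: "ner"}`; check the flair version you have installed for the exact arguments.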

We used the Flair library by Zalando Research and thank its authors.

Licence

This project is licensed under the MIT License; see the LICENSE.md file for details.

sanja7s / MedRed
