This repository contains code for Named Entity Recognition (NER), Relationship Extraction (RE), and knowledge graph generation from electronic health records (EHRs). The methods were applied to the n2c2 2018 challenge dataset, which was augmented with a sample of the ADE corpus dataset. This is the capstone project for my undergraduate degree, a Bachelor of Technology in Computer Science and Engineering.
The purpose of this project is to automatically structure EHR data into a format that lets doctors and patients quickly find the information they need. Specifically, it builds an NER model that recognizes entities such as drug, strength, duration, frequency, adverse drug event (ADE), reason for taking the drug, route, and form. An RE model then identifies the relationship between each drug and every other named entity, and a knowledge graph is generated from these relationships so that doctors can review a patient's disease and drug history at a glance. The system also supports query answering, in which the knowledge graph is used to answer user queries.
The main objective of the project is to use the extracted relationships between drugs and every other entity to build a comprehensive knowledge graph that can be used for quick summaries, query answering, and analysis, thus simplifying knowledge discovery in the biomedical field.
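To make the idea of a drug-centred knowledge graph concrete, here is a minimal sketch in plain Python. The `Relation` schema, the `build_graph` helper, and the sample values are all hypothetical and are not taken from this repository; the relation names only illustrate the drug-to-entity pairing described above.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """One drug-centred relation, as an RE model might emit it (hypothetical schema)."""
    drug: str
    relation: str  # illustrative names such as "Strength-Drug" or "ADE-Drug"
    value: str


def build_graph(relations):
    """Group relations into {drug: {relation: [values]}}, merging drugs case-insensitively."""
    graph = defaultdict(lambda: defaultdict(list))
    for rel in relations:
        values = graph[rel.drug.lower()][rel.relation]
        if rel.value not in values:  # skip exact duplicates
            values.append(rel.value)
    return {drug: dict(edges) for drug, edges in graph.items()}


# Made-up sample relations, for illustration only.
relations = [
    Relation("Warfarin", "Strength-Drug", "5 mg"),
    Relation("Warfarin", "Frequency-Drug", "daily"),
    Relation("warfarin", "ADE-Drug", "bleeding"),
    Relation("Metformin", "Reason-Drug", "type 2 diabetes"),
]
graph = build_graph(relations)
print(graph["warfarin"])
```

Each drug becomes a hub node and every related entity hangs off it, which is what allows a per-drug summary to be read directly from the graph.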
To run this project locally, download the datasets from the links below and preprocess them to generate the training and test data for the NER and RE models. You will also need a Neo4j account for knowledge graph generation.
https://huggingface.co/datasets/ade_corpus_v2
https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp/
https://neo4j.com/
pip install -r requirements.txt
python generate_data.py \
--task ner \
--input_dir data \
--ade_dir ade_corpus \
--target_dir dataset \
--max_seq_len 512 \
--dev_split 0.1 \
--tokenizer biobert-base \
--ext txt \
--sep " "
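The `--ext txt` and `--sep " "` options suggest that the NER data is written as one token and one label per line. The sketch below shows how character-span annotations could be turned into such rows using BIO tags. It is an illustration under assumptions, not the logic of `generate_data.py`: the `to_bio` helper is hypothetical, and it splits on whitespace rather than using the BioBERT tokenizer.

```python
import re


def to_bio(text, spans):
    """Tag whitespace tokens with BIO labels.

    spans: list of (start, end, label) character offsets, end exclusive.
    Tokens that do not fall entirely inside a span are tagged "O".
    """
    rows = []
    for match in re.finditer(r"\S+", text):
        tag = "O"
        for start, end, label in spans:
            if match.start() >= start and match.end() <= end:
                prefix = "B" if match.start() == start else "I"
                tag = f"{prefix}-{label}"
                break
        rows.append((match.group(), tag))
    return rows


# Made-up sentence and annotations, for illustration only.
text = "Started aspirin 81 mg daily"
spans = [(8, 15, "Drug"), (16, 21, "Strength"), (22, 27, "Frequency")]

rows = to_bio(text, spans)
for token, tag in rows:
    print(" ".join((token, tag)))  # token and label joined by the separator
```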
export SAVE_DIR=./output
export DATA_DIR=./dataset
export MAX_LENGTH=128
export BATCH_SIZE=16
export NUM_EPOCHS=5
export SAVE_STEPS=1000
export SEED=0
python run_ner.py \
--data_dir ${DATA_DIR} \
--labels ${DATA_DIR}/labels.txt \
--model_name_or_path dmis-lab/biobert-large-cased-v1.1 \
--output_dir ${SAVE_DIR} \
--max_seq_length ${MAX_LENGTH} \
--num_train_epochs ${NUM_EPOCHS} \
--per_device_train_batch_size ${BATCH_SIZE} \
--save_steps ${SAVE_STEPS} \
--seed ${SEED} \
--do_train \
--do_eval \
--do_predict \
--overwrite_output_dir
* RE Model
```sh
export SAVE_DIR=./output
export DATA_DIR=./dataset
export MAX_LENGTH=128
export BATCH_SIZE=8
export NUM_EPOCHS=3
export SAVE_STEPS=1000
export SEED=1
export LEARNING_RATE=5e-5
# Trailing backslashes keep all arguments on one logical command line.
python run_re.py \
  --task_name ehr-re \
  --config_name bert-base-cased \
  --data_dir ${DATA_DIR} \
  --model_name_or_path dmis-lab/biobert-base-cased-v1.1 \
  --max_seq_length ${MAX_LENGTH} \
  --num_train_epochs ${NUM_EPOCHS} \
  --per_device_train_batch_size ${BATCH_SIZE} \
  --save_steps ${SAVE_STEPS} \
  --seed ${SEED} \
  --do_train \
  --do_eval \
  --do_predict \
  --learning_rate ${LEARNING_RATE} \
  --output_dir ${SAVE_DIR} \
  --overwrite_output_dir
```
uvicorn fast_api:app --reload
To demonstrate Named Entity Recognition (NER), the relationship table, and the knowledge graph, a web app was developed using HTML, CSS, and JavaScript. The user uploads an EHR through the graphical user interface (GUI); entities and relationships are then extracted from it, and a knowledge graph is built from them. The extracted entities, the relationships between them, and the generated knowledge graph can each be viewed as results.
The uploaded EHR's data is stored in the Neo4j graph database, as shown in the following figure.
An example of query answering is shown in the following image.
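As a rough illustration of how one extracted relation could be written to Neo4j, the sketch below builds a parameterized Cypher `MERGE` statement. The `relation_to_cypher` helper, the `Drug` node label, and the `name` and `text` properties are assumptions made for this example; the repository's actual graph schema may differ.

```python
import re


def relation_to_cypher(drug, relation, entity_label, value):
    """Build a Cypher MERGE statement and its parameters for one relation.

    Node labels and relationship types cannot be passed as Cypher parameters,
    so they are validated here before being placed into the query text.
    """
    # "Strength-Drug" -> "STRENGTH_DRUG": hyphens are not valid in unquoted names.
    rel_type = re.sub(r"[^A-Za-z0-9]+", "_", relation).strip("_").upper()
    if not rel_type.isidentifier() or not entity_label.isidentifier():
        raise ValueError("label and relation must be simple identifiers")
    query = (
        "MERGE (d:Drug {name: $drug}) "
        f"MERGE (e:{entity_label} {{text: $value}}) "
        f"MERGE (d)-[:{rel_type}]->(e)"
    )
    return query, {"drug": drug, "value": value}


# Made-up relation, for illustration only.
query, params = relation_to_cypher("warfarin", "Strength-Drug", "Strength", "5 mg")
print(query)
print(params)
```

Using `MERGE` rather than `CREATE` means that uploading the same record twice does not duplicate nodes or edges. With the official Neo4j Python driver, the statement and its parameters could be passed to `session.run(query, params)`.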
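The source does not describe how queries are resolved against the graph, so the following is only a toy sketch of one possible approach: match a keyword in the question to a relation type, then return the values linked to any drug named in the question. The `KEYWORDS` mapping, the `answer` function, and the sample triples are all hypothetical.

```python
# Hypothetical mapping from question keywords to relation types.
KEYWORDS = {
    "strength": "Strength-Drug",
    "how often": "Frequency-Drug",
    "frequency": "Frequency-Drug",
    "side effect": "ADE-Drug",
    "adverse": "ADE-Drug",
    "why": "Reason-Drug",
    "reason": "Reason-Drug",
}


def answer(question, triples):
    """Return values of triples whose drug and relation keyword both occur in the question."""
    q = question.lower()
    wanted = {rel for keyword, rel in KEYWORDS.items() if keyword in q}
    return [value for drug, rel, value in triples
            if drug.lower() in q and rel in wanted]


# Made-up (drug, relation, value) triples, for illustration only.
triples = [
    ("warfarin", "Strength-Drug", "5 mg"),
    ("warfarin", "ADE-Drug", "bleeding"),
    ("metformin", "Reason-Drug", "type 2 diabetes"),
]
print(answer("What side effects did warfarin cause?", triples))
print(answer("Why was metformin prescribed?", triples))
```

A real system would more likely translate the question into a graph query, but the sketch shows the basic lookup that a drug-centred graph makes possible.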
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
git checkout -b feature/AmazingFeature
git commit -m 'Add some AmazingFeature'
git push origin feature/AmazingFeature
Distributed under the MIT License. See LICENSE for more information.