obi-ml-public / ehr_deidentification

Robust de-identification of medical notes using transformer architectures
MIT License

Robust DeID: De-Identification of Medical Notes using Transformer Architectures

This repository was used to train and evaluate various de-identification models and strategies on medical notes from the I2B2-DEID dataset and the Mass General Brigham (MGB) network. The models and strategies are extensible and can be used on other datasets as well. Trained models are published on Hugging Face under the OBI organization.

The main features are:

  1. Transformer models - Any transformer model from the Hugging Face library can be used for training. We make available a RoBERTa (Liu et al., 2019) model and a ClinicalBERT (Alsentzer et al., 2019) model, both fine-tuned for de-identification, on Hugging Face: obi_roberta_deid and obi_bert_deid. Both can be used for testing (forward pass).
  2. Recall-biased thresholding - Ability to bias classification toward recall in order to aggressively remove PHI (protected health information) from notes. This is a safer and more robust option when working with sensitive data like medical notes.
  3. Custom clinical tokenizer - Includes 60 regular expressions based on the structure and information generally found in medical notes. This tokenizer resolves common typographical errors and missing spaces that occur in clinical notes.
  4. Context enhancement - Option to add tokens to the left and right of a given sequence as context. These tokens can be used only as context, or they can also be trained on (which essentially mimics a sliding-window approach). The added tokens provide additional context, especially for peripheral tokens in a given sequence.
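To make the context enhancement idea in feature 4 concrete, here is a minimal sketch in plain Python. It is not the repository's implementation: the function name `add_context`, its parameters, and the example note are hypothetical and chosen only for illustration. The sketch pads a chunk of tokens with neighbouring tokens on each side and returns a mask that marks which tokens belong to the chunk itself and which are context only.

```python
from typing import List, Tuple


def add_context(
    tokens: List[str], start: int, end: int, max_context: int
) -> Tuple[List[str], List[bool]]:
    """Pad the chunk tokens[start:end] with up to max_context tokens per side.

    Hypothetical helper for illustration only. Returns the widened window and
    a mask that is True for chunk tokens and False for context-only tokens.
    """
    # Clamp the window to the boundaries of the note.
    left = max(0, start - max_context)
    right = min(len(tokens), end + max_context)
    window = tokens[left:right]
    # Mark which positions in the window belong to the original chunk.
    is_target = [start <= i < end for i in range(left, right)]
    return window, is_target


# Made-up example note, split on whitespace for simplicity.
note = "Patient John Smith was seen on 01/02/2020 by Dr. Lee".split()
window, is_target = add_context(note, start=3, end=6, max_context=2)
for token, target in zip(window, is_target):
    print(f"{token:12s} {'chunk' if target else 'context'}")
```

If the mask is used to exclude context tokens from the loss and from the predictions, the added tokens act purely as context. Training on every token in the window instead would correspond to the sliding-window variant mentioned above.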

Since de-identification is a sequence labeling task, this tool can be applied to any other sequence labeling task as well. More details on how to use this tool, the format of the data, and other useful information are presented below.
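Because a sequence labeling model produces a probability for each label at each token, recall-biased thresholding (feature 2) can be illustrated on those per-token probabilities. The sketch below is one simple way to bias predictions toward PHI, not necessarily the rule this repository uses: the function `recall_biased_labels`, the threshold value, and the label names are all assumptions made for illustration. A token is assigned its most likely PHI label whenever that label's probability reaches the threshold, even if the non-PHI label "O" is more likely.

```python
from typing import Dict, List


def recall_biased_labels(
    token_probs: List[Dict[str, float]],
    threshold: float,
    outside_label: str = "O",
) -> List[str]:
    """Assign a PHI label whenever its probability reaches the threshold.

    Hypothetical helper for illustration only. A lower threshold flags more
    tokens as PHI, trading precision for recall.
    """
    labels = []
    for probs in token_probs:
        # Consider only the PHI labels, ignoring the non-PHI label.
        phi = {label: p for label, p in probs.items() if label != outside_label}
        best_phi = max(phi, key=phi.get)
        if phi[best_phi] >= threshold:
            labels.append(best_phi)
        else:
            labels.append(outside_label)
    return labels


# Made-up probabilities for three tokens over three illustrative labels.
token_probs = [
    {"O": 0.90, "PATIENT": 0.08, "DATE": 0.02},
    {"O": 0.55, "PATIENT": 0.40, "DATE": 0.05},
    {"O": 0.10, "PATIENT": 0.05, "DATE": 0.85},
]

# Standard decoding: take the most likely label for each token.
argmax_labels = [max(p, key=p.get) for p in token_probs]
# Recall-biased decoding with an assumed threshold of 0.30.
biased_labels = recall_biased_labels(token_probs, threshold=0.30)
print(argmax_labels)
print(biased_labels)
```

In this toy example the second token is labeled "O" under standard decoding but is flagged as PHI once the threshold is lowered. In practice the threshold would be chosen on held-out data to reach a target recall.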

Comments, feedback and improvements are welcome and encouraged!

Dataset Annotations

Installation

Dependencies

conda env create -f deid.yml
conda activate deid

Robust De-Id

Data Format

Training

Test (Forward Pass/Inference)

Usage

Test (Forward Pass/Inference)

Training

Evaluation

Recall-biased thresholding

Trained Models