ELM4PSIR - Exploring Language Modelling for (NHS) Patient Safety Incident Reports - DART PhD Internship Project
NHS England - Digital Analytics and Research Team (DART) - PhD Internship Project

About the Project

ELM4PSIR presents code to train, evaluate, and explore various Language Models (LMs) applied to NHS patient safety incident data from the National Reporting and Learning System (NRLS), with the goal of creating better models to aid downstream tasks.

This repository is experimental, so models generated with it are not suitable for deployment in a production environment without further testing and evaluation. Please see the Model Card for more details.

This work was conducted as part of an NHS England DART PhD Internship project by Niall Taylor over around five months, between June and November 2022. Further information on the original project proposal can be found here.

The associated report can be found in the reports folder.

Note: No data, public or private, are shared in this repository.

Project Structure

ELM4PSIR is made up of multiple "strands" or pipelines revolving around generating meaningful numerical or embedding representations of text data.

├── checklist_testing     # Example notebook using CheckList with patient safety style data
├── classification_tasks  # Multiple pipelines for training/evaluating classifiers
├── embedding_tools       # Code and notebooks for comparing and visualising LM embeddings
├── language_modelling    # Code for training/evaluating various language models
├── reports               # Project reports
├── topic_modelling       # A folder containing avenues for topic modelling with PLMs
├── utils                 # Scripts for creating training and test patient safety datasets

The checklist_testing folder contains an example implementation of behavioural testing of trained NLP models, using a patient safety example. It uses the CheckList framework outlined at checklist-repo; further details are provided in the folder itself.


The classification_tasks folder contains a large set of pipelines for training and evaluating text classification models on selected downstream pseudo-tasks related to patient safety reports. More details are given in the report.


The embedding_tools folder contains scripts and notebooks for comparing and visualising the contextualised embeddings produced by the pretrained language models trained with this repo.
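
As a rough illustration of what comparing embeddings involves, the sketch below computes cosine similarity between vectors in plain Python. The `cosine_similarity` helper and the toy vectors are made up for illustration and are not taken from this repository; the actual comparison and visualisation code lives in the embedding_tools folder.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors (assumption): real LM embeddings typically
# have hundreds of dimensions.
report_a = [0.2, 0.9, 0.1]
report_b = [0.25, 0.8, 0.05]
report_c = [0.9, 0.1, 0.4]

sim_ab = cosine_similarity(report_a, report_b)
sim_ac = cosine_similarity(report_a, report_c)
print(f"a vs b: {sim_ab:.3f}, a vs c: {sim_ac:.3f}")
```

A higher value means the two vectors point in a more similar direction, which is one common way to judge whether two texts are embedded close together.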


The language_modelling folder contains language modelling pipelines for word2vec training with gensim, and for transformer-based pre-trained language models (PLMs) using Hugging Face transformers.


The topic_modelling folder directs users to a range of topic modelling approaches that complement the modelling approaches outlined in this repo, in particular those that work with embeddings produced by pretrained language models.

Built With

The majority of this codebase was developed in Python v3.8. The work is mostly undertaken in PyTorch and relies heavily on the Transformers package for the vast majority of language model handling.

Further, we have included modified copies of various scripts and codebases within the repository (with references) where appropriate or needed. The most sizable inclusions are the DeCLUTR codebase, which was modified to work on our hardware/OS setup, and the Word Vector models used in the baselining approaches. We give thanks to the authors of all incorporated components for making such useful and reusable projects.

⚠️ Warning ⚠️

Python v3.7 was used for the DeCLUTR model training (see language_modelling/DeCLUTR). We highly recommend using a separate virtual environment for DeCLUTR, which has its own setup instructions in the language_modelling folder README.md.

Getting Started


See ./requirements.txt for package versions - installation in a virtual environment is recommended:

conda create --name elm4psir python=3.8
conda activate elm4psir
python -m pip install -r requirements.txt

The repository uses pre-commit hooks to enforce code style with black, check flake8 compliance, and perform a few other checks. See .pre-commit-config.yaml for more details. These hooks also need installing locally via:

pre-commit autoupdate
pre-commit install

The hooks will then run on each commit.

When training on GPU machines, the appropriate PyTorch bundle should be installed - for more info: https://pytorch.org/get-started/locally/

conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
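
Before launching a long training run, it can be worth confirming that PyTorch can actually see a GPU. The sketch below is a small standalone check and is not part of this repository: `describe_device` is a hypothetical helper name, while `torch.cuda.is_available()` is PyTorch's own call.

```python
def describe_device() -> str:
    """Return "cuda" if PyTorch can see a GPU, "cpu" if it cannot, or
    "torch-not-installed" when PyTorch is missing from the environment."""
    try:
        import torch
    except ImportError:
        return "torch-not-installed"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = describe_device()
print(f"Training device: {device}")
```

If this reports "cpu" on a GPU machine, the installed PyTorch build probably does not match the local CUDA setup.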

spaCy NLP

A few of the projects will benefit from using spaCy. To use spaCy, first install it:

python -m pip install -U spacy

followed by downloading their pretrained models via:

python -m spacy download en_core_web_sm

The en_core_web_sm model is given as an example; refer to the spaCy website for more details.
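
A quick way to confirm that the model downloaded correctly is to load it and tokenise a sentence. The sketch below assumes the en_core_web_sm model; `load_spacy_model` is a hypothetical helper and the example sentence is made up.

```python
def load_spacy_model(name: str = "en_core_web_sm"):
    """Return the loaded spaCy pipeline, or None if spaCy or the model is missing."""
    try:
        import spacy
    except ImportError:
        return None
    try:
        return spacy.load(name)
    except OSError:
        # spaCy raises OSError when the named model has not been downloaded.
        return None


nlp = load_spacy_model()
if nlp is None:
    print("spaCy or en_core_web_sm is not installed")
else:
    # Made-up sentence, used only to show tokenisation.
    doc = nlp("Patient slipped on a wet floor near the ward entrance.")
    print([token.text for token in doc])
```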

GPU Support

Training with GPU is recommended. Single-GPU training has been tested with:

Usage and Datasets

NOTE: The vast majority of the language modelling code in this repo can be used out of the box with any reasonably sized text dataset. All that is required is care with how the directories are set up and how the different files are saved.

The repository has multiple pipelines and models to train for various tasks. Training the language models and classification tasks requires some initial preprocessing steps, as follows:

Create training and held out data splits

We create a 90%:10% train/held-out split from the original raw reports data. The held-out dataset is set aside so that it is available for independent evaluation.

The script then uses the 90% training data as the basis for creating a further 90%:10% train/test split for LM training and other downstream tasks.

Run the following script to create these datasets. All csv files will be stored at the provided save_path:

python utils/create_lm_data_split.py --raw_data_file {raw_data_path} --save_path {directory_to_save_new_data} --hold_out_percentage 0.10

Once complete, new training_data/held_out_data and lm_training/lm_test files are saved in {directory_to_save_new_data} in csv format.
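
The two-stage split described above can be sketched in plain Python. This is an illustration only and makes several assumptions: that both splits are simple random splits, that a fixed seed is used, and that each report is one record. The `split_holdout` helper is hypothetical and is not the code in utils/create_lm_data_split.py. With two successive 90%:10% splits, the LM training set ends up as 81% of the raw data, the LM test set as 9%, and the held-out set as 10%.

```python
import random
from typing import List, Sequence, Tuple


def split_holdout(
    rows: Sequence[str], hold_out_fraction: float, seed: int = 42
) -> Tuple[List[str], List[str]]:
    """Shuffle rows and split off the given fraction as a held-out set."""
    if not 0.0 < hold_out_fraction < 1.0:
        raise ValueError("hold_out_fraction must be between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_hold_out = int(round(len(shuffled) * hold_out_fraction))
    return shuffled[n_hold_out:], shuffled[:n_hold_out]


# Stand-in records (assumption); real input would be the raw reports.
reports = [f"report {i}" for i in range(1000)]

# Stage 1: 90% training data, 10% held out for independent evaluation.
training_data, held_out_data = split_holdout(reports, 0.10)

# Stage 2: split the training data again for LM training and testing.
lm_training, lm_test = split_holdout(training_data, 0.10)

print(len(training_data), len(held_out_data), len(lm_training), len(lm_test))
```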

Preprocessing and cleaning the LM training and test datasets

We apply some fairly light-touch cleaning to the LM training/test data, using simple regex and removal of tabs, excess whitespace, etc. This will create a new data folder with the "cleaned data".
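
To give a flavour of what light-touch cleaning might look like, the sketch below collapses tabs, newlines, and repeated spaces with a single regex. The exact rules applied by utils/prepare_notes_for_lm.py are not reproduced here, so treat `clean_note` and the example note as assumptions made for illustration.

```python
import re

# Matches any run of whitespace: spaces, tabs, newlines.
_WHITESPACE = re.compile(r"\s+")


def clean_note(text: str) -> str:
    """Collapse runs of whitespace into single spaces and trim both ends."""
    return _WHITESPACE.sub(" ", text).strip()


# Made-up note containing repeated spaces, tabs, and a line break.
raw_note = "Patient   found on floor.\t\tNo injury\nnoted.  "
cleaned = clean_note(raw_note)
print(cleaned)
```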

One can also opt to process/create a sample of the training and test data, which given the size of the training data can be useful for development/debugging etc.

For example, run the following to clean the data and create a sample of 10k training notes and 2k test notes:

python utils/prepare_notes_for_lm.py --training_notes_path {full_path_to_lm_training_data} --test_notes_path {full_path_to_lm_test_data} --save_path {directory_to_save_cleaned_data} --sample --train_sample_size 10000 --test_sample_size 2000

Or, to use all the available data instead:

python utils/prepare_notes_for_lm.py --training_notes_path {full_path_to_lm_training_data} --test_notes_path {full_path_to_lm_test_data} --save_path {directory_to_save_cleaned_data}

Setup pseudo-classification datasets

We have given the user the possibility to create pseudo-classification tasks from the categorical variables provided alongside the incident report data. They are "pseudo" tasks because we use them as a way to evaluate an LM's ability to embed the structure present in the data.

All possible category/task datasets can be built from the available categorical variables and output together in one csv, or each can be processed individually and saved to its own folder.

NOTE: We have only implemented downstream models for the following two categorical variables provided with the NHS patient safety incident reports data:

with further details provided inside the classification_tasks folder README.md.

To create the datasets for each category individually run the following:

python utils/prepare_classification_datasets.py --data_path {path_to_training_data} --save_path {path_to_save_data} --process_individually

To combine all tasks into one, run the following without the --process_individually flag:

python utils/prepare_classification_datasets.py --data_path {path_to_training_data} --save_path {path_to_save_data}
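
The difference between the two modes can be sketched as follows. This is an assumption about the general shape of the outputs, not the logic of utils/prepare_classification_datasets.py: the column names (`text`, `category_a`, `category_b`), the helper functions, and the example rows are all hypothetical.

```python
from typing import Dict, List, Sequence

Row = Dict[str, str]


def build_individual_tasks(
    rows: Sequence[Row], text_col: str, category_cols: Sequence[str]
) -> Dict[str, List[Row]]:
    """One (text, label) dataset per categorical column, skipping empty labels."""
    tasks: Dict[str, List[Row]] = {col: [] for col in category_cols}
    for row in rows:
        for col in category_cols:
            label = row.get(col, "").strip()
            if label:
                tasks[col].append({"text": row[text_col], "label": label})
    return tasks


def build_combined(
    rows: Sequence[Row], text_col: str, category_cols: Sequence[str]
) -> List[Row]:
    """A single dataset keeping the text and every categorical column together."""
    return [
        {"text": row[text_col], **{col: row.get(col, "") for col in category_cols}}
        for row in rows
    ]


# Made-up rows with hypothetical column names.
rows = [
    {"text": "note one", "category_a": "x", "category_b": "p"},
    {"text": "note two", "category_a": "y", "category_b": ""},
]
categories = ["category_a", "category_b"]

individual = build_individual_tasks(rows, "text", categories)
combined = build_combined(rows, "text", categories)
print({task: len(data) for task, data in individual.items()}, len(combined))
```

In this sketch the individual mode drops rows that lack a label for a given task, whereas the combined mode keeps every row; whether the real script behaves the same way should be checked against its source.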

Alternative data sources


Much of the work presented here is targeted at developing and adapting language modelling techniques for a niche clinical domain, and whilst our work focused on patient safety incident reports, the codebase is largely data agnostic and can be applied to any domain.

A popular, accessible clinical dataset which could be used instead is MIMIC-III; refer to PhysioNet for details on how to access it.

At a high level, the NOTEEVENTS.csv file would provide a suitable dataset for this repository. Further, there is a wealth of research that has used these data for NLP models across a variety of language modelling and downstream tasks. A good starting point for pre-processing and curating a suitable dataset for this repo would be the following github-repo.
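
As a minimal sketch of pulling free text out of NOTEEVENTS.csv, the example below reads the TEXT column with the standard csv module and skips notes flagged in the ISERROR column. TEXT and ISERROR are columns of the MIMIC-III NOTEEVENTS table; the `iter_note_texts` helper and the in-memory sample rows are made up for illustration, and how the resulting text should be fed to this repository's scripts is left as an assumption to verify.

```python
import csv
import io
from typing import Iterator, TextIO


def iter_note_texts(handle: TextIO) -> Iterator[str]:
    """Yield the TEXT field of each note, skipping notes flagged in ISERROR."""
    # Clinical notes can exceed the csv module's default field size limit.
    csv.field_size_limit(10_000_000)
    for row in csv.DictReader(handle):
        if row.get("ISERROR", "").strip() == "1":
            continue
        yield row["TEXT"]


# Tiny in-memory stand-in with made-up rows and only a subset of the columns.
sample = io.StringIO(
    "ROW_ID,CATEGORY,ISERROR,TEXT\n"
    '1,Nursing,,"First line\nsecond line"\n'
    "2,Nursing,1,flagged as an error\n"
    "3,Physician,,Another note\n"
)
notes = list(iter_note_texts(sample))
print(len(notes))
```

For the real file, the handle would come from something like `open(path_to_noteevents, newline="", encoding="utf-8")`, where `path_to_noteevents` is a placeholder for the local path.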


We would like to thank the members of the Patient Safety Team in NHS England who engaged with us throughout the project and shared their in-depth knowledge of the area to help shape our exploration.