nhsengland / privfp-experiments

Exploration and experiment repository for extending privfp-poc
https://nhsengland.github.io/privfp-experiments/
MIT License

Privacy FingerPrint (PrivFp) - Phase Two - Experiments

NHS England Data Science Team

:warning: Warning to Users :warning:

This codebase is a proof of concept and should only be used for demonstration purposes within a controlled environment. The components are not a live product and should not be deployed in a live or production environment.

About the Project

status: experimental Code style: black

This repository holds code for Privacy FingerPrint (PrivFp) - Phase 2 Experiments. The original proof of concept can be found here. The aim of the wider project is to develop a modular tool that could be used to calculate a privacy risk score on unstructured clinical data.

This repository builds on previous work that initially used GPT-3.5 (for the generative component) and Amazon Comprehend Medical (for the extraction component), replacing those components with open-source equivalents.

Note: Only public or fake data are shared in this repository.

Project structure

The main sections of this repository are:

+---data                                   <- Folder where synthetic data is stored
|
+---docs                                   <- MkDocs root directory
|   +---assets                             <- Additional assets for MkDocs
|   +---open-source-extraction-exploration <- Extraction component documentation
|   +---open-source-llm-exploration        <- LLM exploration documentation
|
+---models                                 <- Folder to hold all saved models to help run pipelines faster after configuration has been run
|
+---notebooks                              <- Folder containing notebooks to explore each module's code
|   +---generative_module                  <- Folder containing notebooks that run the generative module
|   +---extraction_module                  <- Folder containing notebooks that run the extraction module
|
+---overrides                              <- Custom HTML for MkDocs
|
+---src                                    <- Scripts with functions for use in .ipynb notebooks located in the notebooks folder
|   +---ner_pipeline                       <- Contains scripts that can be used to run a named-entity-recognition pipeline
|
|   .gitignore                             <- Files (& file types) automatically removed from version control for security purposes
|   LICENCE                                <- License info for public distribution
|   mkdocs.yml                             <- MkDocs configuration file
|   README.md                              <- Quick start guide / explanation of the project
|   requirements_docs.txt                  <- Requirements needed to run MkDocs locally and develop the docs
|   requirements_scispacy.txt              <- Requirements needed to run the scispaCy notebook
|   requirements.txt                       <- Requirements to run all notebooks except where scispaCy is used

This diagram illustrates the current state of the project and the structure of each module.

Project Diagram

Getting Started

Built With

Python v3.11

Repo Installation

Assuming you have Git installed, the repository can be cloned over HTTPS by running:

git clone https://github.com/nhsengland/privfp-experiments.git

Alternatively, clone via SSH or the GitHub CLI.

Setup

All of the setup instructions for this repository are in the MkDocs documentation. In brief, the docs cover:

  1. HomeBrew Installation
  2. Julia Installation for (py)CorrectMatch
  3. Environment Setup
  4. Setting up Synthea
  5. Install Ollama and set up Large Language Models
  6. [OPTIONAL] Install UniversalNER Locally

Pipelines and Experiments

For those wanting to familiarise themselves with how PrivFp can be used, we have a notebook stepping through the pipeline here.
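For readers who want a feel for the pipeline's shape before opening the notebook, the sketch below shows how the modules compose: generate a synthetic note, extract named entities, then score the result. All function bodies here are illustrative stand-ins, not the project's real implementations (the real pipeline uses an LLM for generation, GLiNER/UniversalNER for extraction, and (py)CorrectMatch for uniqueness estimation).

```python
import re

def generate_note(name: str, nhs_number: str) -> str:
    """Stand-in for the generative module (an LLM in the real pipeline)."""
    return f"Patient {name} (NHS number {nhs_number}) presented with a persistent cough."

def extract_entities(note: str) -> list[dict]:
    """Stand-in for the extraction module: toy regex-based NER on fake data."""
    entities = []
    m = re.search(r"Patient (\w+ \w+)", note)
    if m:
        entities.append({"text": m.group(1), "label": "PERSON"})
    m = re.search(r"NHS number (\d+)", note)
    if m:
        entities.append({"text": m.group(1), "label": "NHS_NUMBER"})
    return entities

def privacy_score(entities: list[dict]) -> int:
    """Toy scoring step: count distinct identifying entities.
    The real project estimates uniqueness with (py)CorrectMatch instead."""
    return len({(e["text"], e["label"]) for e in entities})

# Only fake data is used here, in line with the repository's data policy.
note = generate_note("Jane Doe", "4857773456")
ents = extract_entities(note)
score = privacy_score(ents)
```

The point of the sketch is the module boundaries: each stage consumes the previous stage's output, so any component can be swapped (as the project did when moving from GPT-3.5 and Comprehend Medical to open equivalents).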

Further, we ran Experiment 1.0 and include a write-up in the docs here.

Generative Module

Usage

In ./notebooks/generative_module there is a set of notebooks exploring how to run inference using different methods for different use cases.
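Since the setup docs install Ollama to serve models locally, a minimal way to run inference outside the notebooks is to POST to Ollama's HTTP API. The sketch below builds a non-streaming request for Ollama's `/api/generate` endpoint; the model name `llama3` is an example, and the live call naturally requires a running Ollama server.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_request(model: str, prompt: str) -> dict:
    """Build a non-streaming generation request body for Ollama's /api/generate."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model: str, prompt: str) -> str:
    """POST the request to a locally running Ollama server and return the text."""
    payload = json.dumps(build_request(model, prompt)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Example (requires `ollama serve` and a pulled model):
# text = generate("llama3", "Write a short synthetic discharge summary.")
```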

Outputs

Note that a seed has not been implemented to reproduce the outputs shown.
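For anyone wanting to add reproducibility themselves: fixing a seed makes sampling deterministic, and Ollama exposes a `seed` generation option that could be set for this purpose (an assumption about the serving layer, not something this project currently does). The principle is easy to demonstrate with the standard library:

```python
import random

# Seeding the generator makes the sampled sequence repeatable.
random.seed(42)
first = [random.random() for _ in range(3)]

random.seed(42)  # same seed -> identical sequence
second = [random.random() for _ in range(3)]

assert first == second
```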

Model License Agreements

This project currently uses the Llama family of models from Meta AI by default. Usage of these models is governed by the license provided by Meta.

By using these models in this project, you agree to comply with the licensing terms provided by Meta.

For any large commercial use or further inquiries, please contact Meta AI directly.

Llama 2

Llama 2 is licensed under Meta's community license. For more details, please refer to Meta's Llama 2 Licensing page. To sign the corresponding license agreement, you can apply via Hugging Face.

Llama 3

Llama 3 is licensed under Meta's community license. For more details, please refer to Meta's Llama 3 Licensing page. To sign the corresponding license agreement, you can apply via Hugging Face.

Llama 3.1

Llama 3.1 is licensed under Meta's community license. For more details, please refer to Meta's Llama 3.1 Licensing page. To sign the corresponding license agreement, you can apply via Hugging Face.

Extraction Module

Usage

Although we have more recently moved to GLiNER as the base NER model, ./notebooks/extraction_module/ner_exploration contains a set of notebooks exploring how to implement a range of named entity recognition (NER) models.
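As a rough illustration of what working with GLiNER predictions looks like: `predict_entities` returns a list of dicts with `"text"`, `"label"` and `"score"` keys, which you typically filter by confidence. The GLiNER call below is commented out because it needs the `gliner` package and a model download; the model name `urchade/gliner_base` is only an example, and the filtering helper is ours, not part of the library.

```python
def filter_entities(entities: list[dict], min_score: float = 0.5) -> list[dict]:
    """Keep only predictions at or above a confidence threshold."""
    return [e for e in entities if e["score"] >= min_score]

# Actual GLiNER usage (requires `pip install gliner` and a model download):
# from gliner import GLiNER
# model = GLiNER.from_pretrained("urchade/gliner_base")  # model name is an example
# preds = model.predict_entities(
#     "Jane Doe was prescribed amoxicillin.",
#     labels=["person", "medication"],
# )
# kept = filter_entities(preds)

# Demonstrate the helper on fake predictions:
sample = [
    {"text": "Jane Doe", "label": "person", "score": 0.92},
    {"text": "cough", "label": "medication", "score": 0.31},
]
kept = filter_entities(sample)
```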

The quantised model was created by cloning the llama.cpp repo and quantising Universal-NER/UniNER-7B-type locally to a quantized_q4_1.gguf format.

The llama.cpp repo provides guidance in its Prepare and Quantize section. Alternatively, there is a Medium article that walks through the process step by step.

Extra guidance on serving a model in this repo is outlined in the llama.cpp serving documentation.
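Once the quantised model is being served by llama.cpp's server, inference is again a matter of POSTing JSON, this time to its `/completion` endpoint. The sketch below builds a request body with the `prompt` and `n_predict` fields described in the llama.cpp server docs; the port and response field name reflect its defaults and should be checked against the version you run.

```python
import json
import urllib.request

SERVER_URL = "http://localhost:8080/completion"  # llama.cpp server's default port

def build_completion_request(prompt: str, n_predict: int = 64) -> dict:
    """Build a request body for llama.cpp's /completion endpoint."""
    return {"prompt": prompt, "n_predict": n_predict}

def complete(prompt: str) -> str:
    """POST to a locally running llama-server (requires the server to be up)."""
    body = json.dumps(build_completion_request(prompt)).encode()
    req = urllib.request.Request(
        SERVER_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["content"]

# Example (after `llama-server -m quantized_q4_1.gguf`):
# print(complete("Extract the entities from: Jane Doe, NHS number 4857773456."))
```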

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

See CONTRIBUTING.md for detailed guidance.

Licence

Unless stated otherwise, the codebase is released under the MIT Licence. This covers both the codebase and any sample code in the documentation.

See LICENSE for more information.

The documentation is © Crown copyright and available under the terms of the Open Government Licence v3.0.

Contact

This repository is maintained by the NHS England Data Science Team. To contact us, raise an issue on GitHub or get in touch via email.

Contributors (Alphabetical)