Automated detection for use of legally restricted codes

HelenCEBM commented 2 months ago

There are approximately 6 legally restricted code groups that cannot be returned in OpenSAFELY data (referenced in the DPIA), e.g. for termination of pregnancy. However, this is not yet well documented for users, and it's easy to create a codelist that contains these codes and run a query without any warning that some codes will never match any records. Occasionally this produces a surprising zero-match result that gets noticed, but usually it fails silently, i.e. it produces an incomplete result that can go unnoticed.

If an application clearly depends on these codes, this will be picked up at the application stage, but several groups have tried to use these codes as part of a wider study without realising they are restricted.

Possible solutions:

1. Load the restricted code groups into OpenCodelists and allow users to diff them with each of their codelists if they think they might be using some of them.
   - Pros: easy
   - Cons: very manual for users; users may not be aware when to do this and may miss it in docs even if it applies to them.
   - NB we haven't yet documented the diff feature afaik
2. As (1), but add: give users a warning in OpenCodelists when they create a codelist that has any matches with a legally restricted codelist.
   - Pros: more automated for users
   - Cons: reusing an existing codelist won't prompt a warning; may be irrelevant or confusing for people using OC for non-OS purposes.
3. Give users a warning in the OpenSAFELY interface, e.g. when updating codelists, about any matches with restricted codes.
   - Pros: all users will get the warning and be made aware of these otherwise silent errors with their codelists.
   - Cons: most complicated solution to implement?
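
The matching step that options 2 and 3 both rely on could be sketched roughly as below. This is a minimal illustration under assumptions, not OpenCodelists or OpenSAFELY code: the function name `find_restricted_matches`, the group names, and all of the codes are hypothetical placeholders.

```python
def find_restricted_matches(codelist, restricted_groups):
    """Return {group name: sorted codes present in both the codelist and that group}."""
    codes = {str(code).strip() for code in codelist}
    matches = {}
    for name, restricted in restricted_groups.items():
        overlap = codes & set(restricted)
        if overlap:
            matches[name] = sorted(overlap)
    return matches


# Placeholder codes for illustration only; these are not real SNOMED CT codes.
restricted_groups = {
    "termination of pregnancy": {"1001", "1002"},
    "assisted fertilisation": {"2001"},
}
user_codelist = ["1002", "3003", "2001"]

warnings = find_restricted_matches(user_codelist, restricted_groups)
for name, codes in warnings.items():
    print(f"Warning: {len(codes)} code(s) match restricted group '{name}': {codes}")
```

Where this check runs (on codelist creation in OpenCodelists, or on codelist update in OpenSAFELY) is what distinguishes the options; the set intersection itself would be the same.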

Notes

For all of the solutions, the restricted codelists that we create on OC will require regular (automated) checking to make sure they're kept up to date.
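
One way the automated check might work is to compare the codes currently published on OC against a fresh extraction and flag any difference. The sketch below assumes both are available as plain collections of code strings; `diff_codelist` and `is_up_to_date` are hypothetical helper names, and the codes are placeholders.

```python
def diff_codelist(published, latest):
    """Compare the published restricted codelist with a fresh extraction."""
    published, latest = set(published), set(latest)
    return {
        "added": sorted(latest - published),    # in the new release only
        "removed": sorted(published - latest),  # no longer in the release
    }


def is_up_to_date(published, latest):
    """True when the published codelist matches the fresh extraction exactly."""
    changes = diff_codelist(published, latest)
    return not changes["added"] and not changes["removed"]


# Placeholder codes for illustration only.
published_codes = ["1001", "1002", "1003"]
latest_codes = ["1001", "1003", "1004"]

changes = diff_codelist(published_codes, latest_codes)
if not is_up_to_date(published_codes, latest_codes):
    print(f"Restricted codelist is stale: {changes}")
```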

HelenCEBM commented 2 months ago

From @Jongmassey

All four are there in the UK clinical extensions to SNOMED:

- General practice summary data sharing exclusion for gender related issues simple reference set: 999004351000000109
- General practice summary data sharing exclusion for assisted fertilisation simple reference set: 999004371000000100
- General practice summary data sharing exclusion for termination of pregnancy simple reference set: 999004361000000107
- General practice summary data sharing exclusion for sexually transmitted disease simple reference set: 999004381000000103

and it's possible to get all the member codes for each. It's just a great big download from TRUD every time, and AFAICT there's not a convenient mechanism already in place in OpenCodelists to do this automatically.

Jongmassey commented 2 months ago

rough prototype


import csv
from collections import defaultdict
from pathlib import Path

# Paths are relative to the extracted TRUD release folder.
description_file = next(
    Path("Full/Terminology/").glob("sct2_Description_UKCRFull*.txt")
)
exclusion_refset_pattern = "General practice summary data sharing exclusion"

# Map each exclusion refset's concept ID to its description term.
# RF2 files are tab-delimited with no quoting, hence QUOTE_NONE.
with description_file.open("r", encoding="utf-8") as f:
    reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
    exclusion_refset_concepts = {
        r["conceptId"]: r["term"]
        for r in reader
        if exclusion_refset_pattern in r["term"]
    }

# Collect the member codes of each exclusion refset.
# NB: the Full release keeps historical rows and this does not filter on
# `active`, so it may include duplicates or members that were later inactivated.
excluded_concepts = defaultdict(list)
content_file = next(
    Path("Full/Refset/Content/").glob("der2_Refset_SimpleUKCRFull*.txt")
)
with content_file.open("r", encoding="utf-8") as f:
    reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
    for r in reader:
        if r["refsetId"] in exclusion_refset_concepts:
            excluded_concepts[r["refsetId"]].append(
                {"conceptId": r["referencedComponentId"]}
            )

# Write one CSV per refset, named after its concept ID and term.
for conceptId, term in exclusion_refset_concepts.items():
    with open(f"{conceptId}_{term.replace(' ','-')}.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["conceptId"])
        writer.writeheader()
        writer.writerows(excluded_concepts[conceptId])

Jongmassey commented 2 months ago

This uses the SnomedCT_UKClinicalRefsetsRF2_PRODUCTION... folder from the latest "SNOMED CT UK Clinical Edition, RF2: Full, Snapshot & Delta" release on TRUD.
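
Because an RF2 Full release holds every historical version of each refset member, a stricter extraction would probably keep only the most recent row per member `id` and then drop rows whose `active` flag is `0`. The sketch below shows one way to do that; `current_active_members` is a hypothetical helper, the column names are the standard RF2 simple refset columns, and the sample rows use placeholder IDs rather than real release data.

```python
import csv
import io


def current_active_members(rows, refset_ids):
    """Return {refsetId: set of referencedComponentId} for currently active members."""
    latest = {}
    for r in rows:
        if r["refsetId"] not in refset_ids:
            continue
        prev = latest.get(r["id"])
        # effectiveTime is YYYYMMDD, so string comparison orders it correctly.
        if prev is None or r["effectiveTime"] > prev["effectiveTime"]:
            latest[r["id"]] = r
    members = {}
    for r in latest.values():
        if r["active"] == "1":
            members.setdefault(r["refsetId"], set()).add(r["referencedComponentId"])
    return members


# Placeholder rows for illustration only (not real SNOMED CT identifiers).
header = ["id", "effectiveTime", "active", "moduleId", "refsetId", "referencedComponentId"]
sample_rows = [
    ["a", "20200101", "1", "m", "REFSET_A", "111"],
    ["a", "20210101", "0", "m", "REFSET_A", "111"],  # member "a" later inactivated
    ["b", "20200101", "1", "m", "REFSET_A", "222"],
    ["c", "20200101", "1", "m", "REFSET_B", "333"],  # a refset we are not interested in
]
sample = "\n".join("\t".join(row) for row in [header, *sample_rows])

reader = csv.DictReader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
members = current_active_members(reader, {"REFSET_A"})
print(members)  # {'REFSET_A': {'222'}}
```

Reading the Snapshot files from the same release instead may avoid the need for the "latest row" step, since a Snapshot holds only the most recent version of each member, though the `active` filter would still apply.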

HelenCEBM commented 2 months ago

opensafely-core/opencodelists: Automated detection for use of legally restricted codes #1961
