Better support for reference sets aka refsets

wjchulme commented 1 year ago

Roughly speaking, a reference set (or refset) is a subset of codes in a clinical coding system (eg SNOMED). A refset defines a set of related concepts in a given context and is often used to restrict the set of possible values that can be used for a particular variable. For example, the codes for "patient diagnosis" and "discharge diagnosis" in SUS's emergency care dataset are restricted to the Emergency care diagnosis (991411000000109) SNOMED refset, containing around 1000 codes. You can view this refset on the NHS term browser (click "members" to see each individual SNOMED code in the refset).

Refsets are basically codelists. As codelists, it makes sense to track them on opencodelists to facilitate integration into opensafely workflows. There's nothing stopping us adding a bunch of refsets to opencodelists now, but there are a few improvements to opencodelists that would make refset capture more useful:

- Curation and versioning of NHS refsets is done by NHS England. If a refset is on opencodelists, any updates to it (eg a new code added) should be reflected in opencodelists without too much friction. Ideally this is automated. Maybe there's a reliable API for the term browser, or NHS data dictionary, that could be used to facilitate this? More research needed!
- Codes can be removed (deprecated) from a refset as well as added. Depending on the context, we may still want to use those deprecated codes. For example, if a code was valid in 2020 but is now deprecated, then data queries covering 2020 should still include the deprecated code. A way to capture "valid from" and "valid to" dates for each code would therefore be more helpful than simply removing deprecated codes from the latest version, with users able to decide which valid range to use. It's likely that the most common use case is "give me every currently valid and previously valid code in the refset", so this option should be supported as a minimum.
- Categorisation. A common thing to want to do with refsets is categorise their codes into mutually exclusive categories. For example, the SNOMED codes available for expressing discharge diagnoses in emergency care can be categorised into [trauma, cardiovascular, respiratory, ...]. Categorising a refset in this way is equivalent to developing a collection of codelists, one for each category. Some researchers already do this, but it's messy (for hopefully obvious reasons). It would be better to enable categorisation as an attribute of the refset itself, including multiple categorisations on the same refset. Categorisation is already possible (eg we do it for ethnicity) but the tooling is basic -- users need to add a category column to the codelist data file and upload that file as a new codelist. Multiple categorisations aren't possible (I think).
- Validation. If we had up-to-date refsets in opencodelists, we'd be able to:
  - check that values in refset-restricted columns in an opensafely database table belong to the refset, as expected.
  - check that dataset definitions that query refset-restricted columns do not use codes outside the refset (eg `ecds.discharge_diagnosis.is_in([1,2])` should fail if the refset for `discharge_diagnosis` is [2,3,4]).

  This is an opensafely backend thing, not an opencodelists thing, and though it's surely been discussed before I'm mentioning it here so it's not lost.
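To make the "valid from" / "valid to" idea and the dataset-definition check a bit more concrete, here is a minimal sketch. Everything in it is hypothetical: `RefsetMember`, `codes_ever_valid`, `codes_valid_during` and `check_codes_in_refset` are illustrative names, not existing opencodelists or opensafely APIs, and the codes and dates are toy values echoing the [2,3,4] example above.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class RefsetMember:
    """One code in a refset, with the period during which it was valid (hypothetical model)."""
    code: str
    valid_from: date
    valid_to: Optional[date] = None  # None means the code is still valid


def codes_ever_valid(members: Iterable[RefsetMember]) -> Set[str]:
    """Every currently valid and previously valid code in the refset."""
    return {m.code for m in members}


def codes_valid_during(members: Iterable[RefsetMember], start: date, end: date) -> Set[str]:
    """Codes whose validity range overlaps the inclusive period [start, end]."""
    return {
        m.code
        for m in members
        if m.valid_from <= end and (m.valid_to is None or m.valid_to >= start)
    }


def check_codes_in_refset(used_codes: Iterable[str], allowed_codes: Iterable[str]) -> None:
    """Raise ValueError if a query uses codes that are not in the refset."""
    unknown = sorted(set(used_codes) - set(allowed_codes))
    if unknown:
        raise ValueError(f"codes not in refset: {unknown}")


# Toy refset: code "3" was deprecated at the end of 2020, code "4" was added in 2021.
members = [
    RefsetMember("2", date(2019, 1, 1)),
    RefsetMember("3", date(2019, 1, 1), date(2020, 12, 31)),
    RefsetMember("4", date(2021, 4, 1)),
]

valid_in_2020 = codes_valid_during(members, date(2020, 1, 1), date(2020, 12, 31))
check_codes_in_refset(["2", "3"], valid_in_2020)  # passes silently
```

With this shape, a query covering 2020 would still accept the deprecated code "3", while `check_codes_in_refset(["1", "2"], codes_ever_valid(members))` would raise because "1" was never in the refset.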

This is just a starting point, and probably needs to be split into multiple issues if/when things get going!

brianmackenna commented 1 year ago

> Maybe there's a reliable API for the term browser, or NHS data dictionary, that could be used to facilitate this? More research needed!

As it has been described to me, the NHS Terminology Server should do this, but we will need to investigate whether it can meet our needs in practice.

Jongmassey commented 5 months ago

Yes, we would have to request a system-to-system account to use it. Alternatively we could just download the whole shebang from TRUD every time they update it and query it locally.
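As a rough sketch of the "query it locally" option: SNOMED release files use the tab-separated RF2 format, where a simple refset file has the columns `id`, `effectiveTime`, `active`, `moduleId`, `refsetId` and `referencedComponentId`. The example below reads text in that layout with the standard library only. The function name `refset_members` is hypothetical, the sample rows (member ids, module id, member codes) are made up, and a real TRUD download would need unzipping and locating the right file first.

```python
import csv
import io
from typing import Dict, Set

HEADER = ["id", "effectiveTime", "active", "moduleId", "refsetId", "referencedComponentId"]

# Made-up rows for illustration; real member ids are UUIDs and real codes are SNOMED concept ids.
ROWS = [
    ["m1", "20200401", "1", "mod", "991411000000109", "1001"],
    ["m2", "20200401", "1", "mod", "991411000000109", "1002"],
    ["m2", "20220401", "0", "mod", "991411000000109", "1002"],  # later row deactivates m2
    ["m3", "20200401", "1", "mod", "999999", "1003"],           # belongs to a different refset
]
SAMPLE = "\n".join("\t".join(row) for row in [HEADER, *ROWS])


def refset_members(rf2_text: str, refset_id: str, include_inactive: bool = False) -> Set[str]:
    """Return member codes of one refset from RF2 simple-refset text.

    For each member id only the row with the latest effectiveTime is used,
    so the function works on files that contain several versions of a row.
    """
    latest: Dict[str, dict] = {}
    reader = csv.DictReader(io.StringIO(rf2_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row["refsetId"] != refset_id:
            continue
        previous = latest.get(row["id"])
        # effectiveTime is YYYYMMDD, so string comparison orders rows by date.
        if previous is None or row["effectiveTime"] > previous["effectiveTime"]:
            latest[row["id"]] = row
    return {
        row["referencedComponentId"]
        for row in latest.values()
        if include_inactive or row["active"] == "1"
    }


current = refset_members(SAMPLE, "991411000000109")
ever = refset_members(SAMPLE, "991411000000109", include_inactive=True)
```

Here `current` holds only the codes that are active in their latest row, while `ever` also keeps codes that were later deactivated, which roughly corresponds to the "currently valid and previously valid" option discussed above.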

Refsets would be a great starting point, but there are loads more useful things in SNOMED that we could make use of, both for clinical code lists and for medication code lists.

opensafely-core / opencodelists

Better support for reference sets aka refsets #1564