Include code frequency indicators in builder tool

sebbacon commented 4 years ago

When authoring a new codelist, there are diminishing returns on the time the author spends on getting every last code correct. On a large codelist, this long tail of apparently-very-unusual-but-possibly-significant codes can take a lot of time to QA, possibly with very little benefit, and no-one ever finds out whether the work was useful.

It would be good to have an indicator of the frequency of each code for authors to access when making decisions about code inclusion (or exclusion).

From an information governance (IG) point of view it's been suggested we just start with frequencies for codes in our studies, but this doesn't help with the primary use-case, which is helping authors of new codelists (or editors of existing ones) to make decisions about adding unusual codes.

From a research point of view there's a concern that code counts could be misused for very crude studies on their own.

For the purposes of codelist authoring, centiles of a log rank would probably be sufficient: we just want to know "this code is vanishingly infrequently used". This could probably be across a single year of data aggregated across TPP and EMIS. This may address possible concerns around data stewardship etc.
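To make the idea concrete, here is a minimal sketch of one possible reading of "centiles of a log rank". It is an assumption, not an agreed definition: codes are ranked from most to least used, the log of each rank is scaled against the log of the total number of codes, and the result is reported as a centile from 1 (most used) to 100 (least used). The function name `log_rank_centiles` and the example counts are hypothetical.

```python
import math


def log_rank_centiles(counts):
    """Map each code to a centile (1-100) of its log rank.

    Assumed interpretation: rank 1 is the most-used code, and the
    centile is log(rank) / log(number of codes), scaled to 1-100.
    """
    # Rank codes from most to least used; ties are broken by code
    # so that the output is deterministic.
    ordered = sorted(counts, key=lambda code: (-counts[code], code))
    n = len(ordered)
    if n == 0:
        return {}
    if n == 1:
        return {ordered[0]: 1}

    max_log = math.log(n)
    centiles = {}
    for rank, code in enumerate(ordered, start=1):
        # 0.0 for the top-ranked code, 1.0 for the bottom-ranked code.
        fraction = math.log(rank) / max_log
        centiles[code] = min(100, max(1, math.ceil(fraction * 100)))
    return centiles


# Hypothetical event counts, for illustration only.
example_counts = {"A": 1_000_000, "B": 5_000, "C": 12, "D": 1}
example_centiles = log_rank_centiles(example_counts)
```

Only the centile is exposed, never the underlying count, which is the property that might ease the data stewardship concerns.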

evansd commented 4 years ago

> centiles of a log rank would probably be sufficient

Oh nice idea.

LiamSmeeth commented 4 years ago

Yep, sounds good to me! Thanks

Bengoldacre commented 4 years ago

A key (non-IG) risk is that, while we are sharing codelist frequencies to facilitate generation and review of codelists, other people might clumsily use the resource on its own to do bad research on prevalence or trends, e.g. without recognising that these are event code counts not patient counts.

So also fine would be to list the codes with broad ordinal categories for frequency of appearance ever, e.g. 0-100, 100-10,000, 10,000-1,000,000, 1,000,000 up.

(That said, "centiles of a log rank" combines utility with an informal entrance exam! Tho it might be harder to explain to IG people when checking they are fine with us sharing.)
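The broad ordinal categories suggested above could be sketched as follows. The band edges are taken from the comment; because the stated ranges share their endpoints, this sketch assumes each band includes its lower bound and excludes its upper bound. The names `BAND_EDGES`, `BAND_LABELS` and `frequency_band` are hypothetical.

```python
import bisect

# Band edges from the comment above. Assumption: a count equal to an
# edge falls into the higher band (lower bound inclusive).
BAND_EDGES = [100, 10_000, 1_000_000]
BAND_LABELS = ["0-100", "100-10,000", "10,000-1,000,000", "1,000,000 up"]


def frequency_band(count):
    """Return the broad ordinal frequency band for an event count."""
    if count < 0:
        raise ValueError("count must be non-negative")
    # bisect_right returns how many edges are <= count, which is
    # exactly the index of the band the count belongs to.
    return BAND_LABELS[bisect.bisect_right(BAND_EDGES, count)]
```

Publishing only the band label keeps the raw event counts out of view, which makes them harder to misuse as if they were patient counts.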

alexwalkerepi commented 2 years ago

Something that could make this issue much easier: NHSD have now published code usage counts for SNOMED codes.

SNOMED Code Usage in Primary Care - NHS Digital

As it's open data anyway we might not even need to worry about rounding or otherwise obfuscating the counts.

HelenCEBM commented 2 years ago

If using the above NHSD counts we should note that:

- the figures represent total usage
- total usage may not be split equally between software providers
- we cannot necessarily extract all codes from opensafely-TPP as some information is held in different tables or using different codes

alexwalkerepi commented 2 years ago

Yes good to document those to be clear, but I think it's fine even with those caveats. All that's really needed is a ballpark for whether a code gets used commonly.

Also, in theory codelists will be increasingly used outside of TPP too.

brianmackenna commented 6 months ago

Update: NHS Digital have continued to release annual files. @LFISHER7 has made a handy browser:

https://lfisher7-snomed-time-series-app-czzgeo.streamlit.app/

Jongmassey commented 1 month ago

@milanwiedemann FYI

opensafely-core / opencodelists

Include code frequency indicators in builder tool #33