Open sebbacon opened 4 years ago
centiles of a log rank would probably be sufficient
Oh nice idea.
Yep, sounds good to me! Thanks
A key risk, separate from information governance (IG), is that while we are sharing codelist frequencies to help people build and review codelists, others might use the resource on its own to do poor research on prevalence or trends, e.g. without recognising that these are counts of coded events, not counts of patients.
So it would also be fine to list the codes with broad ordinal categories for how often they have ever appeared, e.g. 0-100, 100-10,000, 10,000-1,000,000, 1,000,000 and up.
(That said, "centiles of a log rank" combines utility with an informal entrance exam! Though it might be harder to explain to IG people when checking they are fine with us sharing.)
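For illustration, the broad-band idea could be sketched roughly like this. It is only a sketch: the function name `usage_band` is made up, and because the suggested ranges share their boundary values, it assumes a count that sits exactly on a boundary goes into the higher band.

```python
import bisect

# Lower edges of the second, third and fourth bands, as suggested in the thread.
BAND_EDGES = [100, 10_000, 1_000_000]
BAND_LABELS = ["0-100", "100-10,000", "10,000-1,000,000", "1,000,000 up"]


def usage_band(count):
    """Return the broad ordinal band for an event count of a single code.

    Assumption: a count exactly on a boundary (e.g. 100) falls in the
    higher band, since the thread does not say which side it belongs to.
    """
    if count < 0:
        raise ValueError("count must be non-negative")
    # bisect_right counts how many edges are <= count, which is the band index.
    return BAND_LABELS[bisect.bisect_right(BAND_EDGES, count)]
```

Showing only the band label, rather than the count itself, keeps the "is this code commonly used?" signal while making the figures much less tempting to reuse as prevalence estimates.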
Something that could make this issue much easier: NHSD have now published code usage counts for SNOMED codes.
SNOMED Code Usage in Primary Care - NHS Digital
As it's open data anyway we might not even need to worry about rounding or otherwise obfuscating the counts.
If using the above NHSD counts we should note that
Yes good to document those to be clear, but I think it's fine even with those caveats. All that's really needed is a ballpark for whether a code gets used commonly.
Also, in theory codelists will be increasingly used outside of TPP too.
Update: NHS Digital have continued to release annual files. @LFISHER7 has made a handy browser:
https://lfisher7-snomed-time-series-app-czzgeo.streamlit.app/
@milanwiedemann FYI
When authoring a new codelist, there are diminishing returns on the time the author spends getting every last code correct. On a large codelist, the long tail of apparently very unusual but possibly significant codes can take a significant amount of time to QA, possibly with very little benefit, and no-one ever finds out whether the work was useful.
It would be good to have an indicator of the frequency of each code for authors to access when making decisions about code inclusion (or exclusion).
From an IG point of view it's been suggested we just start with frequencies for codes in our studies, but this doesn't help with the primary use-case, which is helping authors of new codelists (or editors of existing ones) to make decisions about adding unusual codes.
From a research point of view there's a concern that code counts could be misused for very crude studies on their own.
For the purposes of codelist authoring, centiles of a log rank would probably be sufficient: we just want to know "this code is vanishingly infrequently used". This could probably be across a single year of data aggregated across TPP and EMIS. This may address possible concerns around data stewardship etc.
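"Centiles of a log rank" can be read in more than one way. One possible reading, sketched here under stated assumptions, is: rank codes from most to least used, take the natural log of each rank, scale it against the log of the lowest rank, and round up to a whole centile from 1 to 100. The function name `log_rank_centiles` and the tie-breaking rule (alphabetical by code) are assumptions for illustration, not anything agreed in this issue.

```python
import math


def log_rank_centiles(counts):
    """Map each code to a centile (1-100) of its log rank.

    `counts` is a dict of code -> event count. Rank 1 is the most used code.
    Assumption: ties in count are broken alphabetically by code.
    """
    # Most frequently used code first.
    ordered = sorted(counts, key=lambda code: (-counts[code], code))
    n = len(ordered)
    if n == 0:
        return {}
    if n == 1:
        return {ordered[0]: 1}

    max_log = math.log(n)  # log rank of the least used code
    centiles = {}
    for rank, code in enumerate(ordered, start=1):
        fraction = math.log(rank) / max_log  # 0.0 for rank 1, 1.0 for rank n
        centiles[code] = max(1, math.ceil(fraction * 100))
    return centiles
```

With four made-up codes, `log_rank_centiles({"A": 1_000_000, "B": 5_000, "C": 40, "D": 1})` puts "A" in centile 1 and "D" in centile 100. Because the scale is logarithmic in rank, the handful of very common codes spread across the low centiles while the long tail of rarely used codes bunches up near 100, which is the "vanishingly infrequently used" signal an author needs, without exposing the underlying counts.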