Open evansd opened 6 months ago
Thoughts:
all_diagnoses
, all_procedures
), so perhaps best to make a specific type for each of these fields rather than trying a more generalizable approach. Until we have more examples I'm not sure we know the best way to generalize itall_diagnoses
field I have observed (from previous role, rather than the data available here) that the ICD10 codes are not in the standard format (C00 or C00.3)
.
s, so C00.3 (Malignant neoplasm of upper lip, inner aspect) becomes C003perhaps best to make a specific type for each of these fields rather than trying a more generalizable approach
Yes, 100% agreed.
I have observed (from previous role, rather than the data available here) that the ICD10 codes are not in the standard format
Ah, this is very helpful (and does ring a vague bell for me as well). I think we need to do some exploratory work first to figure out exactly what codes are in here and what format they take. I think some of this has been done before but probably in quite a haphazard way in the heat of the pandemic and it would be worth doing again a bit more systematically.
Update following chat with @evansd:
We could investigate these fields properly and work out precisely what their content is- and indeed should at some point. But for this initiative it's probably ok for the quick win of providing something that satifies:
Prevent equality comparisons between the string and a single code so you can't accidentally write:
list_of_codes_field == single_code
Ensure that the arguments to
.contains()
queries are of a format that could potentially match something in the field.Note that for 2 we shouldn't require that the argument to
.contains()
is a complete valid code because it may just be a prefix to a code: for lexically hierarchical coding systems like ICD-10 it's useful to be able to match codes likeN171
using just the prefixN17
.
We have some columns containing codes which we have to declare as type
str
rather than their coding system (e.gICD-10
) because they contain multiple codes shoved into a single column. For example: https://docs.opensafely.org/ehrql/reference/schemas/tpp/#apcs.all_diagnosesWe query these columns using the
.contains()
method, which takes strings. But we still want to ensure that the strings supplied are valid for the coding system in question. The quick hacky way I did this when writing the original proof-of-concept study was to construct a code object and use the internal_to_primitive_type()
method to get the string back out of the code object:Annoyingly (but also as a positive testament to the code sharing that goes on) this has propagated itself into [various research codebases](https://github.com/search?q=org%3Aopensafely%20_to_primitive_type()&type=code).
At a minimum we should provide a non-private and less ugly way to do this. One simple way would be to define a
__str__
method onBaseCode
so we could rewrite the above as:However that still leaves the study author with the responsibility to use the correct coding system when checking the codes, which is exactly what ehrQL is supposed to avoid.
The long-term solution here is proper handling of multi-valued columns – but that's a whole lot of thinking and work and I don't propose to tackle that now.
I wonder if there's a simpler improvement here which is to define an explicit type for "bunch of codes from coding system X shoved in a string". That could catch a couple of things:
list_of_codes_field == single_code
.contains()
queries are of a format that could potentially match something in the field.Note that for 2 we shouldn't require that the argument to
.contains()
is a complete valid code because it may just be a prefix to a code: for lexically hierarchical coding systems like ICD-10 it's useful to be able to match codes likeN171
using just the prefixN17
.Related Slack threads: https://bennettoxford.slack.com/archives/C01D7H9LYKB/p1715334831886929 https://bennettoxford.slack.com/archives/C069YDR4NCA/p1715344398345399