Open sebbacon opened 3 years ago
I can't edit the above comment but here are some more links:
preg
codes found in men over 70 at all, and only 3 at 0.0 per thousand for pregdel
) but for women over 70 rates were much higher than at point A, summing to ~0.8 preg
codes per thousand and ~1.7 pregdel
codes per thousand (give or take rounding errors). These figures for females are similar to the counts at point B1 but the opposite way around for preg
vs pregdel
.Here are some counts of patients with pregnancy codes.
# load the CSVs
>>> df1 = pd.read_csv("output/input-20210324.csv", usecols=["age", "sex", "preg", "pregdel"])
>>> df2 = pd.read_csv("output/input_with_codes-20210324.csv", usecols=["age", "sex", "preg", "preg_date", "pregdel", "pregdel_date"])
# same total number of records in each output file
>>> len(df1)
27093304
>>> len(df2)
27093304
# total number of records with pregnancy codes
# the small difference is because some records have a pregnancy code but no date
>>> len(df1[df1["preg"].notna()])
4363958
>>> len(df2[df2["preg"].notna()])
4369385
>>> len(df2[df2["preg_date"].notna()])
4363958
# restrict data to having a pregnancy code in 2020 or after
>>> df1_preg = df1[df1["preg"] >= 2020]
>>> df2_preg = df2[df2["preg_date"] >= 2020]
# same number of records
>>> len(df1_preg)
457945
>>> len(df2_preg)
457945
# restrict data to having a pregnancy code in 2020 or after, and patient being 70+
>>> df1_preg_70plus = df1_preg[df1_preg["age"] >= 70]
>>> df2_preg_70plus = df2_preg[df2_preg["age"] >= 70]
# same number of records
>>> len(df1_preg_70plus)
232
>>> len(df2_preg_70plus)
232
# ... and we see the same with codes for pregnancy + delivery
>>> len(df1[df1["pregdel"].notna()])
6578474
>>> len(df2[df2["pregdel"].notna()])
6596578
>>> len(df2[df2["pregdel_date"].notna()])
6578474
>>>
>>> df1_pregdel = df1[df1["pregdel"] >= 2020]
>>> df2_pregdel = df2[df2["pregdel_date"] >= 2020]
>>>
>>> len(df1_pregdel)
571603
>>> len(df2_pregdel)
571603
>>>
>>> df1_pregdel_70plus = df1_pregdel[df1_pregdel["age"] >= 70]
>>> df2_pregdel_70plus = df2_pregdel[df2_pregdel["age"] >= 70]
>>>
>>> len(df1_pregdel_70plus)
510
>>> len(df2_pregdel_70plus)
510
@HelenCEBM how does this differ to what you've seen?
Ah! The codes-with-missing-dates could be the answer. I hadn't considered that it was possible for an event not to have an associated date (is this possible in TPP?). The way I was filtering the years would have left these in. And it makes sense this would give an excess of codes for older women (i.e. genuine pregnancy events that happened at some unknown date in the past) but not men.
Actions:
count(*)
!)In order to consider the implications for study definitions I have also checked which are the most common codes in the future (year>2021) and the "past" (1900, ie. unknown date) in TPP. Note these include all events and so many will be for deregistered/deceased patients.
Y4615
(33k) = batch no and expiry,72313002
(11k) = systolic bp,1091811000000102
(8.5k) = diastolic bpon_or_before
limit would be advisable.60504008
(1.14m) = marital status unknown,92391000000108
(713k) = British ethnicity (2001),Y0529
(508k) = Imported notesQuestion re. study definitions, how are missing dates currently handled by find_first_match_in_period
? E.g. if someone has two of the same events recorded in EMIS, one with a NULL date and one with a valid date, is the NULL event considered to be earlier, later, or ignored altogether? Clearly it's possible to return non-date values for events with NULL dates, so they can't be totally ignored, but perhaps this only occurs if there are no other matching events (with valid dates) for a given patient?
So it seems to me that when getting observations from the data, people need to consider:
One option is that this could be a standard action in our actions library which converts probably-unreliable dates to nulls; it should perhaps be in our template project.yaml
as a standard, but people can opt out of it.
Then our "reporting" action should, as standard, highlight the number of null dates for each date variable, and the user can decide what to do with one.
Would this make sense, do you think?
And to answer the other question, if a patient has multiple dates for a code including a null, e.g. [NULL, "2021-03-01", "2021-03-05"]
then the first date returned would be "2021-03-01"
numeric_value
for BP) will only come up if the patient had no others matching the criteria supplied. (And if the study has no date limits).binary_flag
or numeric value
could be present where date
is not) and therefore users would need to be aware of this to treat these appropriately. I'd be more in favour of normalising all unreliable dates to e.g 1900-01-01
. 1900-01-01
. This could give more predictable/consistent behaviour in results. between
or on_or_before
/after
(even if they're very wide) so that events with null dates are simply not captured in study definitions. it produces inconsistencies (where no date limits are applied) between different values returned from the same query (e.g.
binary_flag
ornumeric value
could be present wheredate
is not)
I didn't understand this, sorry. What's wrong with having a null date
and a non-null binary_flag
for one patient?
What's wrong with having a null date and a non-null binary_flag for one patient?
It's counterintuitive! I had no idea why counting different columns was giving me different results, because it didn't occur to me that dates of clinical events could be missing... but maybe that's just me.
Just to check: if the study_definition only has a filter for on_or_before
for a particular variable we will miss any events with NULL dates in EMIS? But we will capture them in TPP as occurring in 1900.
Further detail on clinical codes in TPP with unknown dates:
Radiology/physics in medicine (procedure)
Drug prescription (situation)
)EMIS unknown dates:
It is likely useful to include null dates (i.e. no on_or_after
date) for:
When looking at the current or latest situation it may be advisable to always include a on_or_after
date to exclude nulls. Examples may include:
@inglesp please could you run a similar query in EMIS to extract the top ~100 SNOMED codes with no known date? (The SQL I ran in TPP is here for reference but this will be simpler in EMIS as there are no odd CTV3 codes to fetch descriptions for).
Here you go: snomedct_codes_without_dates.csv.txt
There are seven that we don't have terms for -- I presume these are EMIS local codes.
See #2 for general background info
Context: we are writing a short data report to compare prevalences for all the codelists used in the PRIMIS vaccination specification between EMIS and TPP. This is a "smell test" for any possible problems: we would expect prevalences to be roughly similar, and where they are not, we would expect to be able to come up with plausible hyptheses why (e.g. EMIS populations tend to be younger and more ethnically diverse largely because they are over-represented in London).
The population is defined as all registered patients between the ages of 16 and 120.
In this context, we noticed a few differences which we felt were probably explainable, and one very odd one - a lot of over 70s being marked as pregnant in TPP (but not EMIS). We eventually found a satisfying hypothesis for this around user interface choices between the systems. This was observed at what we're calling Point A.
Then, in subsequent, related analyses, we started to see inconsistencies between analyses against the EMIS data:
input.csv
.Questions that arise:
Hypotheses:
input.csv
files just usedwc -l
.Notes:
outputs/
folder with datestamps, vizinput020210317.csv
(output ofstudy_definition.py
) andinput_with_codes-202103016.csv
(output ofstudy_definition_with_codes.py
)Investigation plan:
It looks like the "core" / normal pregnancy codes