Investigate strange prevalences: events with null dates

sebbacon commented 3 years ago

See #2 for general background info

Context: we are writing a short data report to compare prevalences for all the codelists used in the PRIMIS vaccination specification between EMIS and TPP. This is a "smell test" for any possible problems: we would expect prevalences to be roughly similar, and where they are not, we would expect to be able to come up with plausible hypotheses why (e.g. EMIS populations tend to be younger and more ethnically diverse largely because they are over-represented in London).

The population is defined as all registered patients between the ages of 16 and 120.

In this context, we noticed a few differences which we felt were probably explainable, and one very odd one - a lot of over 70s being marked as pregnant in TPP (but not EMIS). We eventually found a satisfying hypothesis for this around user interface choices between the systems. This was observed at what we're calling Point A.

Then, in subsequent, related analyses, we started to see inconsistencies between analyses against the EMIS data:

Label	Date / time	Characteristics	Links
A	On or before March 1	Close to zero pregnant >70yo	csv
DB1	March 15	Database refresh
B1	March 17, 10:45-12:48	Considerably more pregnant >70yo	notebook, data diff with A
B2	March 17, 12:45-15:46	Close to zero pregnant men >70yo, women >70yo might have higher rates than at A. Additionally, 70k fewer patients in `input.csv`.	notebook, preg codes csv, pregdel codes csv
C	March 24, afternoon	Close to zero pregnant >70yo
DB2	(expected March 24)

Questions that arise:

Are there, or are there not, non-trivial numbers of >70s with pregnancy codelists in EMIS?
Why have we got significantly different answers at different times?
- Specifically, both between points A and B, and between subpoints B1 and B2
Why was the total count of patients different between B1 and B2? B1 and B2 are slightly different study definitions but we are reasonably satisfied they should be selecting the same underlying population and therefore should return the same number of patients.

Hypotheses:

Inconsistencies that appear even within one database build in EMIS data
There's a manual process @inglesp currently has to do after each build: create a table which is a deduped version of the underlying patients table. We have no record of when this was done (other than that it either started or ended at 10:47:44 when last run), but it's possible he forgot at some point, and therefore one of the notebooks used older-than-expected data. If the last dedupe table was built between B1 and B2, it could explain losing 70k patients. It is plausible that the underlying data genuinely lost 70k patients, though it seems unlikely they would mostly be >70yos recorded as pregnant.
Bugs in our notebooks. But the population part of the study definition is very straightforward and our comparison of patient counts in the input.csv files just used wc -l.
Bugs in our generated SQL. But it is well-tested.

Notes:

On the EMIS server, historic outputs are saved in the outputs/ folder with datestamps, viz input020210317.csv (output of study_definition.py) and input_with_codes-202103016.csv (output of study_definition_with_codes.py)
Relevant job on the job server

Investigation plan:

Rerun with current data (happening now)
- Hopefully this will complete before the new build (also happening now) deploys and we will have two further comparison points
What are the individual codes that account for the change? (Helen has a study definition for this)
- It looks like the "core" / normal pregnancy codes

HelenCEBM commented 3 years ago

I can't edit the above comment but here are some more links:

A: csv
B2:
- preg codes csv
- pregdel codes csv
- Here, where we are looking at individual codes, the figures look much better for men (e.g. there were no preg codes found in men over 70 at all, and only 3 at 0.0 per thousand for pregdel) but for women over 70 rates were much higher than at point A, summing to ~0.8 preg codes per thousand and ~1.7 pregdel codes per thousand (give or take rounding errors). These figures for females are similar to the counts at point B1 but the opposite way around for preg vs pregdel.

inglesp commented 3 years ago

Here are some counts of patients with pregnancy codes.

>>> import pandas as pd

# load the CSVs
>>> df1 = pd.read_csv("output/input-20210324.csv", usecols=["age", "sex", "preg", "pregdel"])
>>> df2 = pd.read_csv("output/input_with_codes-20210324.csv", usecols=["age", "sex", "preg", "preg_date", "pregdel", "pregdel_date"])

# same total number of records in each output file
>>> len(df1)
27093304
>>> len(df2)
27093304

# total number of records with pregnancy codes
# the small difference is because some records have a pregnancy code but no date
>>> len(df1[df1["preg"].notna()])
4363958
>>> len(df2[df2["preg"].notna()])
4369385
>>> len(df2[df2["preg_date"].notna()])
4363958

# restrict data to having a pregnancy code in 2020 or after
>>> df1_preg = df1[df1["preg"] >= 2020]
>>> df2_preg = df2[df2["preg_date"] >= 2020]

# same number of records
>>> len(df1_preg)
457945
>>> len(df2_preg)
457945

# restrict data to having a pregnancy code in 2020 or after, and patient being 70+
>>> df1_preg_70plus = df1_preg[df1_preg["age"] >= 70]
>>> df2_preg_70plus = df2_preg[df2_preg["age"] >= 70]

# same number of records
>>> len(df1_preg_70plus)
232
>>> len(df2_preg_70plus)
232

# ... and we see the same with codes for pregnancy + delivery
>>> len(df1[df1["pregdel"].notna()])
6578474
>>> len(df2[df2["pregdel"].notna()])
6596578
>>> len(df2[df2["pregdel_date"].notna()])
6578474
>>> 
>>> df1_pregdel = df1[df1["pregdel"] >= 2020]
>>> df2_pregdel = df2[df2["pregdel_date"] >= 2020]
>>> 
>>> len(df1_pregdel)
571603
>>> len(df2_pregdel)
571603
>>> 
>>> df1_pregdel_70plus = df1_pregdel[df1_pregdel["age"] >= 70]
>>> df2_pregdel_70plus = df2_pregdel[df2_pregdel["age"] >= 70]
>>> 
>>> len(df1_pregdel_70plus)
510
>>> len(df2_pregdel_70plus)
510

@HelenCEBM how does this differ to what you've seen?

HelenCEBM commented 3 years ago

Ah! The codes-with-missing-dates could be the answer. I hadn't considered that it was possible for an event not to have an associated date (is this possible in TPP?). The way I was filtering the years would have left these in. And it makes sense this would give an excess of codes for older women (i.e. genuine pregnancy events that happened at some unknown date in the past) but not men.
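A minimal pandas sketch of how a year filter can leave undated codes in, assuming a year column where a missing date is read as NaN. The filter actually used in the notebooks is not shown in this thread, so the `~(year < 2020)` form and the toy data below are hypothetical. Any comparison against NaN is False, so "2020 or later" drops undated rows while "not before 2020" keeps them.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): the third patient has a pregnancy code but no date.
df = pd.DataFrame({
    "age": [34, 29, 75],
    "preg_date": [2019.0, 2021.0, np.nan],
})

# "2020 or later": NaN >= 2020 is False, so the undated row is dropped.
strict = df[df["preg_date"] >= 2020]

# "Not before 2020": NaN < 2020 is also False, so negating it keeps the undated row.
loose = df[~(df["preg_date"] < 2020)]

print(len(strict))  # 1
print(len(loose))   # 2
```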

sebbacon commented 3 years ago

Actions:

[x] @inglesp to do a direct count of numbers of nulls in event dates in EMISX - perhaps also count of max and min?
- Answer: about 0.08% have NULL in EMISX (17452988 / 22391725737)
- 3952 things in 9999
- Long tail of things in previous 8000 years, with peaks around 2620, 2199 etc, and 1000s in 2062 - 2099
- 4000 things in 1850
- CSV with per-year count of observations in next comment
[x] @HelenCEBM to do same in TPP
- Answer 0 nulls (couldn't get total events from SQL query, SQL threw an error with count(*)!)
- 14.8m in 1900 (with tens of thousands in adjacent years 1899 and 1901) - this is comparable to the number of nulls in EMIS.
- Range (with 99 or more events) 1840-2099 and 9999.
- Small peaks at 2050, 2060, .. , 2090 and 2099 (low thousands)
- Small peaks also at 1840, 1850 and 1860 (11k in 1860)
- 39k in 2022, 15k in 2023.
[ ] @HelenCEBM to ask Chris if & how empty dates are handled in TPP
[ ] As a result of the above investigations, consider if/how to handle this in study defs so mistakes are impossible and there's a consistent way of filtering across both backends
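For reference, the EMIS null percentage quoted in the actions can be reproduced from the two counts, and a per-year breakdown like the ones described can be sketched in pandas. The `summarise_years` helper and the toy input are hypothetical; the real counts came from SQL run on the backends.

```python
import pandas as pd

# Figures quoted above for EMIS.
null_events = 17_452_988
total_events = 22_391_725_737
null_pct = 100 * null_events / total_events
print(f"{null_pct:.2f}% of events have a NULL date")  # 0.08%

def summarise_years(years: pd.Series):
    """Return (events per year, number of events with no year)."""
    null_count = int(years.isna().sum())
    per_year = years.dropna().astype(int).value_counts().sort_index()
    return per_year, null_count

# Hypothetical toy input: event years, with None standing in for a NULL date.
toy_years = pd.Series([1900, 2020, 2020, None, 9999])
per_year, null_count = summarise_years(toy_years)
```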

inglesp commented 3 years ago

emis-counts.csv.txt

HelenCEBM commented 3 years ago

In order to consider the implications for study definitions I have also checked which are the most common codes in the future (year>2021) and the "past" (1900, ie. unknown date) in TPP. Note these include all events and so many will be for deregistered/deceased patients.

Future:

Y4615 (33k) = batch no and expiry,
72313002 (11k) = systolic bp,
1091811000000102 (8.5k) = diastolic bp

Probably these are just very common codes, and are occasionally entered with an incorrect date, which is sometimes in the future.
Taking BPs as an example, when looking at these values in a study definition we would probably want to exclude them, because, if the dates are simply incorrect then essentially the dates are unknown and we can't assume they are later than any readings with a realistic date. So an on_or_before limit would be advisable.
The numbers are quite low so not likely to have a large impact.
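A rough pandas sketch of why an upper date limit matters when taking the latest reading. The column names, the `latest_reading` helper and the values are made up for illustration; a real study definition would express this through its own on_or_before argument rather than pandas.

```python
import pandas as pd

def latest_reading(events: pd.DataFrame, on_or_before: str) -> pd.Series:
    """Return each patient's most recent value dated on or before the limit."""
    limit = pd.Timestamp(on_or_before)
    in_range = events[events["date"] <= limit]
    # Sort by date so that the last row per patient is the most recent one.
    return in_range.sort_values("date").groupby("patient_id")["value"].last()

# Hypothetical systolic BP readings; patient 1 has one reading dated in the future.
bp = pd.DataFrame({
    "patient_id": [1, 1, 2],
    "date": pd.to_datetime(["2021-01-10", "2023-05-01", "2020-11-30"]),
    "value": [128.0, 190.0, 117.0],
})

latest = latest_reading(bp, on_or_before="2021-03-24")
```

Without the limit, patient 1's "latest" reading would be the future-dated one, even though its true date is unknown.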

"Past" (unknown date):

60504008 (1.14m) = marital status unknown,
92391000000108 (713k) = British ethnicity (2001),
Y0529 (508k) = Imported notes

As indicated by the third code, events with unknown dates may be largely imported. It's also possible that demographic information is allowed to be entered without dates because e.g. the effective-from date of someone's marital status would not be known.
In the case of ethnicity it is useful to include entries with an unknown date as these don't change over time. However, ethnicity may be a special case in this respect.

HelenCEBM commented 3 years ago

Question re. study definitions, how are missing dates currently handled by find_first_match_in_period? E.g. if someone has two of the same events recorded in EMIS, one with a NULL date and one with a valid date, is the NULL event considered to be earlier, later, or ignored altogether? Clearly it's possible to return non-date values for events with NULL dates, so they can't be totally ignored, but perhaps this only occurs if there are no other matching events (with valid dates) for a given patient?

sebbacon commented 3 years ago

So it seems to me that when getting observations from the data, people need to consider:

Is this a commonly-varying (BP, cholesterol) or usually-constant value (sex, ethnicity)?
- This is a sliding scale! e.g. smoking, height are somewhere in the middle
What dates should we consider unreliable: probably anything on or before 1900-01-01, in the future, null, or (ideally) before the birth date of the patient
We should consider normalising unreliable dates to nulls
Commonly-varying observations with null dates should be censored
Usually-constant value observations should not be censored

One option is that this could be a standard action in our actions library which converts probably-unreliable dates to nulls; it should perhaps be in our template project.yaml as a standard, but people can opt out of it.

Then our "reporting" action should, as standard, highlight the number of null dates for each date variable, and the user can decide what to do with them.
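A sketch of what such a normalising step and null-date report might look like in pandas, under the rules listed above (on or before 1900-01-01, in the future, null, or before the patient's birth date). The function name, column names and `as_of` parameter are assumptions for illustration, not an existing action.

```python
import pandas as pd

def normalise_unreliable_dates(dates, birth_dates, as_of):
    """Replace probably-unreliable event dates with NaT."""
    # errors="coerce" turns values that cannot be parsed into NaT.
    dates = pd.to_datetime(dates, errors="coerce")
    birth_dates = pd.to_datetime(birth_dates, errors="coerce")
    unreliable = (
        (dates <= pd.Timestamp("1900-01-01"))  # placeholder for "unknown"
        | (dates > pd.Timestamp(as_of))        # in the future
        | (dates < birth_dates)                # before the patient was born
    )
    return dates.mask(unreliable)

# Hypothetical toy events, one per row.
events = pd.DataFrame({
    "event_date": ["1900-01-01", "2020-06-01", "2062-01-01", None, "1950-01-01"],
    "birth_date": ["1940-01-01", "1980-01-01", "1990-01-01", "1970-01-01", "1960-01-01"],
})
events["event_date_clean"] = normalise_unreliable_dates(
    events["event_date"], events["birth_date"], as_of="2021-03-24"
)

# The "reporting" step: how many dates ended up null for this variable.
null_count = int(events["event_date_clean"].isna().sum())
```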

Would this make sense, do you think?

sebbacon commented 3 years ago

And to answer the other question, if a patient has multiple dates for a code including a null, e.g. [NULL, "2021-03-01", "2021-03-05"] then the first date returned would be "2021-03-01"
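In plain Python terms, the behaviour described here amounts to taking the earliest non-null date. This is only an illustration of the stated result, not the SQL the study definition generates, and `first_dated_match` is a hypothetical name.

```python
from datetime import date
from typing import Optional, Sequence

def first_dated_match(dates: Sequence[Optional[date]]) -> Optional[date]:
    """Earliest non-null date, or None if every matching event is undated."""
    known = [d for d in dates if d is not None]
    return min(known) if known else None

matches = [None, date(2021, 3, 1), date(2021, 3, 5)]
print(first_dated_match(matches))  # 2021-03-01
```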

HelenCEBM commented 3 years ago

So a reading with a NULL date (e.g. a numeric_value for BP) will only come up if the patient had no others matching the criteria supplied. (And if the study has no date limits).
I'm not sure about making dates NULL, because it produces inconsistencies (where no date limits are applied) between different values returned from the same query (e.g. binary_flag or numeric value could be present where date is not) and therefore users would need to be aware of this to treat these appropriately. I'd be more in favour of normalising all unreliable dates to e.g. 1900-01-01.
For numeric values we would probably want to censor results as you say, or enforce use of date limits...
I think the most urgent issue is that there's a difference between EMIS and TPP, so one solution to that would be to treat null dates in EMIS as 1900-01-01. This could give more predictable/consistent behaviour in results.
We could also add more caveats around this to the documentation and strongly encourage the use of between or on_or_before/after (even if they're very wide) so that events with null dates are simply not captured in study definitions.
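A small pandas sketch of the sentinel idea, assuming an extract with a binary flag and a date per patient. The column names are hypothetical. It shows the flag count and the date count disagreeing while nulls are present, and agreeing once undated events are given 1900-01-01.

```python
import pandas as pd

UNKNOWN_DATE = pd.Timestamp("1900-01-01")  # sentinel proposed above

# Hypothetical EMIS-style extract: patient 2 has the event but no date,
# patient 3 does not have the event at all.
emis = pd.DataFrame({
    "has_event": [1, 1, 0],
    "event_date": pd.to_datetime(["2020-05-01", None, None]),
})

# Before: two patients are flagged but only one has a date.
flags_before = int(emis["has_event"].sum())
dates_before = int(emis["event_date"].notna().sum())

# Fill the sentinel only where an event exists, so patients
# without the event keep a null date.
undated = (emis["has_event"] == 1) & emis["event_date"].isna()
emis.loc[undated, "event_date"] = UNKNOWN_DATE

dates_after = int(emis["event_date"].notna().sum())
```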

sebbacon commented 3 years ago

> it produces inconsistencies (where no date limits are applied) between different values returned from the same query (e.g. binary_flag or numeric value could be present where date is not)

I didn't understand this, sorry. What's wrong with having a null date and a non-null binary_flag for one patient?

HelenCEBM commented 3 years ago

> What's wrong with having a null date and a non-null binary_flag for one patient?

It's counterintuitive! I had no idea why counting different columns was giving me different results, because it didn't occur to me that dates of clinical events could be missing... but maybe that's just me.

HelenCEBM commented 3 years ago

Just to check: if the study_definition only has a filter for on_or_before for a particular variable we will miss any events with NULL dates in EMIS? But we will capture them in TPP as occurring in 1900.
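To make the concern concrete: if the date comparison behaves as it does in SQL and pandas, where a comparison against a null is never true, the same undated event is excluded when stored as NULL and included when stored as 1900-01-01. The sketch below uses made-up data and does not confirm what either backend actually does.

```python
import pandas as pd

cutoff = pd.Timestamp("2021-03-24")  # hypothetical on_or_before limit

# The same two events, encoded the way each system is described above:
# EMIS stores the unknown date as NULL, TPP as 1900-01-01.
emis_dates = pd.Series(pd.to_datetime(["2019-02-01", None]))
tpp_dates = pd.Series(pd.to_datetime(["2019-02-01", "1900-01-01"]))

# NaT <= cutoff is False, so the undated EMIS event is not matched.
emis_matches = int((emis_dates <= cutoff).sum())
tpp_matches = int((tpp_dates <= cutoff).sum())

print(emis_matches, tpp_matches)  # 1 2
```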

HelenCEBM commented 3 years ago

Further detail on clinical codes in TPP with unknown dates:

1900 top 50
- Top entry is Marital status unknown (>1m)
- Various other demographic codes appear relating to marital statuses, ethnicities, languages, smoking/alcohol, along with other administrative procedures e.g. relating to registration.
- Diagnoses: Asthma appears high in the list with 206k records, hypertension 106k, penicillin reaction 79k
- Findings: Systolic bp 125k records, Body weight 65k.
- Some records relate to family history of particular conditions, or non-specific clinical events e.g. Radiology/physics in medicine (procedure)
1901 top 20
- Mostly related to registration, administration, history or non-specific clinical events (e.g. Drug prescription (situation))
- The first events in the list are cervical smears and Asthma (each 500-1000 records)
1899 top 20
- Top 2 codes are imported notes (7.8k) / textual problem (3.2k)
- Asthma is third, with 3000 records, and there are various other observations/clinical findings such as full blood count, blood pressure, height, BMI etc with 1500-2000 records each.

EMIS unknown dates:

Most common items relate to marital status ("unknown" 1.4m) and administration. However, there are some codes not successfully linked to SNOMED codes which may be local codes (top entry has 1.3m records). Other common entries are non-specific procedures, situations or histories.
Top disorders and observable entities: Asthma (221k), Height (169k), Hypertension (127k), systolic bp (60k).
Some family history of specific conditions also appear e.g. hypertension (75k)
Other codes of interest include Alcohol abuse prevention education (76k), Excision of appendix (39k), Birth weight (26k).

Overall

It is likely useful to include null dates (i.e. no on_or_after date) for:

ethnicity
family history of specific conditions ("at risk" flags)
personal history (i.e. "ever had a blood pressure reading over X", "ever participated in screening for X", "ever smoked")
other irreversible conditions/procedures e.g. tonsillectomy, excision of appendix, hysterectomy

When looking at the current or latest situation it may be advisable to always include an on_or_after date to exclude nulls. Examples may include:

height/weight/BMI - if a patient only has a recording with a null date, it is unknown when this occurred or whether it is still valid.
Most diagnoses of conditions that may resolve or vary over time.
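The two cases can be sketched side by side in pandas. The event table, the text labels used in place of real codes, and the 2015 bound are all hypothetical. An "ever had" flag applies no date filter, so undated events count; a "latest value" query with an on_or_after bound drops them, because a comparison against a null date is False.

```python
import pandas as pd

# Hypothetical events: patient 1 has an undated appendix excision,
# patient 2 has one undated and one dated weight recording.
events = pd.DataFrame({
    "patient_id": [1, 2, 2],
    "code": ["excision of appendix", "weight", "weight"],
    "date": pd.to_datetime([None, None, "2020-09-01"]),
    "value": [None, 82.0, 79.5],
})

# "Ever had": no date filter, so the undated procedure still counts.
ever_appendix = set(events.loc[events["code"] == "excision of appendix", "patient_id"])

# "Latest": the on_or_after bound excludes the undated weight recording.
on_or_after = pd.Timestamp("2015-01-01")
recent = events[(events["code"] == "weight") & (events["date"] >= on_or_after)]
latest_weight = recent.sort_values("date").groupby("patient_id")["value"].last()
```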

HelenCEBM commented 3 years ago

@inglesp please could you run a similar query in EMIS to extract the top ~100 SNOMED codes with no known date? (The SQL I ran in TPP is here for reference but this will be simpler in EMIS as there are no odd CTV3 codes to fetch descriptions for).

inglesp commented 3 years ago

Here you go: snomedct_codes_without_dates.csv.txt

There are seven that we don't have terms for -- I presume these are EMIS local codes.

opensafely / emis-qa