opensafely / emis-qa

MIT License
0 stars 0 forks source link

Investigate strange prevalences: events with null dates #4

Open sebbacon opened 3 years ago

sebbacon commented 3 years ago

See #2 for general background info

Context: we are writing a short data report to compare prevalences for all the codelists used in the PRIMIS vaccination specification between EMIS and TPP. This is a "smell test" for any possible problems: we would expect prevalences to be roughly similar, and where they are not, we would expect to be able to come up with plausible hyptheses why (e.g. EMIS populations tend to be younger and more ethnically diverse largely because they are over-represented in London).

The population is defined as all registered patients between the ages of 16 and 120.

In this context, we noticed a few differences which we felt were probably explainable, and one very odd one - a lot of over 70s being marked as pregnant in TPP (but not EMIS). We eventually found a satisfying hypothesis for this around user interface choices between the systems. This was observed at what we're calling Point A.

Then, in subsequent, related analyses, we started to see inconsistencies between analyses against the EMIS data:

Label Date / time Characteristics Links
A On or before March 1 Close to zero pregnant >70yo csv
DB1 March 15 Database refresh
B1 March 17, 10:45-12:48 Considerably more pregnant >70yo notebook, data diff with A
B2 March 17, 12:45-15:46 Close to zero pregnant men >70yo, women >70yo might higher rates than at A. Additionally, 70k fewer patients in input.csv. notebook, preg codes csv, pregdel codes csv
C March 24, afternoon Close to zero pregnant >70yo
DB2 (expected March 24)

Questions that arise:

Hypotheses:

Notes:

Investigation plan:

HelenCEBM commented 3 years ago

I can't edit the above comment but here are some more links:

inglesp commented 3 years ago

Here are some counts of patients with pregnancy codes.

# load the CSVs
>>> df1 = pd.read_csv("output/input-20210324.csv", usecols=["age", "sex", "preg", "pregdel"])
>>> df2 = pd.read_csv("output/input_with_codes-20210324.csv", usecols=["age", "sex", "preg", "preg_date", "pregdel", "pregdel_date"])

# same total number of records in each output file
>>> len(df1)
27093304
>>> len(df2)
27093304

# total number of records with pregnancy codes
# the small difference is because some records have a pregnancy code but no date
>>> len(df1[df1["preg"].notna()])
4363958
>>> len(df2[df2["preg"].notna()])
4369385
>>> len(df2[df2["preg_date"].notna()])
4363958

# restrict data to having a pregnancy code in 2020 or after
>>> df1_preg = df1[df1["preg"] >= 2020]
>>> df2_preg = df2[df2["preg_date"] >= 2020]

# same number of records
>>> len(df1_preg)
457945
>>> len(df2_preg)
457945

# restrict data to having a pregnancy code in 2020 or after, and patient being 70+
>>> df1_preg_70plus = df1_preg[df1_preg["age"] >= 70]
>>> df2_preg_70plus = df2_preg[df2_preg["age"] >= 70]

# same number of records
>>> len(df1_preg_70plus)
232
>>> len(df2_preg_70plus)
232

# ... and we see the same with codes for pregnancy + delivery
>>> len(df1[df1["pregdel"].notna()])
6578474
>>> len(df2[df2["pregdel"].notna()])
6596578
>>> len(df2[df2["pregdel_date"].notna()])
6578474
>>> 
>>> df1_pregdel = df1[df1["pregdel"] >= 2020]
>>> df2_pregdel = df2[df2["pregdel_date"] >= 2020]
>>> 
>>> len(df1_pregdel)
571603
>>> len(df2_pregdel)
571603
>>> 
>>> df1_pregdel_70plus = df1_pregdel[df1_pregdel["age"] >= 70]
>>> df2_pregdel_70plus = df2_pregdel[df2_pregdel["age"] >= 70]
>>> 
>>> len(df1_pregdel_70plus)
510
>>> len(df2_pregdel_70plus)
510

@HelenCEBM how does this differ to what you've seen?

HelenCEBM commented 3 years ago

Ah! The codes-with-missing-dates could be the answer. I hadn't considered that it was possible for an event not to have an associated date (is this possible in TPP?). The way I was filtering the years would have left these in. And it makes sense this would give an excess of codes for older women (i.e. genuine pregnancy events that happened at some unknown date in the past) but not men.

sebbacon commented 3 years ago

Actions:

inglesp commented 3 years ago

emis-counts.csv.txt

HelenCEBM commented 3 years ago

In order to consider the implications for study definitions I have also checked which are the most common codes in the future (year>2021) and the "past" (1900, ie. unknown date) in TPP. Note these include all events and so many will be for deregistered/deceased patients.

Future:

  1. Y4615 (33k) = batch no and expiry,
  2. 72313002 (11k) = systolic bp,
  3. 1091811000000102 (8.5k) = diastolic bp

"Past" (unknown date):

  1. 60504008 (1.14m) = marital status unknown,
  2. 92391000000108 (713k) = British ethnicity (2001),
  3. Y0529 (508k) = Imported notes
HelenCEBM commented 3 years ago

Question re. study definitions, how are missing dates currently handled by find_first_match_in_period? E.g. if someone has two of the same events recorded in EMIS, one with a NULL date and one with a valid date, is the NULL event considered to be earlier, later, or ignored altogether? Clearly it's possible to return non-date values for events with NULL dates, so they can't be totally ignored, but perhaps this only occurs if there are no other matching events (with valid dates) for a given patient?

sebbacon commented 3 years ago

So it seems to me that when getting observations from the data, people need to consider:

One option is that this could be a standard action in our actions library which converts probably-unreliable dates to nulls; it should perhaps be in our template project.yaml as a standard, but people can opt out of it.

Then our "reporting" action should, as standard, highlight the number of null dates for each date variable, and the user can decide what to do with one.

Would this make sense, do you think?

sebbacon commented 3 years ago

And to answer the other question, if a patient has multiple dates for a code including a null, e.g. [NULL, "2021-03-01", "2021-03-05"] then the first date returned would be "2021-03-01"

HelenCEBM commented 3 years ago
sebbacon commented 3 years ago

it produces inconsistencies (where no date limits are applied) between different values returned from the same query (e.g. binary_flag or numeric value could be present where date is not)

I didn't understand this, sorry. What's wrong with having a null date and a non-null binary_flag for one patient?

HelenCEBM commented 3 years ago

What's wrong with having a null date and a non-null binary_flag for one patient?

It's counterintuitive! I had no idea why counting different columns was giving me different results, because it didn't occur to me that dates of clinical events could be missing... but maybe that's just me.

HelenCEBM commented 3 years ago

Just to check: if the study_definition only has a filter for on_or_before for a particular variable we will miss any events with NULL dates in EMIS? But we will capture them in TPP as occurring in 1900.

HelenCEBM commented 3 years ago

Further detail on clinical codes in TPP with unknown dates:

EMIS unknown dates:

Overall

It is likely useful to include null dates (i.e. no on_or_after date) for:

When looking at the current or latest situation it may be advisable to always include a on_or_after date to exclude nulls. Examples may include:

HelenCEBM commented 3 years ago

@inglesp please could you run a similar query in EMIS to extract the top ~100 SNOMED codes with no known date? (The SQL I ran in TPP is here for reference but this will be simpler in EMIS as there are no odd CTV3 codes to fetch descriptions for).

inglesp commented 3 years ago

Here you go: snomedct_codes_without_dates.csv.txt

There are seven that we don't have terms for -- I presume these are EMIS local codes.