opensafely-core / ehrql

ehrQL: the electronic health record query language for OpenSAFELY
https://docs.opensafely.org/ehrql/

Better dummy data for date columns #1324

Open iaindillingham opened 1 year ago

iaindillingham commented 1 year ago

ehrQL has an open_prompt table.

https://github.com/opensafely-core/ehrql/blob/c1971f82f98a07990c2ff1f817ed11b2550b5cb6/ehrql/tables/beta/tpp.py#L478-L479

When generating dummy dates for open_prompt.consultation_date, the dummy data generator picks a date that’s within the last year, or within the last year of the patient’s lifetime, if the patient has died. (It’s a bit more complicated than that, but not much.)

https://github.com/opensafely-core/ehrql/blob/c1971f82f98a07990c2ff1f817ed11b2550b5cb6/ehrql/dummy_data/generator.py#L238-L249
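To make the behaviour described above concrete, here is a minimal sketch of the date-picking rule in plain Python. It is not the code in generator.py: the function name `dummy_event_date`, the 365-day window, and the fixed `today` are assumptions for illustration, and the real generator is a bit more complicated.

```python
import random
from datetime import date, timedelta


def dummy_event_date(rng, today, date_of_death=None):
    """Hypothetical helper: pick a date in the year ending at `today`,
    or in the year ending at the date of death if the patient has died."""
    if date_of_death is not None:
        window_end = min(today, date_of_death)
    else:
        window_end = today
    window_start = window_end - timedelta(days=365)
    # randrange(366) yields 0..365, so both ends of the window are possible.
    return window_start + timedelta(days=rng.randrange(366))


rng = random.Random(0)  # seeded so the sketch is repeatable
today = date(2023, 6, 1)  # assumed run date, not taken from the issue
living = dummy_event_date(rng, today)
deceased = dummy_event_date(rng, today, date_of_death=date(2021, 3, 15))
print(living, deceased)
```

Under this rule, a patient who died before the questionnaire went live can never receive an in-range consultation date.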

The OpenPROMPT questionnaire went live on 2022-11-11, so the dummy dates are often out of range (i.e. before the questionnaire went live). We might think this doesn’t matter; however, there are rows within the open_prompt table that don’t relate to the OpenPROMPT questionnaire. If we filter these rows out

open_prompt.where(open_prompt.consultation_date >= date(2022, 11, 11))

then we end up with a dataset that doesn’t have many dummy dates; instead, it has many missing values, which, anecdotally, makes developing downstream actions "very fiddly".
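As a rough illustration of why the filter leaves so many missing values, the sketch below simulates it in plain Python rather than ehrQL. It assumes one dummy date per patient drawn uniformly from the year before an assumed run date; `dummy_dates`, `filter_from_go_live`, and the run date are hypothetical, and only the go-live date comes from this issue.

```python
import random
from datetime import date, timedelta

GO_LIVE = date(2022, 11, 11)  # questionnaire go-live date, from the issue


def dummy_dates(n, today, seed=0):
    """Hypothetical stand-in for the generator: one date per patient,
    drawn uniformly from the year ending at `today`."""
    rng = random.Random(seed)
    return [today - timedelta(days=rng.randrange(366)) for _ in range(n)]


def filter_from_go_live(dates):
    """Mimic the filter: a date before go-live becomes a missing value."""
    return [d if d >= GO_LIVE else None for d in dates]


today = date(2023, 1, 10)  # assumed run date, 60 days after go-live
filtered = filter_from_go_live(dummy_dates(1000, today))
missing = sum(d is None for d in filtered)
print(f"{missing} of {len(filtered)} dummy dates are missing after filtering")
```

The closer the run date is to the go-live date, the smaller the share of the one-year window that survives the filter, so most patients end up with a missing value.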

At this point, we could suggest that the user generates their own dummy data: either dummy input data or dummy output data. However, if I were the user, then I’d probably want better dummy data: that’s what I think @hendersonad wanted, anyhow. (He went on to generate his own dummy output data.)

How could we provide better dummy data for date columns?

See this Slack thread for more information, and this dataset definition for an executable example.

inglesp commented 1 year ago

We need to think more about improving dummy data in general, particularly in terms of creating escape hatches that don't involve bailing out entirely.