Better dummy data for date columns

ehrQL has an open_prompt table.

https://github.com/opensafely-core/ehrql/blob/c1971f82f98a07990c2ff1f817ed11b2550b5cb6/ehrql/tables/beta/tpp.py#L478-L479

When generating dummy dates for open_prompt.consultation_date, the dummy data generator picks a date that’s within the last year, or within the last year of the patient’s lifetime, if the patient has died. (It’s a bit more complicated than that, but not much.)

https://github.com/opensafely-core/ehrql/blob/c1971f82f98a07990c2ff1f817ed11b2550b5cb6/ehrql/dummy_data/generator.py#L238-L249
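
To make the behaviour described above concrete, here is a minimal sketch of the idea. It is not the code from generator.py: the function name pick_dummy_date and its parameters are hypothetical, and the real generator is a bit more complicated.

```python
import random
from datetime import date, timedelta


def pick_dummy_date(today, date_of_death=None, rng=random):
    """Pick a date in the last year, or in the last year of the patient's life.

    Hypothetical simplification of the behaviour described in this issue.
    """
    # If the patient has died, the window ends at their date of death.
    window_end = date_of_death if date_of_death is not None else today
    window_start = window_end - timedelta(days=365)
    # Choose a whole number of days (0..365) after the start of the window.
    offset = rng.randrange(0, 366)
    return window_start + timedelta(days=offset)


# A living patient: the date falls in the year up to "today".
print(pick_dummy_date(date(2023, 6, 1)))
# A patient who died in 2015: the date falls in the year before their death.
print(pick_dummy_date(date(2023, 6, 1), date_of_death=date(2015, 3, 1)))
```

Note that for a patient who died before the questionnaire went live, every date this sketch can produce is earlier than 2022-11-11.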

The OpenPROMPT questionnaire went live on 2022-11-11, so the dummy dates are often out of range (i.e. the dummy dates are before the questionnaire went live). We might think this doesn’t matter; however, there are rows within the open_prompt table that don’t relate to the OpenPROMPT questionnaire. If we filter these rows out

open_prompt.where(open_prompt.consultation_date >= date(2022, 11, 11))

then we end up with a dataset that doesn’t have many dummy dates; instead, it has many missing values, which, anecdotally, makes developing downstream actions "very fiddly".
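
As a rough illustration of why so few dummy dates survive the filter, the following sketch draws dates uniformly from a one-year window and counts how many fall on or after the go-live date. The "today" value of 2023-01-01, the uniform draw, and the helper name fraction_on_or_after are all assumptions made for the example; they are not taken from ehrQL.

```python
import random
from datetime import date, timedelta

GO_LIVE = date(2022, 11, 11)  # OpenPROMPT questionnaire go-live date


def fraction_on_or_after(today, cutoff, n=10_000, seed=0):
    """Estimate the share of dummy dates from the year up to `today` that are >= `cutoff`."""
    rng = random.Random(seed)
    window_start = today - timedelta(days=365)
    kept = 0
    for _ in range(n):
        # Uniform draw over the one-year window (an assumed simplification).
        dummy = window_start + timedelta(days=rng.randrange(0, 366))
        if dummy >= cutoff:
            kept += 1
    return kept / n


# Assumed "today": shortly after go-live, so most of the window predates it.
share = fraction_on_or_after(date(2023, 1, 1), GO_LIVE)
print(f"Share of dummy dates kept by the filter: {share:.0%}")
```

Under these assumptions only a small minority of dummy dates pass the filter, and patients who died before go-live contribute none at all.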

At this point, we could suggest that the user generates their own dummy data: either dummy input data or dummy output data. However, if I were the user, then I’d probably want better dummy data: that’s what I think @hendersonad wanted, anyhow. (He went on to generate his own dummy output data.)

How could we provide better dummy data for date columns?

See this Slack thread for more information, and this dataset definition for an executable example.

opensafely-core/ehrql issue #1324