opensafely-core / ehrql

ehrQL: the electronic health record query language for OpenSAFELY
https://docs.opensafely.org/ehrql/
Other
7 stars 3 forks source link

Add a `cluster_by_episode()` operation on date series #2198

Open evansd opened 3 weeks ago

evansd commented 3 weeks ago

The intention here is to provide a more flexible, composable alternative to the existing count_episodes_for_patient aggregation.

The idea is that given a date event series like this:

date
2020-01-01
2020-01-04
2020-01-06
2020-01-10
2020-01-12

We could "cluster" it into episodes, defining the maximum gap allowed between episodes like so:

date.cluster_by_episode(days(3))

And this would produce a new event frame with two columns giving the start and end dates of each episode:

start_date end_date
2020-01-01 2020-01-06
2020-01-10 2020-01-12

Calling count_for_patient() on this frame would give the same result as the existing count_episodes_for_patient() aggregation. But all sorts of other operation are now possible e.g. calculating total episode length and including only those over a certain length:

episodes = events.date.cluster_by_episode(days(3))
episode_lengths = episodes.end_date - episode.start_date
long_episodes = episodes.where(episode_lengths.days >= 14)

Original Slack discussion with motivating use cases: https://bennettoxford.slack.com/archives/C01D7H9LYKB/p1729785637716699

Implementation notes

This is a new kind of operation in the sense that it takes a series and returns a frame. It's also unusual in that the frame class it returns is not one already defined in a schema module. In the query_language module we'll probably want to define a dedicated frame class for this with start_date and end_date series.

Note that this operation is not an aggregation (it doesn't result in a one-row-per-patient result) so we shouldn't use the _for_patient() suffix in naming it.

I don't have a clear sense of how fiddly or not this will be to implement in SQL. The existing episode counting code uses a LAG window function to do it's work. It may be that a simple adaption of this code will do what we need. Or there may turn out to be some significant blocker I haven't thought of.