opensafely-core / ehrql

ehrQL: the electronic health record query language for OpenSAFELY
https://docs.opensafely.org/ehrql/
Other
7 stars 3 forks source link

Support `.where()` on PatientFrames #1733

Open evansd opened 11 months ago

evansd commented 11 months ago

The desired behaviour here should be pretty self-explanatory. We want queries like this to work in the obvious way:

filtered_frame = patient_frame.where(patient_frame.condition)
dataset.series_a = filtered_frame.series_a
dataset.series_b = filtered_frame.series_b

The reason for wanting it is ergonomic: it's possible to build logically equivalent queries using case expressions, for example:

dataset.series_a = when(patient_frame.condtion).then(patient_frame.series_a).otherwise(None)
dataset.series_b = when(patient_frame.condtion).then(patient_frame.series_b).otherwise(None)

But that requires repeating the condition, and doesn't allow combining multiple conditions by chaining. It's also generally better to allow users to achieve logically equivalent things using the same mechanisms as far as possible, rather than requiring them to do things one way in one context and another way in a different context.

Technically, I think this is going to be medium difficulty. Many parts of it will be very straightforward but updating BaseSQLQueryEngine to support this might involve some slightly mind-stretching reasoning.

Initial discussion in Slack thread: https://bennettoxford.slack.com/archives/C03FB777L1M/p1700050532117739


Implementation notes

Changing the ehrQL surface and the query model rules should be fairly easy. We'd need to move the where() and except_where() methods from EventFrame to BaseFrame: https://github.com/opensafely-core/ehrql/blob/4add3ee29148205bf53a61a517ef63bbd0d1688f/ehrql/query_language.py#L1028-L1056

And change Filter to accept and return Frame rather than ManyRowsPerPatientFrame: https://github.com/opensafely-core/ehrql/blob/4add3ee29148205bf53a61a517ef63bbd0d1688f/ehrql/query_model/nodes.py#L233-L235

And then extend the spec tests to cover some PatientFrame examples: https://github.com/opensafely-core/ehrql/blob/4add3ee29148205bf53a61a517ef63bbd0d1688f/tests/spec/filter/test_where.py#L7

I think the InMemoryQueryEngine may work out of the box here: it seems to have filter methods already defined in all the relevant places. If not, it should be reasonably straightforward to extend.

We'll also need to extend the generative tests so that they produce examples of Filter being applied to a OneRowPerPatientFrame: https://github.com/opensafely-core/ehrql/blob/4add3ee29148205bf53a61a517ef63bbd0d1688f/tests/generative/variable_strategies.py#L448-L467

The challenge will come with BaseSQLQueryEngine. I haven't quite got my head around what changes will be needed here. But it's going to involve looking at the get_select_query_for_node_domain method and the various places that gets called:

https://github.com/opensafely-core/ehrql/blob/4add3ee29148205bf53a61a517ef63bbd0d1688f/ehrql/query_engines/base_sql.py#L702-L706

evansd commented 11 months ago

Alternative implementation idea

@inglesp Has suggested it might be possible to do this purely in the query language by using where() as syntactic sugar for building case expressions.

I think this would involve adding a _conditions list to PatientFrame and each call to where() creates a new frame with the new condition appended to this list. Then accessing column attributes on this frame would return a case() expression which took the column value if all the conditions are met and NULL otherwise.