opensafely-core / ehrql

ehrQL: the electronic health record query language for OpenSAFELY
https://docs.opensafely.org/ehrql/
Other
7 stars 3 forks source link

Support mixed direction sorts #2223

Open evansd opened 1 week ago

evansd commented 1 week ago

Problem

Our current API for sorts (the sort_by() and first/last_for_patient() methods) doesn't allow specifying a mixture of ascending and descending sorts.

This means that while we can write queries like "find events with the largest numeric value and return the latest":

events.sort_by(events.numeric_value, events.date).last_for_patient()

We can't straightforwardly support queries like "find events with the largest numeric value and then return the earliest of these" because that requires sorting numeric value and date in opposite directions.

For this particular example we can work around this by inverting the float which has the effect of reversing the sort direction:

events.sort_by(-events.numeric_value, events.date).first_for_patient()

But that trick is not applicable to many other types (e.g you can't invert a date).

Slack discussion: https://bennettoxford.slack.com/archives/C069YDR4NCA/p1731501153163389

Possible API designs

We could allow sort_by() to take a reverse keyword argument. This is simple and matches Python's sorted() function. It allows you to support mixed direction sorts by chaining e.g. to find events with the largest numeric value and then return the earliest of these you could do:

events.sort_by(events.date, reverse=True).sort_by(events.numeric_value).last_for_patient()

However, chaining sorts like this is extremely counter-intuitive as you need to apply them in backwards order of priority, and remember which you have applied reverse to. I had to think quite hard to persuade myself that the above example is correct.

Another alternative is to have some wrapper like desc() which causes wrapped columns to sort in descending order as opposed to the default of ascending. (SQLAlchemy does something like this.) The same example would then look something like:

events.sort_by(desc(events.numeric_value), events.date).first_for_patient()

I'm sure there are other options we're missing as well. We should make sure we've explored the design space well here before settling on anything.

madwort commented 1 week ago

chaining sorts

I really don't like this & don't think we should suggest this to researchers. As discussed elsewhere, you also need to know that the first sort is stable after the second sort.

wrapper like desc()

would we have a corresponding (but superfluous) asc() wrapper? and/or I'd be tempted to require sort order everywhere, but I think that's going to be too prescriptive for your taste?

events.sort_by(desc(events.numeric_value), asc(events.date)).first_for_patient()
events.sort_by(asc(events.date)).first_for_patient()