opensafely-core / ehrql

ehrQL: the electronic health record query language for OpenSAFELY
https://docs.opensafely.org/ehrql/
Other
6 stars 3 forks source link

Better support the filter/reduce pattern #1645

Open iaindillingham opened 10 months ago

iaindillingham commented 10 months ago

In ehrQL, it's common to filter an event frame by a codelist, and then to reduce it to a patient frame by sorting by date and selecting either the first or the last row. For example:

dataset.ethnicity = (
    clinical_events.where(
        clinical_events.ctv3_code.is_in(ethnicity_codelist)
    )
    .sort_by(clinical_events.date)
    .last_for_patient()
    .ctv3_code.to_category(ethnicity_codelist)
)

Because it's common, we've talked about adding a helper method. Such a helper method would make the previous example shorter. It would also make writing the previous example less error-prone, as it's easy to forget the sort_by: that is, to write .where().last_for_patient() rather than .where().sort_by().last_for_patient().

The previous example is an instance of the filter/reduce pattern. However, it's not the only instance. In the following example, we filter an event frame by a codelist and by a date, and reduce it to a patient frame by counting the number of occurrences.

dataset.num_asthma_inhaler_medications = (
    medications.where(
        medications.dmd_code.is_in(asthma_inhaler_codelist)
    )
    .where(
        medications.date.is_on_or_between(
            index_date - days(30), index_date
        )
    )
    .count_for_patient()
)

The helper method doesn't help 🙁. Indeed, the helper method may hinder a user's ability to learn ehrQL, because it's easier to notice the filter-reduce pattern when it's not hidden behind a helper method.[^1]

Rather than adding a helper method, we could make it easier to remember the sort_by. We could do so by combining sort_by and last_for_patient. For example:

dataset.ethnicity = (
    clinical_events.where(
        clinical_events.ctv3_code.is_in(ethnicity_codelist)
    )
    .last_for_patient(sort_by=clinical_events.date)  # 👀 
    .ctv3_code.to_category(ethnicity_codelist)
)

We could also warn the user when either first_for_patient or last_for_patient is used without a sort_by. By doing so, we would better support the filter/reduce pattern without hindering a user's ability to learn ehrQL.

@cewalters -- who suggested combining sort_by and last_for_patient -- says this would be more familiar to R users. Thanks for your help writing this issue, Caroline 🙏🏻

[^1]: You're going to have to take my word for it. If you would like to talk more about why noticing is important for learning, then please ask 🙂.

evansd commented 10 months ago

One possibility I've wondered about is to allow tables which have a single "natural" date column to be pre-sorted by that column and also to have shorthand filter methods for dates like:

events.where_on_or_before("2020-01-01")

As an alias for:

events.where(events.date.is_on_or_before("2020-01-01")

That certainly helps to make examples shorter e.g

dataset.num_asthma_inhaler_medications = (
    medications.where(
        medications.dmd_code.is_in(asthma_inhaler_codelist)
    )
    .where_on_or_between(index_date - days(30), index_date)
    .last_for_patient()
    .dmd_code
)

And it uses a more generalisable approach than having special case helper methods.

But it may still suffer from the problem of obscuring rather than clarifying the semantics here.