opensafely-core / ehrql

ehrQL: the electronic health record query language for OpenSAFELY
https://docs.opensafely.org/ehrql/
Other
7 stars 3 forks source link

How should equality between frames behave? #2141

Open evansd opened 3 weeks ago

evansd commented 3 weeks ago

This is fairly niche, but came up when some experienced Python programmers were first learning ehrQL. They wondered whether two frame operations produced the same thing and so tried writing frame_a == frame_b.

At present, this just uses the default Python object identity behaviour. There are two reasonable alternatives to this:

  1. Compare the related query model objects (i.e. the _qm_node attributes). This is the semantically correct thing to do: query model nodes are value objects and will compare equal iff they are structurally identical.

  2. Blow up with an error: ehrQL series objects have special equality behaviour where the result is a new series. Comparing a series with a frame already triggers a (hopefully) explanatory error. Comparing two frames is more likely to be a typo/thinko by an novice ehrQList than it is to be the investigation of a curious Pythonista, so raising an error is the more helpful thing to do.

I think I'm leaning in favour of option 2. Adding guardrails for novices is particularly important in our context. And experienced Pythonistas will be better placed to cope with the unexpected error.

I'd also add that, in the specific context in which this came up, the users were confused by the unexpected False result. But that result was actually correct in the sense that the two expression they assumed were semantically equivalent were not actually equivalent. So it's not clear that doing the semantically correct thing would actually have been helpful in any case.

Slack thread: https://bennettoxford.slack.com/archives/C069YDR4NCA/p1727966431279409?thread_ts=1727963482.698439&cid=C069YDR4NCA

madwort commented 3 weeks ago

+1 for option 2, my guess is that you're right & that in production ehrQL this usage would be a thinko.

Heading off on a tangent about option 1: How & when does ehrQL simplify the graph of a query? I haven't checked the details, but in the specific situation that this came up, the two frame definitions were possibly equivalent rather than equal ("where_not equals" vs "where not_equals") - they would ideally need to be in a standard form to do the best job of option 1?

Another tangent - this ticket is about what should we do for the syntax frame == frame. If we had an aim to compare the results of frames, do we currently have a way to do that (spitballing: frame.get_results() == frame.get_results())? Not for "production ehrQL usage" but for researchers to learn & explore in the sandbox.

(discussion of tangents could move to Slack rather than this ticket!)

evansd commented 3 weeks ago

the two frame definitions were possibly equivalent rather than equal

My point was that, unless I misunderstood the example, they aren't equivalent. The expression:

frame.except_where(condition)

Is equivalent to:

frame.where(~condition | condition.is_null())

Which it has to be if it's going to act as the inverse of where(). So it is semantically distinct from:

frame.where(~condition)

More generally though you're right that will be semantically equivalent queries (e.g. ~(a & b) and ~a | ~b) which are compare non-equal as they're structurally different.


On comparing the results of frames in the sandbox, I wonder if we could override the __iter__ method in the same way we do the __str__ method. So you could directly compare the contents of frames with:

list(frame_a) == list(frame_b)
madwort commented 2 weeks ago

they aren't equivalent

ahhhhhhhh I hadn't spotted that. In which case the idea of comparing the results of frames in the sandbox could be more misleading than it is helpful. Scratch all that!