Have a discussion about study definition API design, using medicines data as an example

sebbacon commented 3 years ago

From https://github.com/opensafely-core/cohort-extractor/issues/570#issuecomment-858351078:

thought on OpenSAFELY design - do we rename libraries in line with underlying data e.g. should we make something called with_these_medications_dispensed or do we leave it as with_these_medications and it's up to folk to understand deeply the undelying data? Similar with patients.admitted_to_hospital

Context: we already have a medications table in TPP which shows what GPs have issued; we are now considering querying the BSA prescribing dataset, available in Databricks. The BSA dataset itself distinguishes between prescribed and dispensed medications; every medication listed by BSA is prescribed, not not every medication is dispensed. Note that some medications that are issued by a doctor never get presented for dispensing (e.g. an unused repeat prescription), so the lifetime of a prescription goes issued -> presented-for-dispensing (a.k.a. "prescribed" in the BSA dataset) -> dispensed, with some medications dropping off at each step.

In my view, our main API should describe high-level medical concepts that cut across difficult clinical datasets; I'd like it to be possible to hold the entire API in terms of function calls in your head, but have to refer to reference to see the details. Variations within this should be expressed as filters / keyword arguments.

Some quick scribble-on-napkin ideas:

with_these_medications(returning='issued_or_presented_or_dispensed') might be one option; returning the most recent state of any medication (then people can filter the input.csv afterwards).

Another would be with_these_medications(filter={'last_state':'dispensed'}).

Of course you might be interested in all medications issued but not dispensed. Or the time between the first issue of a medication, and its first dispensing event. Or the same, but for the most recent issue / dispensing.

Long term (i.e. not for the PoC) this also brings up the question of handling multiple data sources offering very similar data. If we have prescribed medications in TPP, and prescribed medications in GDPPR, and prescribed medications in the BSA data, then (a) should we make it possible for users to select just one dataset to query (b) where multiple are available and none is specified, what should we do?

with_these_medications(returning='source_dataset') and with_these_medications(filter={'last_state':'dispensed', 'source_dataset': 'bsa_prescriptions})`?

benbc commented 3 years ago

My immediate thought is that we should avoid introducing abstractions and information-hiding until we are sure that there is a need for it and we know how to do it in a way that makes sense.

I can imagine splitting researchers and/or their use cases into three categories:

those who are explicitly interested in a specific data source
those who understand the characteristics of the different data sources and know how they want to join across or compare them
those with a limited understanding of the underlying data who just want to know "the truth" in the domain without engaging with the details

We should be able to provide what 1 and 2 need without putting clever abstractions across the data sources -- indeed such abstractions might be detrimental to what they're trying to achieve.

The abstractions are needed for 3, but we may be able to defer satisfying that use case on the assumption that early users will be more sophisticated and the simpler or more sophisticated analysis may provide better return on investment.

evansd commented 3 years ago

Yes, what @benbc said!

I can imagine providing a higher-level function which is a union over all the different sources of rx data, but I wouldn't want to make that the primary way of accessing the data.

sebbacon commented 3 years ago

Hard to disagree with any of that!

However I have a long-standing question of how we find the right balance of abstraction vs explicit; how we decide when to exercise caution about premature design decisions vs introducing constraints early on. I find this genuinely hard to think about. I suppose I opened this issue as a prompt to try to work this out.

My concern is that if we allow people to do anything they may well ask us for the "wrong" thing. For example, we want reproducible, testable, verifiable, stateful-code-and-data, libraries; none of these are things our users are clamouring for. Part of the success of OpenSAFELY has (in my opinion) been our decisions about what you're NOT allowed to do. I could well be completely wrong, of course - it's all gut feeling.

One way of approaching it might be: where are we already doing abstractions and information-hiding? If so, what aspects have been a hindrance, and what aspects a help, with our early users? If we wanted to provide explicit granular access to power users, without applying our preformed ideas about needs, what would have happened if we just given them SQL from the start?

opensafely-core / cohort-extractor

Have a discussion about study definition API design, using medicines data as an example #572