Open sebbacon opened 3 years ago
My immediate thought is that we should avoid introducing abstractions and information-hiding until we are sure that there is a need for it and we know how to do it in a way that makes sense.
I can imagine splitting researchers and/or their use cases into three categories:
We should be able to provide what 1 and 2 need without putting clever abstractions across the data sources -- indeed such abstractions might be detrimental to what they're trying to achieve.
The abstractions are needed for 3, but we may be able to defer satisfying that use case on the assumption that early users will be more sophisticated and the simpler or more sophisticated analysis may provide better return on investment.
Yes, what @benbc said!
I can imagine providing a higher-level function which is a union over all the different sources of rx data, but I wouldn't want to make that the primary way of accessing the data.
Hard to disagree with any of that!
However I have a long-standing question of how we find the right balance of abstraction vs explicit; how we decide when to exercise caution about premature design decisions vs introducing constraints early on. I find this genuinely hard to think about. I suppose I opened this issue as a prompt to try to work this out.
My concern is that if we allow people to do anything they may well ask us for the "wrong" thing. For example, we want reproducible, testable, verifiable, stateful-code-and-data, libraries; none of these are things our users are clamouring for. Part of the success of OpenSAFELY has (in my opinion) been our decisions about what you're NOT allowed to do. I could well be completely wrong, of course - it's all gut feeling.
One way of approaching it might be: where are we already doing abstractions and information-hiding? Where we are, what aspects have been a hindrance, and what aspects a help, with our early users? If we wanted to provide explicit granular access to power users, without applying our preformed ideas about needs, what would have happened if we had just given them SQL from the start?
From https://github.com/opensafely-core/cohort-extractor/issues/570#issuecomment-858351078:
Context: we already have a medications table in TPP which shows what GPs have issued; we are now considering querying the BSA prescribing dataset, available in Databricks. The BSA dataset itself distinguishes between prescribed and dispensed medications; every medication listed by BSA is prescribed, but not every medication is dispensed. Note that some medications that are issued by a doctor never get presented for dispensing (e.g. an unused repeat prescription), so the lifetime of a prescription goes issued -> presented-for-dispensing (a.k.a. "prescribed" in the BSA dataset) -> dispensed, with some medications dropping off at each step.
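To make the lifecycle concrete, here is a minimal sketch of the three states as an ordered enumeration. The names `RxState` and `furthest_state` are hypothetical, made up for illustration; nothing like them is assumed to exist in cohort-extractor.

```python
from enum import IntEnum


class RxState(IntEnum):
    """Hypothetical ordering of the prescription lifecycle described above."""

    ISSUED = 1     # recorded by the GP (e.g. the TPP medications table)
    PRESENTED = 2  # presented for dispensing; "prescribed" in the BSA dataset
    DISPENSED = 3  # actually dispensed


def furthest_state(states):
    """Return the furthest lifecycle state reached, or None if there are no events."""
    return max(states, default=None)
```

Because medications drop off at each step, a prescription whose furthest state is `ISSUED` would never appear in the BSA data at all under this model.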
In my view, our main API should describe high-level medical concepts that cut across difficult clinical datasets; I'd like it to be possible to hold the entire API, in terms of function calls, in your head, and only need to refer to the reference for the details. Variations within this should be expressed as filters / keyword arguments.
Some quick scribble-on-napkin ideas:
`with_these_medications(returning='issued_or_presented_or_dispensed')` might be one option, returning the most recent state of any medication (then people can filter the `input.csv` afterwards). Another would be `with_these_medications(filter={'last_state':'dispensed'})`.

Of course you might be interested in all medications issued but not dispensed. Or the time between the first issue of a medication, and its first dispensing event. Or the same, but for the most recent issue / dispensing.
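As a rough illustration of those last two questions, the sketch below works on plain `(state, date)` pairs for one patient and one medication. The record shape and the helper names are assumptions made up for this example, not a proposal for the real API.

```python
from datetime import date

# Hypothetical event records for one patient and one medication.
events = [
    ("issued", date(2021, 1, 4)),
    ("dispensed", date(2021, 1, 9)),
    ("issued", date(2021, 2, 1)),
]


def first_date(events, state):
    """Earliest date on which the given state was recorded, or None."""
    dates = [d for s, d in events if s == state]
    return min(dates, default=None)


def days_issue_to_dispensing(events):
    """Days between first issue and first dispensing, or None if either is missing."""
    issued = first_date(events, "issued")
    dispensed = first_date(events, "dispensed")
    if issued is None or dispensed is None:
        return None
    return (dispensed - issued).days


def issued_but_never_dispensed(events):
    """True if there is at least one issue event and no dispensing event."""
    has_issue = first_date(events, "issued") is not None
    has_dispensing = first_date(events, "dispensed") is not None
    return has_issue and not has_dispensing
```

Swapping `min` for `max` in `first_date` would give the "most recent issue / dispensing" variant.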
Long term (i.e. not for the PoC) this also brings up the question of handling multiple data sources offering very similar data. If we have prescribed medications in TPP, and prescribed medications in GDPPR, and prescribed medications in the BSA data, then (a) should we make it possible for users to select just one dataset to query (b) where multiple are available and none is specified, what should we do?
`with_these_medications(returning='source_dataset')` and `with_these_medications(filter={'last_state':'dispensed', 'source_dataset': 'bsa_prescriptions'})`?
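One way this could behave, sketched over in-memory stand-ins: if `source_dataset` is given, query only that source; otherwise take the union of all sources and tag each row with where it came from. Falling back to the union is just one possible answer to (b), not a decision. The names `tpp_medications` and `gdppr_medications` and the row shape are made up for illustration.

```python
# Hypothetical in-memory stand-ins for the three sources mentioned above.
SOURCES = {
    "tpp_medications": [{"patient_id": 1, "last_state": "issued"}],
    "gdppr_medications": [{"patient_id": 2, "last_state": "issued"}],
    "bsa_prescriptions": [{"patient_id": 1, "last_state": "dispensed"}],
}


def with_these_medications(filter=None):
    """Return matching rows, each tagged with its source dataset.

    The parameter is called `filter` to mirror the napkin API above,
    even though it shadows the Python builtin.
    """
    criteria = dict(filter or {})
    source = criteria.pop("source_dataset", None)
    # Assumption: with no source specified, query every source (a union).
    names = [source] if source is not None else sorted(SOURCES)
    rows = []
    for name in names:
        for row in SOURCES[name]:
            tagged = {**row, "source_dataset": name}
            if all(tagged.get(key) == value for key, value in criteria.items()):
                rows.append(tagged)
    return rows
```

Tagging every row with `source_dataset` means users who start from the union can still narrow down to a single source afterwards.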