Open evansd opened 3 years ago
Looks good, quick observations:
.csv
extension on the URL? Do we want to assert that it's only www.opencondelists.org
?I think this is great and would substantially simplify the user experience. The column argument may still be required for opencodelists codelists, as for older codelists at least, there is no standardisation of the name of the column containing the codes.
I like the sound of this, too.
This makes it a bit easier for people to try out using non-supported URLs. Do we want to define the contract expected of the URL, e.g. by adding (requiring?) a .csv extension on the URL? Do we want to assert that it's only www.opencondelists.org?
The ability to have opensafely codelists
be aware of local codelists and not nuke them during an update
would be nice. If there's steps required beyond "loading a flat-file" that apply only to opencodelists URLs (e.g. one might wish to load a third-party csv over http) then is there an argument for having an opencodelists.org-specific method?
This makes it a bit easier for people to try out using non-supported URLs. Do we want to define the contract expected of the URL, e.g. by adding (requiring?) a .csv extension on the URL? Do we want to assert that it's only www.opencondelists.org?
The ability to have opensafely codelists be aware of local codelists and not nuke them during an update would be nice. If there's steps required beyond "loading a flat-file" that apply only to opencodelists URLs (e.g. one might wish to load a third-party csv over http) then is there an argument for having an opencodelists.org-specific method?
I think there might be. I'd add, though, that while we should offer escape hatches, we shouldn't make them too easy. We want to encourage best practice through reuse and iteration and I'd like people to somehow justify (if only to themselves) opting out of this
You've not explicitly mentioned the current step where the use commits the specific version of the codelist data to the study repo.
Could the codelist URLs include a version?
Could the codelist URLs include a version?
Yes, it does already. That's the 72ce1380
bit on the end of
https://www.opencodelists.org/codelist/opensafely/pulse-oximetry/72ce1380/
Could the codelist URLs include a version?
Yes, it does already.
So Seb's suggestion that it stays the same is wrong? The user commits the URL to the repo. We don't ever need to put the codelist itself into the repo but can optionally cache it transparently.
I might be misunderstanding here, but I think what Seb was saying is correct in that he was just clarifying something I'd assumed but not spelled out explicitly. Codelists will still be downloaded and committed to the repo, it's just that the way the user specifies which codelists they need changes to be a more direct and less faffy.
Why commit them to the repo?
On Thu, 8 Jul 2021 at 15:48, Dave Evans @.***> wrote:
I might be misunderstanding here, but I think what Seb was saying is correct in that he was just clarifying something I'd assumed but not spelled out explicitly. Codelists will still be downloaded and committed to the repo, it's just that the way the user specifies which codelists they need changes to be a more direct and less faffy.
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/opensafely-core/cohort-extractor-v2/issues/31#issuecomment-876503323, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAA7N3XNOHDOF7453J4D4ILTWW3EJANCNFSM5AAYGIJQ .
-- Ben Butler-Cole Platform Lead, The DataLab @.*** thedatalab.org https://thedatalab.lorg | @ebmdatalab https://twitter.com/ebmdatalab
Why commit them to the repo?
There's a principle that I had at the beginning that everything you need to understand a project should be in github - mainly because it was the only possible place to do this at the time, but also that takes care of longevity to some extent; and backing up for posterity is just one repo.
As we start to move away from github as the primary source of truth we could start to relax that, but I've a mild preference for not changing it until we have better archiving / backup procedures in place. Without some other change we break everything if we lose opencodelists.
Also it's nice to be able to run your code locally without network access, this guarantees this is possible following an initial clone. Again, not essential by a long shot.
it's nice to be able to run your code locally without network access
In this case it's more than nice, it's essential as it needs to run in the secure environment without any network access.
I think also for audit and transparency reasons we need to have the codelists as part of the committed code otherwise we can't guarantee to the public that the commit ID we publish shows the actual query that was run against the data.
And operationally it means that the OpenCodelists site isn't on the critical path for people getting work done in quite the same way.
So lots of benefits I think to keeping things working in this way.
I like this, it reduces friction.
Some of this might require extra work on the OpenCodelists side as I'm not 100% sure that this currently provides all the metadata we need in an easily consumable way.
We'd need an API endpoint that returns metadata and codes, but this would be easy to write. Existing endpoints are at /api/v1/..., and we could either make the codelist()
function convert a non-API URL into an API URL, or we could make non-API URLs return JSON given the right content header.
I think there's already general agreement that something like the above would be an improvement. But, for what it's worth, I've come to realise there's another reason that referencing codelists inline by URL is really important. And that is that without it we can't have variables expressed as self-contained chunks of "codelists plus logic", as Ben refers to them.
For example, take this imaginary asthma variable:
def has_asthma_at(date):
asthma_codes = codelist_from_url(
"https://www.opencodelists.org/codelist/opensafely/asthma-codes/72ce1380/"
)
asthma_meds = codelist_from_url(
"https://www.opencodelists.org/codelist/opensafely/asthma-meds/72ce1380/"
)
has_previous_code = events.take(events.code.is_in(asthma_codes)).exists_for_patient()
has_recent_meds = (
meds.filter(meds.date >= date - days(90))
.take(meds.codes.is_in(asthma_meds))
.exists_for_patient()
)
return has_previous_code & has_recent_meds
In theory, you could copy this as-is into your own study and have it work. But under the old system that's not possible. You'd have to open the codelists.py
file to find the CSVs these codelists came from, then dig out the codelists.txt
file to find the URLs these CSVs came from, and then copy all the relevant bits into your study.
And even if you don't want to copy a variable but just want to understand how it's defined, having things defined inline makes it much easier to jump back to OpenCodelists to find out more about the construction of the codelists.
Current workflow
The current workflow for using codelists in an OpenSAFELY study definition looks something like this:
codelists.txt
opensafely codelists update
to download itcodelist_from_csv
functionAs a bonus step, it looks like a pattern has emerged whereby users do the
codelist_from_csv
call in a file calledcodelists.py
and then dofrom codelists import *
in the study definition.I think there are a few issues with this workflow:
from codelists import *
pattern makes this even harder.)Proposal
We no longer have a
codelists.txt
file.Study authors should reference codelists in a study directly by URL e.g.
The
opensafely codelists update
command would get the URLs out of the study definitions and then download as normal (exact mechanics of this TBC, but I have some ideas).We already store some codelist metadata in a
codelists.json
file. We extend this to include details of the coding system used and relevant columns so the user doesn't have to specify these again.Additional notes:
codelist
function accept a path to a local CSV file, in which case the coding system and column argument would be required.opensafely codelists update
step run automatically when running the cohortextractor locally to reduce the steps required even further.