Proposal: new approach to codelist managment

evansd commented 3 years ago

Current workflow

The current workflow for using codelists in an OpenSAFELY study definition looks something like this:

Create (or discover) a codelist on OpenCodelists
Add (part of) its URL to a file called codelists.txt
Run opensafely codelists update to download it
In a study definition, import the codelist_from_csv function
Load the codelist using its filename and specify the columns to read and the coding system in use.

As a bonus step, it looks like a pattern has emerged whereby users do the codelist_from_csv call in a file called codelists.py and then do from codelists import * in the study definition.

I think there are a few issues with this workflow:

It's a bit clunky for the user.
It makes it hard for someone looking at a study to jump quickly from the use of a codelist to the corresponding page on OpenCodelists. (And the from codelists import * pattern makes this even harder.)
Having to specify the coding system and columns again is not only duplicative work but it adds the possibility of user error.

Proposal

We no longer have a codelists.txt file.

Study authors should reference codelists in a study directly by URL e.g.

from cohortextractor import codelist

pulse_oximetry_codes = codelist(
    'https://www.opencodelists.org/codelist/opensafely/pulse-oximetry/72ce1380/'
)

The opensafely codelists update command would get the URLs out of the study definitions and then download as normal (exact mechanics of this TBC, but I have some ideas).
We already store some codelist metadata in a codelists.json file. We extend this to include details of the coding system used and relevant columns so the user doesn't have to specify these again.

Additional notes:

We can continue to support codelists sourced from elsewhere by making the codelist function accept a path to a local CSV file, in which case the coding system and column argument would be required.
It might be possible to have something like the opensafely codelists update step run automatically when running the cohortextractor locally to reduce the steps required even further.
Some of this might require extra work on the OpenCodelists side as I'm not 100% sure that this currently provides all the metadata we need in an easily consumable way.

sebbacon commented 3 years ago

Looks good, quick observations:

You've not explicitly mentioned the current step where the use commits the specific version of the codelist data to the study repo. I assume this would stay the same.
This makes it a bit easier for people to try out using non-supported URLs. Do we want to define the contract expected of the URL, e.g. by adding (requiring?) a .csv extension on the URL? Do we want to assert that it's only www.opencondelists.org?
Backwards compatibility is super-simple as far as I can see.

alexwalkerepi commented 3 years ago

I think this is great and would substantially simplify the user experience. The column argument may still be required for opencodelists codelists, as for older codelists at least, there is no standardisation of the name of the column containing the codes.

Jongmassey commented 3 years ago

I like the sound of this, too.

This makes it a bit easier for people to try out using non-supported URLs. Do we want to define the contract expected of the URL, e.g. by adding (requiring?) a .csv extension on the URL? Do we want to assert that it's only www.opencondelists.org?

The ability to have opensafely codelists be aware of local codelists and not nuke them during an update would be nice. If there's steps required beyond "loading a flat-file" that apply only to opencodelists URLs (e.g. one might wish to load a third-party csv over http) then is there an argument for having an opencodelists.org-specific method?

sebbacon commented 3 years ago

This makes it a bit easier for people to try out using non-supported URLs. Do we want to define the contract expected of the URL, e.g. by adding (requiring?) a .csv extension on the URL? Do we want to assert that it's only www.opencondelists.org?

The ability to have opensafely codelists be aware of local codelists and not nuke them during an update would be nice. If there's steps required beyond "loading a flat-file" that apply only to opencodelists URLs (e.g. one might wish to load a third-party csv over http) then is there an argument for having an opencodelists.org-specific method?

I think there might be. I'd add, though, that while we should offer escape hatches, we shouldn't make them too easy. We want to encourage best practice through reuse and iteration and I'd like people to somehow justify (if only to themselves) opting out of this

benbc commented 3 years ago

You've not explicitly mentioned the current step where the use commits the specific version of the codelist data to the study repo.

Could the codelist URLs include a version?

evansd commented 3 years ago

Could the codelist URLs include a version?

Yes, it does already. That's the 72ce1380 bit on the end of https://www.opencodelists.org/codelist/opensafely/pulse-oximetry/72ce1380/

benbc commented 3 years ago

Could the codelist URLs include a version?

Yes, it does already.

So Seb's suggestion that it stays the same is wrong? The user commits the URL to the repo. We don't ever need to put the codelist itself into the repo but can optionally cache it transparently.

evansd commented 3 years ago

I might be misunderstanding here, but I think what Seb was saying is correct in that he was just clarifying something I'd assumed but not spelled out explicitly. Codelists will still be downloaded and committed to the repo, it's just that the way the user specifies which codelists they need changes to be a more direct and less faffy.

benbc commented 3 years ago

Why commit them to the repo?

On Thu, 8 Jul 2021 at 15:48, Dave Evans @.***> wrote:

I might be misunderstanding here, but I think what Seb was saying is correct in that he was just clarifying something I'd assumed but not spelled out explicitly. Codelists will still be downloaded and committed to the repo, it's just that the way the user specifies which codelists they need changes to be a more direct and less faffy.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/opensafely-core/cohort-extractor-v2/issues/31#issuecomment-876503323, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAA7N3XNOHDOF7453J4D4ILTWW3EJANCNFSM5AAYGIJQ .

-- Ben Butler-Cole Platform Lead, The DataLab @.*** thedatalab.org https://thedatalab.lorg | @ebmdatalab https://twitter.com/ebmdatalab

sebbacon commented 3 years ago

Why commit them to the repo?

There's a principle that I had at the beginning that everything you need to understand a project should be in github - mainly because it was the only possible place to do this at the time, but also that takes care of longevity to some extent; and backing up for posterity is just one repo.

As we start to move away from github as the primary source of truth we could start to relax that, but I've a mild preference for not changing it until we have better archiving / backup procedures in place. Without some other change we break everything if we lose opencodelists.

Also it's nice to be able to run your code locally without network access, this guarantees this is possible following an initial clone. Again, not essential by a long shot.

evansd commented 3 years ago

it's nice to be able to run your code locally without network access

In this case it's more than nice, it's essential as it needs to run in the secure environment without any network access.

I think also for audit and transparency reasons we need to have the codelists as part of the committed code otherwise we can't guarantee to the public that the commit ID we publish shows the actual query that was run against the data.

And operationally it means that the OpenCodelists site isn't on the critical path for people getting work done in quite the same way.

So lots of benefits I think to keeping things working in this way.

inglesp commented 3 years ago

I like this, it reduces friction.

Some of this might require extra work on the OpenCodelists side as I'm not 100% sure that this currently provides all the metadata we need in an easily consumable way.

We'd need an API endpoint that returns metadata and codes, but this would be easy to write. Existing endpoints are at /api/v1/..., and we could either make the codelist() function convert a non-API URL into an API URL, or we could make non-API URLs return JSON given the right content header.

evansd commented 2 years ago

I think there's already general agreement that something like the above would be an improvement. But, for what it's worth, I've come to realise there's another reason that referencing codelists inline by URL is really important. And that is that without it we can't have variables expressed as self-contained chunks of "codelists plus logic", as Ben refers to them.

For example, take this imaginary asthma variable:

def has_asthma_at(date):
    asthma_codes = codelist_from_url(
        "https://www.opencodelists.org/codelist/opensafely/asthma-codes/72ce1380/"
    )
    asthma_meds = codelist_from_url(
        "https://www.opencodelists.org/codelist/opensafely/asthma-meds/72ce1380/"
    )
    has_previous_code = events.take(events.code.is_in(asthma_codes)).exists_for_patient()
    has_recent_meds = (
        meds.filter(meds.date >= date - days(90))
        .take(meds.codes.is_in(asthma_meds))
        .exists_for_patient()
    )
    return has_previous_code & has_recent_meds

In theory, you could copy this as-is into your own study and have it work. But under the old system that's not possible. You'd have to open the codelists.py file to find the CSVs these codelists came from, then dig out the codelists.txt file to find the URLs these CSVs came from, and then copy all the relevant bits into your study.

And even if you don't want to copy a variable but just want to understand how it's defined, having things defined inline makes it much easier to jump back to OpenCodelists to find out more about the construction of the codelists.

opensafely-core / ehrql

Proposal: new approach to codelist managment #31

Current workflow

Proposal