opensafely / tpp-sql-notebook

Proof of Concept for new Covariates library #53

Open evansd opened 4 years ago

evansd commented 4 years ago

This rewrites the Analysis MVP notebook using our new datalab_covariates library. The internals of this library will no doubt change dramatically but hopefully we can keep the public API fairly stable.

I've kept the notebook as close as possible to its existing form but replaced various SQL queries and some Pandas logic with calls to our library. It's probably best to look at the diff for notebooks/diffable_python/analysis.py to see the kinds of changes involved.

sebbacon commented 4 years ago

@CarolineMorton:

We have been given a lot of Read codes or med codes in the form of CSV (or Excel) files; see issues... I wondered whether there is scope to point to a CSV file to read in the codes, as well as the option to read them from a dataframe, just to add flexibility?

This is something we need to discuss with @inglesp this morning. The goal is for us to package what @evansd has written into a standalone module; add tests; and integrate with some kind of "codelist" module (TBC).
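As a rough illustration of the CSV option raised above, here is a minimal sketch using only the standard library. The function name `codelist_from_csv`, the default column name `code`, and the example codes are all hypothetical; none of them come from the library in this PR or from any agreed "codelist" module.

```python
import csv
import io


def codelist_from_csv(source, column="code"):
    """Return the unique, non-empty codes found in one column of a CSV file.

    `source` may be a filesystem path or an already-open text file object.
    The function name and default column name are assumptions for illustration.
    """
    if isinstance(source, str):
        with open(source, newline="") as f:
            return codelist_from_csv(f, column=column)

    codes = []
    seen = set()
    for row in csv.DictReader(source):
        # Missing columns or short rows yield None, which is treated as empty.
        code = (row.get(column) or "").strip()
        if code and code not in seen:
            seen.add(code)
            codes.append(code)
    return codes


# Example with an in-memory CSV; the codes are made-up placeholders.
example_csv = io.StringIO(
    "code,description\nAAA01,first\nBBB02,second\nAAA01,duplicate\n"
)
example_codes = codelist_from_csv(example_csv)
```

A dataframe-based reader could sit alongside this and return the same plain list, so that whatever consumes the codes does not need to know where they came from.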

The ultimate vision is that our study cohort definition file would look something like this:

from peters_codelist_thing import codelist
import daves_cohort_thing as dct

cvd_meds = codelist("qof:cvd_meds", coding_system="snomed", version="1.2")
chd_codes = codelist("lshtm:chd_clinical_codes", coding_system="ctv3", version="1.5")
smoking_codes = codelist(
    "smoking_clinical_codes", coding_system="ctv3", version="latest"
)

model_input_definition = {
    "cvd_meds": dct.patients_with_these_medications(snomed_codes=cvd_meds),
    "chd_code": dct.patients_with_these_clinical_events(ctv3_codes=chd_codes),
    "age_and_sex": dct.patients_with_age_and_sex("today"),
    "smoking_status": dct.patients_with_these_clinical_events(
        ctv3_codes=smoking_codes, min_date="2015-01-01", max_date="2020-03-31"
    ),
}

And then our workflow would do something like this:

from definition import model_input_definition
import daves_cohort_thing as dct

dct.generate_model_input_definition(model_input_definition)

(This is a straw man only; the point is that something like the above is all a statistician would need to write to generate either dummy data to play with or real data to run the real model on.)
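To make the "dummy data" half of that idea concrete, here is a minimal sketch of how a declarative definition might be turned into random placeholder rows. Everything in it is an assumption for illustration: the `Covariate` class, the two builder functions (named after the straw man but not taken from any real module), `generate_dummy_rows`, and the placeholder code value. It says nothing about how the actual library works internally.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Covariate:
    """A declarative description of one covariate (hypothetical structure)."""
    kind: str
    params: dict = field(default_factory=dict)


def patients_with_age_and_sex(reference_date):
    # Only records what was asked for; no query is run here.
    return Covariate("age_and_sex", {"reference_date": reference_date})


def patients_with_these_clinical_events(ctv3_codes, min_date=None, max_date=None):
    return Covariate(
        "clinical_events",
        {"ctv3_codes": list(ctv3_codes), "min_date": min_date, "max_date": max_date},
    )


def generate_dummy_rows(definition, n_patients=5, seed=0):
    """Produce one random row per fake patient from a definition dict."""
    rng = random.Random(seed)  # seeded so the dummy data is reproducible
    rows = []
    for patient_id in range(1, n_patients + 1):
        row = {"patient_id": patient_id}
        for name, covariate in definition.items():
            if covariate.kind == "age_and_sex":
                row[name] = {"age": rng.randint(0, 100), "sex": rng.choice(["F", "M"])}
            elif covariate.kind == "clinical_events":
                # A coin flip standing in for "has a matching event".
                row[name] = rng.random() < 0.5
            else:
                raise ValueError(f"Unknown covariate kind: {covariate.kind}")
        rows.append(row)
    return rows


definition = {
    "age_and_sex": patients_with_age_and_sex("today"),
    "chd_code": patients_with_these_clinical_events(ctv3_codes=["PLACEHOLDER1"]),
}
dummy_rows = generate_dummy_rows(definition, n_patients=3, seed=42)
```

Because the definition only describes what is wanted, the same dictionary could in principle be handed to a different backend that builds real SQL queries instead of random values.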