Open sebbacon opened 4 years ago
I guess one question is whether this all goes into the study definition itself. Given that we've already hit cases where we want a single repo but multiple study definitions I suspect the answer is no. But it would be nice to have some conceptually neat answer as to what belongs in the study definition and what belongs in the config.
Here is a straw man:
name: Factors associated with Covid-19 in hospital admissions in England
study:
  # In the following actions, `id` is used like a Makefile target,
  # i.e. if it exists the step is skipped.
  #
  # Arguments provided from elsewhere in the framework
  actions:
    - name: Asthma
      id: asthma_study_definition
      runner:
        operation: generate_cohort
        version: 0.12.6
    - name: COPD
      id: copd_study_definition
      runner:
        operation: generate_cohort
        version: 0.12.6
    - name: Postprocess files
      id: postprocess
      needs: [asthma_study_definition, copd_study_definition]
      runner:
        operation: process-stata
        version: 0.3
        args: analysis/postprocess.do
      outputs:
        - path: intermediate.dta
          for_publication: false
    - name: Create crosstabs
      id: crosstabs
      needs: postprocess
      runner:
        operation: process-stata
        version: 0.3
        script: analysis/crosstabs.do
      outputs:
        - path: crosstabs.log
          for_publication: false
        - path: crosstabs_output.txt
          for_publication: true
    - name: Create regression
      needs: postprocess
      runner:
        operation: process-python
        version: 0.5
        script: analysis/regression.py
      outputs:
        - path: regression.log
          for_publication: false
        - path: regression_output.md
          for_publication: true
        - path: km_curve.png
          for_publication: true
# Relevant IG approvals for the project
governance:
  - panel: SAGE
    reference: "2020-03-01"
    citation: "https://www.gov.uk/government/groups/scientific-advisory-group-for-emergencies-sage-coronavirus-covid-19-response/minutes/"
  - panel: Institutional Ethics
    reference: "xyz123"
    citation: "https://ox.ac.uk/blah"
organisations:
  - name: University of Oxford
  - name: London School of Hygiene and Tropical Medicine
principal_investigator:
  - name: Ben Goldacre
    email: ben.goldacre@phc.ox.ac.uk
These settings would be used by the `job-runner` framework to construct and run docker commands (in combination with arguments passed via the REST job server), and would redirect output to predictable folders.
The `outputs` are used like Makefile targets, but also for tooling around (future) output egress functionality.
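To make the "Makefile target" idea concrete, here is a minimal sketch, assuming an action is a plain dict parsed from the YAML above and that output paths are relative to a working directory. The function names `action_is_up_to_date` and `publishable_outputs` are made up for illustration and are not part of any existing tool.

```python
from __future__ import annotations

from pathlib import Path


def action_is_up_to_date(action: dict, workdir: Path) -> bool:
    """Return True when every declared output of the action already exists."""
    outputs = action.get("outputs", [])
    if not outputs:
        # Nothing declared means nothing to check, so the action always runs.
        return False
    return all((workdir / output["path"]).exists() for output in outputs)


def publishable_outputs(action: dict) -> list[str]:
    """List the output paths flagged for_publication (candidates for egress)."""
    return [o["path"] for o in action.get("outputs", []) if o.get("for_publication")]


# The crosstabs action from the straw man, as it might look once parsed.
crosstabs = {
    "id": "crosstabs",
    "outputs": [
        {"path": "crosstabs.log", "for_publication": False},
        {"path": "crosstabs_output.txt", "for_publication": True},
    ],
}
```

A runner could skip any action for which `action_is_up_to_date` is true, and egress tooling could restrict itself to `publishable_outputs`.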
Threads in my thinking:
Note: The DPIA will define: the legal bases under GDPR/DPA18 used to process data and how the Common Law Duty of Confidence is addressed; who can access; method of processing and access; type of output. [At this stage the Data Processing Agreement is dealt with; most privacy notices will be too, unless new data sets come on board whose data providers need their privacy notices updating]
Approval and Prioritisation - by our governance process (currently in development; PI Ben and Liam; will also have NHSE involvement)
IS STUDY IN SCOPE for OpenSAFELY?
DECISION CHECKPOINT: approval and then prioritisation (requires a framework)
Initiation of Research
During Research
Pre-publication
Thanks Amir. For our currently-approved studies, could you have a go at listing the relevant approvals, references and perhaps URLs when you get a chance?
My straw man would start like this:
governance:
  - panel: SAGE
    reference: "2020-03-01"
    citation: "https://www.gov.uk/government/groups/scientific-advisory-group-for-emergencies-sage-coronavirus-covid-19-response/minutes/"
  - panel: Institutional Ethics
    reference: "xyz123"
    citation: "https://ox.ac.uk/blah"
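If this shape sticks, a check along the following lines could flag incomplete entries. It is only a sketch: treating all three of `panel`, `reference` and `citation` as required is an assumption (nothing above settles which are mandatory), and `missing_governance_fields` is a hypothetical helper name.

```python
from __future__ import annotations

# Assumption: every governance entry needs all three keys.
REQUIRED_KEYS = ("panel", "reference", "citation")


def missing_governance_fields(governance: list[dict]) -> dict[str, list[str]]:
    """Map each incomplete entry (by panel name or position) to its missing keys."""
    problems: dict[str, list[str]] = {}
    for index, entry in enumerate(governance):
        missing = [key for key in REQUIRED_KEYS if not entry.get(key)]
        if missing:
            label = entry.get("panel") or f"entry {index}"
            problems[label] = missing
    return problems


governance = [
    {"panel": "SAGE", "reference": "2020-03-01", "citation": "https://www.gov.uk/"},
    {"panel": "Institutional Ethics", "reference": "xyz123"},  # no citation yet
]
```

Here `missing_governance_fields(governance)` reports only the second entry, with `citation` missing.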
I think broadly speaking all the metadata stuff looks sensible. The only thing I'd add would be a top-level `format_version: 1.0` or something like that, just so we can safely make changes to the config format and still interpret (and even programmatically rewrite) older config files.
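As a rough illustration of what the version key buys us, here is a sketch of a programmatic rewrite. It assumes a hypothetical version `2.0` whose only change is that `actions` becomes a dict keyed by id instead of a list; the version numbers and the `upgrade_config` name are illustrative only.

```python
import copy

CURRENT_FORMAT_VERSION = "2.0"  # hypothetical newer format


def upgrade_config(config: dict) -> dict:
    """Return a copy of the config rewritten to the current format version."""
    config = copy.deepcopy(config)
    # YAML parses `format_version: 1.0` as a float, so normalise to a string.
    version = str(config.get("format_version", "1.0"))
    if version == "1.0":
        # Assumed change in 2.0: actions are keyed by id rather than listed.
        config["actions"] = {
            action["id"]: {k: v for k, v in action.items() if k != "id"}
            for action in config.get("actions", [])
        }
        version = "2.0"
    if version != CURRENT_FORMAT_VERSION:
        raise ValueError(f"Unsupported format_version: {version}")
    config["format_version"] = version
    return config
```

Because each step only handles one version transition, further formats could be supported by adding one more `if` block per release.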
For `actions` there are various bikesheddy changes I'd make, but all are up for discussion. In general, where we can mirror the syntax of GitHub Actions I think we might as well do that. The changes are:
- Make `actions` a dict with the top-level keys as job ids, rather than a list where each member has an `id` attribute. This is what GA does and it's slightly tidier, I think.
- Specify the version of the study definition file inside the study definition itself, rather than in the actions.
- Rather than `runner`, `operation` and `args`, specify a `run` argument as a shell invocation. I think it's really helpful to have something that can potentially be used verbatim to run locally.
- Follow GA in using `runs-on` to specify the Docker image used. This would obviously be one from a list of pre-built images that we supply. The default would be an `opensafely-base` image which contains the cohortextractor and probably Python and some other bits and bobs.
Rewritten example would look like:
actions:
  generate_asthma_cohort:
    name: Asthma
    run: cohortextractor extract analysis/asthma_study_definition.yaml
    outputs:
      - output/asthma_cohort.csv
  generate_copd_cohort:
    name: COPD
    run: cohortextractor extract analysis/copd_study_definition.yaml
    outputs:
      - output/copd_cohort.csv
  postprocess:
    name: Postprocess files
    needs: [generate_asthma_cohort, generate_copd_cohort]
    runs-on: opensafely-stata-16
    run: stata analysis/postprocess.do
    outputs:
      - path: intermediate.dta
        for_publication: false
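With `actions` as a dict, working out a run order from `needs` is straightforward. The following is a minimal sketch using the standard library's `graphlib` (Python 3.9+); `execution_order` is a hypothetical helper, and accepting either a single string or a list for `needs` is an assumption based on the earlier straw man.

```python
from __future__ import annotations

from graphlib import TopologicalSorter


def execution_order(actions: dict) -> list[str]:
    """Order action ids so that each one comes after everything it needs."""
    graph = {}
    for action_id, action in actions.items():
        needs = action.get("needs", [])
        if isinstance(needs, str):
            needs = [needs]  # allow `needs: postprocess` as well as a list
        unknown = [name for name in needs if name not in actions]
        if unknown:
            raise KeyError(f"{action_id} needs undefined action(s): {unknown}")
        graph[action_id] = set(needs)
    # static_order() raises graphlib.CycleError on circular dependencies.
    return list(TopologicalSorter(graph).static_order())


actions = {
    "generate_asthma_cohort": {"name": "Asthma"},
    "generate_copd_cohort": {"name": "COPD"},
    "postprocess": {
        "name": "Postprocess files",
        "needs": ["generate_asthma_cohort", "generate_copd_cohort"],
    },
}
order = execution_order(actions)
```

The `unknown` check also catches a `needs` entry that names an id which is not a key of `actions`, which is easy to get wrong when ids are renamed.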
Thanks! This is really useful
> Specify the version of the study definition file inside the study definition itself, rather than in the actions.

This makes sense for the `cohortextractor` tool, but not so much for arbitrary Python or Stata scripts?
I think there are two ways we could do this. One is to have the version in the Docker image name (e.g. `opensafely-stata-16`), which would be set by the `runs-on` attribute. The other would be to specify it as part of the `run` attribute, e.g. `python3.7 my_script.py`, and build multiple versions of Python into a single Docker image.
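Either way, turning an action into a docker invocation could look roughly like this. It is a sketch under assumptions: the `/workspace` mount point and the `docker_command` helper are invented for illustration, and the image names are simply the ones mentioned above.

```python
from __future__ import annotations

import shlex

DEFAULT_IMAGE = "opensafely-base"  # the default image suggested above


def docker_command(action: dict, workdir: str) -> list[str]:
    """Build a `docker run` argument list from an action's runs-on and run keys."""
    image = action.get("runs-on", DEFAULT_IMAGE)
    return [
        "docker", "run", "--rm",
        # Assumption: the study repo is mounted at /workspace in the container.
        "--volume", f"{workdir}:/workspace",
        "--workdir", "/workspace",
        image,
        *shlex.split(action["run"]),
    ]


# Version pinned via the image name:
stata_cmd = docker_command(
    {"runs-on": "opensafely-stata-16", "run": "stata analysis/postprocess.do"},
    "/srv/study",
)
# Version pinned via the command, using the default image:
python_cmd = docker_command({"run": "python3.7 my_script.py"}, "/srv/study")
```

The resulting list can be handed to `subprocess.run`, and the `run` string on its own stays usable verbatim for local runs.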
> My straw man would start like this:

Example for risk factors study. I suspect we should ask the authors of the study protocols to fill this out. I could do for ICS once we think this works (and obviously evolve it with time if needed).
governance:
  - data controller: NHS England
  - data providers and datasets: NHS England (CPNS); GP surgeries supplied by TPP (GP data); ONS (ONS death) //do you want this broken down?
  - data processor: TPP //(and / or other future EHRs vendors)
  - data sharing agreement (between NHSE and other): ONS (ONS Death)
  - legal basis (GDPR/DPA 2018): Article 6(1)(e); Article 9(2)(h), Article 9(2)(i), Article 9(2)(j)
  - common law duty of confidentiality: set aside under section 251 NHS Act 2006, regulation 3(4) of the Health Service (Control of Patient Information) Regulations 2002
  - DPIA citation: done - not available
    citation: https://www.gov.uk/government/publications/coronavirus-covid-19-notification-of-data-controllers-to-share-information
  - other approval panels: N/A //e.g. DARS/IGARD; CAG; National Data Guardian; ICO.
  - oversight: OpenSAFELY oversight board; day to day approval process (NHSE and Datalab)
  - study requestors: N/A // e.g. SAGE, NHSE, PHE, other research groups, etc.
  - study reference ID: 20200301OS1 //e.g. [Date received + requester Org ID + number], number if multiple from same requestors on same day; here OS = OpenSAFELY advisory group
    citation: link to study protocol [here GitHub]
  - researchers: University of Oxford DataLab / London School Hygiene and Tropical Medicine EHR Group
  - contract type: NHS England (honorary); Data Access Agreement //could have option of "None - delivered by OS contract holders"
  - code of conduct: N/A //not done this yet [option ideally, "signed"]
  - ethics panel: London School Hygiene and Tropical Medicine
    reference: (21863)
    citation: not available
  - ethics panel: Health Research Authority
    reference: 20/LO/0651
    citation: https://drive.google.com/file/d/1fDYWJuglIbqE_LFN63wyBd9G1HBYf6ck/view //ideally put on website if no external URL
  - privacy notice citation: https://www.england.nhs.uk/contact-us/privacy-notice/how-we-use-your-information/covid-19-response/coronavirus-covid-19-research-platform/
  - data retention period (pseudonymised): Sept 30 2020 - this is a property of COPI in our case and got extended
  - data retention period (de-identified): 2 years
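The study reference ID scheme described above ([date received + requester org ID + number]) can be sketched in a few lines. This assumes the date is written as YYYYMMDD and the org ID is a short run of capital letters, which is what the `20200301OS1` example suggests; the function names are hypothetical.

```python
from __future__ import annotations

import re
from datetime import date, datetime


def study_reference_id(received: date, requester_org: str, sequence: int) -> str:
    """Build an ID from date received, requester org ID and a same-day sequence number."""
    return f"{received:%Y%m%d}{requester_org}{sequence}"


def parse_study_reference_id(reference: str) -> tuple[date, str, int]:
    """Split an ID back into its parts, assuming an all-caps org ID."""
    match = re.fullmatch(r"(\d{8})([A-Z]+)(\d+)", reference)
    if match is None:
        raise ValueError(f"Not a study reference ID: {reference!r}")
    received = datetime.strptime(match.group(1), "%Y%m%d").date()
    return received, match.group(2), int(match.group(3))


example_id = study_reference_id(date(2020, 3, 1), "OS", 1)
```

With the inputs shown, `example_id` is the `20200301OS1` value used in the governance block.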
Looks great, some quick thoughts:
Yes, Study Reference ID for us to track at OS level, in anticipation of many coming through a more formal tracking process.
Yes we could publish the DPIA as a citation - the privacy statement is a must however. The DPIA would possibly at times require some redactions for certain risks we want to withhold or commercial sensitivities. We are likely to find most of the content in a good privacy notice and good security section but it is a good summary. Let's put that in as a heading - I'll amend.
With whom do we (i.e. did we) have a data sharing agreement: data providers often want a data sharing agreement as a governance step, so in the RF study, as we only used ONS data as the external provider, I've included us having the data sharing agreement with them.
WIP notes
governance:
"OUR" data controller: NHS England (always the same). In the future there might be more than one data controller. Property of the project.
data providers and datasets: sources (ONS, etc)
data processor: always TPP
data sharing agreement: we have no agreement with GPs. Every dataset that comes in should have a DSA, with exception of GPs.
every dataset has to have an originating controller
every DSA must be between "OUR" controller and another controller. So for ONS death: data controller is ONS, ISARIC it's university of oxford, etc.
legal basis for the dataset (GDPR/DPA 2018): Article 6(1)(e); Article 9(2)(h), Article 9(2)(i), Article 9(2)(j) (two needed: "personal" (art 6) and "sensitive" (art 9); (happens to be the same for all datasets at the moment)
common law duty of confidentiality: property of a dataset set aside by this is COPI regulation 3 under section 251 NHS Act 2006, regulation 3(4) of the Health Service (Control of Patient Information) Regulations 2002. This is important. When you share data there are 2 aspects - this is the case-law version, about demonstrable harm. There is another power which can be used which is CAG approval.
other approval panels: N/A //e.g. DARS/IGARD; CAG; National Data Guardian; ICO. This is not required, just for info; for example, part of getting a DSA might include going through particular panels.
Platform oversight: board includes NHSE and checks and approves individual project. Would be some kind of approval document that we will grant a project. not legal requirement. good practice. OpenSAFELY advisory group. Property of the project.
study requestors: N/A // e.g. SAGE, NHSE, PHE, other research groups, etc. Just for info. Reason why we're doing it etc.
study reference ID: 20200301OS1 //e.g. [Date received + requester Org ID + number], number if multiple from same requestors on same day; here OS = OpenSAFELY advisory group]: link to an official registry entry for protocol or similar AND a reference we generate?
citation: link to study protocol [here GitHub]
researchers: University of Oxford DataLab / London School Hygiene and Tropical Medicine EHR Group
contract type: NHS England (honorary); Data Access Agreement //could have option of "None - delivered by OS contract holders". Property of the project. At the moment they're all NHSE or None. in the future it may be that they sign a DSA agreement with NHSE that makes them a data controller in common with NHSE. So this is kind of "who gets sued" - MISS OUT FOR NOW
code of conduct: N/A //not done this yet [option ideally, "signed"]
ethics panel: London School Hygiene and Tropical Medicine - per project. We've used the same ethics approval for all OpenSAFELY so far. The following two for everything. reference: (21863) citation: not available
ethics panel: Health Research Authority reference: 20/LO/0651 citation: https://drive.google.com/file/d/1fDYWJuglIbqE_LFN63wyBd9G1HBYf6ck/view //ideally put on website if no external URL
data retention period (pseudonymised): Sept 30 2020 property of the platform AND datasets/DSA. (this derived from the COPI notice)
data retention period (de-identified): 2 years property of the DSA
WHICH ARE REQUIRED?
Oversight gov, project gov, dataset gov
1) is asking and checking all the other properties of the project 2) is about ethics etc 3) is about DSA etc
@ghickman just came across this issue which is relevant to your thinking about onboarding data models.
We've discussed in passing before: fundamentally it implies there will be a link between user -> DSA -> permitted underlying tables.
Currently every user has one DSA which is with NHSE who give permission to all underlying tables, but this will change
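The user -> DSA -> permitted tables link could be modelled as two lookups. This is only a sketch to pin down the idea: the user names and table names are invented, and a real implementation would presumably live in a database rather than in dicts.

```python
from __future__ import annotations

# Hypothetical data: which tables each DSA covers, and which DSAs each user holds.
DSA_TABLES = {
    "NHSE": {"gp_records", "cpns", "ons_deaths"},
    "ISARIC": {"isaric"},
}
USER_DSAS = {
    "alice": {"NHSE"},
    "bob": {"NHSE", "ISARIC"},
}


def permitted_tables(user: str, user_dsas: dict, dsa_tables: dict) -> set[str]:
    """Union of the tables covered by every DSA the user is party to."""
    tables: set[str] = set()
    for dsa in user_dsas.get(user, ()):
        tables |= dsa_tables.get(dsa, set())
    return tables
```

Today every user would map to the single NHSE agreement; the second lookup only starts to matter once DSAs cover different subsets of tables.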
Example IG approvals: