opensafely-core / cohort-extractor

Cohort extractor tool which can generate dummy data, or real data against OpenSAFELY-compliant research databases

Add IG permissions to project.yaml #172

Open sebbacon opened 4 years ago

sebbacon commented 4 years ago

Example IG approvals:

    # Relevant IG approvals for the project; delete sections that do not apply (values are examples only)
    governance_process = [
        {
            "panel": "SAGE",
            "reference": "2020-04-01",
            "citation": "https://www.gov.uk/government/groups/scientific-advisory-group-for-emergencies-sage-coronavirus-covid-19-response/minutes/",
        },
        {
            "panel": "Institutional Ethics",
            "reference": "UoX-1234-56",
            "citation": "https://www.foo.ac.uk/somefile.pdf",
        },
        {
            "panel": "NHS Digital / IGARD",
            "reference": "DARS-123-4567-8900-v1.1",
            "citation": "https://digital.nhs.uk/binaries/content/assets/website-assets/corporate-information/corporate-information-and-documents/igard/igard-minutes---28th-may-2020-final.pdf.pdf",
        },
        {
            "panel": "HRA / CAG",
            "reference": "99/CAG/1234",
            "patient_dissent_override": "yes: CAG approval",
            "citation": "https://www.hra.nhs.uk/documents/1715/CAG_Meeting_-_07_February_2019.pdf",
        },
    ]

evansd commented 4 years ago

I guess one question is whether this all goes into the study definition itself. Given that we've already hit cases where we want a single repo but multiple study definitions I suspect the answer is no. But it would be nice to have some conceptually neat answer as to what belongs in the study definition and what belongs in the config.

sebbacon commented 4 years ago

Here is a straw man:


name: Factors associated with Covid-19 in hospital admissions in England

study:
  # In the following actions, `id` is used like a Makefile target,
  # i.e. if it exists the step is skipped.
  #
  # Arguments are provided from elsewhere in the framework
  actions:
    - name: Asthma
      id: asthma_study_definition
      runner:
        operation: generate_cohort
        version: 0.12.6

    - name: COPD
      id: copd_study_definition
      runner:
        operation: generate_cohort
        version: 0.12.6

    - name: Postprocess files
      id: postprocess
      needs: [asthma_study_definition, copd_study_definition]
      runner:
        operation: process-stata
        version: 0.3
        args: analysis/postprocess.do
      outputs:
        - path: intermediate.dta
          for_publication: false

    - name: Create crosstabs
      id: crosstabs
      needs: postprocess
      runner:
        operation: process-stata
        version: 0.3
      script: analysis/crosstabs.do
      outputs:
        - path: crosstabs.log
          for_publication: false
        - path: crosstabs_output.txt
          for_publication: true

    - name: Create regression
      needs: postprocess
      runner:
        operation: process-python
        version: 0.5
      script: analysis/regression.py
      outputs:
        - path: regression.log
          for_publication: false
        - path: regression_output.md
          for_publication: true
        - path: km_curve.png
          for_publication: true

# Relevant IG approvals for the project
governance:
  - panel: SAGE
    reference: "2020-03-01"
    citation: "https://www.gov.uk/government/groups/scientific-advisory-group-for-emergencies-sage-coronavirus-covid-19-response/minutes/"
  - panel: Institutional Ethics
    reference: "xyz123"
    citation: "https://ox.ac.uk/blah"

organisations:
  - name: University of Oxford
  - name: London School of Hygiene and Tropical Medicine

principal_investigator:
  - name: Ben Goldacre
    email: ben.goldacre@phc.ox.ac.uk

The job-runner framework would use these settings, in combination with arguments passed via the REST job server, to construct and run Docker commands, and would redirect output to predictable folders.

The outputs are used like Makefile targets, and would also support tooling for (future) output egress functionality.
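
To make the Makefile analogy concrete, here is a minimal sketch of how `needs` could be resolved into an execution order, skipping actions that have already been built. It is an illustration only, not the job-runner's actual logic: the `ACTIONS` list is a hand-written stand-in for the parsed YAML, and the function names and the `already_built` set are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical, already-parsed form of the `actions` list from the straw man.
ACTIONS = [
    {"id": "asthma_study_definition"},
    {"id": "copd_study_definition"},
    {"id": "postprocess", "needs": ["asthma_study_definition", "copd_study_definition"]},
    {"id": "crosstabs", "needs": "postprocess"},
]


def as_list(value):
    """Accept `needs: postprocess` as well as `needs: [a, b]`."""
    if value is None:
        return []
    return [value] if isinstance(value, str) else list(value)


def execution_order(actions):
    """Return action ids ordered so that every action follows its `needs`."""
    graph = {a["id"]: set(as_list(a.get("needs"))) for a in actions}
    unknown = {dep for deps in graph.values() for dep in deps} - set(graph)
    if unknown:
        raise ValueError(f"needs refers to undefined actions: {sorted(unknown)}")
    return list(TopologicalSorter(graph).static_order())


def actions_to_run(actions, already_built):
    """Makefile-style: skip any action whose id is already marked as built."""
    return [i for i in execution_order(actions) if i not in already_built]
```

With all three upstream ids marked as built, only `crosstabs` would be left to run. How "built" is detected (an output file existing, a record in the job server) is left open here.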

amirmehrkar commented 4 years ago

Threads in my thinking:

Note: The DPIA will define: the legal bases under GDPR/DPA18 used to process data and how Common Law Duty of Confidence is addressed; who can access; method of processing and access; type of output. [At this stage the Data Processing Agreement is dealt with; most privacy notices will be too, unless new datasets come on board whose data providers need their privacy notices updated.]

Approval and Prioritisation - by our governance process (currently in development; PI Ben and Liam; will also have NHSE involvement)

IS STUDY IN SCOPE for OpenSAFELY?

  1. Does the study fit the overall PURPOSE aligned with the DPIA?
     1a. Legal bases and setting aside of common law are covered here, but it is possible these will differ or change over time (e.g. for common law: S.251; COPI; not required if data sufficiently anonymous), and some datasets we link to in future may be based on consent to share (e.g. PHR data).
     1b. Was another panel or body involved in providing approval to share data (e.g. NHS Digital DAR/IGARD; CAG)? Reference?
  2. Is there ethics approval? Body? Reference?
  3. Are data sharing agreements in place for the dataset? With whom (e.g. NHS Digital; ONS; etc.)
  4. Does dataset support study definition requirements?
  5. Are privacy notices adequate or need updating?
  6. Who will conduct the study operationally (incumbent team members or new ones to onboard)?
     6b. If new team members: ensure they hold signed/dated honorary contracts with NHS England and have signed the Data Access Agreements.
  7. Have researchers signed the OpenSAFELY code of conduct (to be discussed/determined), e.g. open sharing of data; study protocols; codelists; approach to press release?

DECISION CHECKPOINT: approval and then prioritisation (requires a framework)

  1. this will weigh up the backgrounds (organisations) of the requestors; perceived merit of the research; cost? etc.

Initiation of Research

  1. Start date and establish access mechanisms (VPNs; passwords; 2FA) as required
  2. Consider an end date for access, depending on the number of studies the researcher is involved in.

During Research

  1. Consideration given to review of access audit logs

Pre-publication

  1. Review of data/paper by governance panel including Data Controller

sebbacon commented 4 years ago

Thanks Amir. For our currently-approved studies, could you have a go at listing the relevant approvals, references and perhaps URLs when you get a chance?

My straw man would start like this:

governance:
  - panel: SAGE
    reference: "2020-03-01"
    citation: "https://www.gov.uk/government/groups/scientific-advisory-group-for-emergencies-sage-coronavirus-covid-19-response/minutes/"
  - panel: Institutional Ethics
    reference: "xyz123"
    citation: "https://ox.ac.uk/blah"
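
A small check along these lines could flag incomplete entries once the list is filled in. This is a sketch under assumptions: the three required keys are inferred from the example entries only, the input is the already-parsed `governance` list, and `governance_problems` is a hypothetical helper name.

```python
# Assumed from the example entries; which keys are really mandatory is undecided.
REQUIRED_KEYS = ("panel", "reference", "citation")


def governance_problems(entries):
    """Return human-readable problems found in a parsed `governance` list."""
    problems = []
    for index, entry in enumerate(entries):
        for key in REQUIRED_KEYS:
            value = entry.get(key)
            # Treat absent, non-string and whitespace-only values alike.
            if not isinstance(value, str) or not value.strip():
                problems.append(f"governance[{index}]: missing or empty '{key}'")
    return problems


example = [
    {"panel": "SAGE", "reference": "2020-03-01", "citation": "https://www.gov.uk/"},
    {"panel": "Institutional Ethics", "citation": "https://ox.ac.uk/blah"},
]
for problem in governance_problems(example):
    print(problem)
```

The second entry is reported because it has no `reference`; an empty result means every entry carries all three keys.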
evansd commented 4 years ago

I think broadly speaking all the metadata stuff looks sensible. The only thing I'd add would be a top-level format_version: 1.0 or something like that. Just so we can safely make changes to the config format and still interpret (and even programmatically rewrite) older config files.
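
As a rough sketch of what that version check might look like when a config is loaded (the supported-version set and the function name are hypothetical, and the config is assumed to be parsed into a dict already):

```python
SUPPORTED_MAJOR_VERSIONS = {1}  # hypothetical: the format versions this tool understands


def config_major_version(config):
    """Return the major format version of a parsed config, or raise ValueError."""
    raw = config.get("format_version")
    if raw is None:
        raise ValueError("config has no top-level format_version")
    # YAML parses `format_version: 1.0` as a float, so normalise via str first.
    major = int(str(raw).split(".")[0])
    if major not in SUPPORTED_MAJOR_VERSIONS:
        raise ValueError(f"unsupported format_version: {raw!r}")
    return major
```

A loader could branch on the returned major version to pick the right parser, or to rewrite an older file into the current format.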

For actions there are various bikesheddy changes I'd make but all up for discussion. In general, where we can mirror the syntax of GitHub Actions I think we might as well do that. The changes are:

  1. Make actions a dict with the top-level keys as job ids, rather than a list where each member has an id attribute. This is what GA does and it's slightly tidier I think.

  2. Specify the version of the study definition file inside the study definition itself, rather than in the actions.

  3. Rather than runner, operation and args, specify a run argument as a shell invocation. I think it's really helpful to have something that can potentially be used verbatim to run locally.

  4. Follow GA in using runs-on to specify the Docker image used. This would obviously be one from a list of pre-built images that we supply. The default would be an opensafely-base image which contains the cohortextractor and probably Python and some other bits and bobs.

Rewritten example would look like:

actions:
  generate_asthma_cohort:
    name: Asthma
    run: cohortextractor extract analysis/asthma_study_definition.yaml
    outputs:
      - output/asthma_cohort.csv

  generate_copd_cohort:
    name: COPD
    run: cohortextractor extract analysis/copd_study_definition.yaml
    outputs:
      - output/copd_cohort.csv

  postprocess:
    name: Postprocess files
    needs: [generate_asthma_cohort, generate_copd_cohort]
    runs-on: opensafely-stata-16
    run: stata analysis/postprocess.do
    outputs:
      - path: intermediate.dta
        for_publication: false
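
Change 1 (a dict keyed by job id instead of a list whose members carry an `id`) is the kind of rewrite that could be done programmatically. A minimal sketch, assuming the actions are already parsed into Python structures; `actions_list_to_dict` is a hypothetical name:

```python
def actions_list_to_dict(actions):
    """Convert [{'id': 'x', ...}, ...] into {'x': {...}, ...}, keeping order."""
    converted = {}
    for action in actions:
        action = dict(action)  # copy, so the caller's data is left untouched
        action_id = action.pop("id", None)
        if action_id is None:
            raise ValueError(f"action {action.get('name')!r} has no id")
        if action_id in converted:
            raise ValueError(f"duplicate action id: {action_id!r}")
        converted[action_id] = action
    return converted


old_style = [
    {"name": "Asthma", "id": "asthma_study_definition"},
    {"name": "Postprocess files", "id": "postprocess", "needs": ["asthma_study_definition"]},
]
new_style = actions_list_to_dict(old_style)
```

One side effect of the dict form is that an action without an id, or two actions sharing one, can no longer be expressed, so the conversion has to reject those cases explicitly.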
sebbacon commented 4 years ago

Thanks! This is really useful

Specify the version of the study definition file inside the study definition itself, rather than in the actions.

This makes sense for the cohortextractor tool, but not so much for arbitrary python or Stata scripts?

evansd commented 4 years ago

I think there are two ways we could do this. One is to have the version in the Docker image name (e.g. opensafely-stata-16), which would be set by the runs-on attribute. The other would be to specify it as part of the run attribute, e.g. python3.7 my_script.py, and build multiple versions of Python into a single Docker image.
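
Either option fits the same translation from an action to a Docker invocation. The sketch below is an assumption about how that could look, not the job-runner's real behaviour: the `/workspace` working directory and the function name are hypothetical, while `opensafely-base` and `opensafely-stata-16` are the image names used in this thread.

```python
import shlex

DEFAULT_IMAGE = "opensafely-base"  # the default image name suggested in this thread


def docker_command(action, workdir="/workspace"):
    """Build an argv list for `docker run` from an action's runs-on and run keys."""
    image = action.get("runs-on", DEFAULT_IMAGE)
    # shlex.split keeps the `run` string usable verbatim as a local shell command.
    return ["docker", "run", "--rm", "--workdir", workdir, image, *shlex.split(action["run"])]


# Option one: the version lives in the image name.
stata = docker_command({"runs-on": "opensafely-stata-16", "run": "stata analysis/postprocess.do"})

# Option two: the version lives in the command, inside a shared image.
python = docker_command({"run": "python3.7 my_script.py"})
```

Returning an argv list rather than a single string avoids a second round of shell quoting when the command is handed to something like `subprocess.run`.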

amirmehrkar commented 4 years ago

My straw man would start like this:

Example for risk factors study. I suspect we should ask the authors of the study protocols to fill this out. I could do for ICS once we think this works (and obviously evolve it with time if needed).

governance:
  - data controller: NHS England 
  - data providers and datasets: NHS England (CPNS); GP surgeries supplied by TPP (GP data); ONS (ONS death) //do you want this broken down?
  - data processor: TPP //(and / or other future EHRs vendors)
  - data sharing agreement (between NHSE and other): ONS (ONS Death)
  - legal basis (GDPR/DPA 2018): Article 6(1)(e); Article 9(2)(h), Article 9(2)(i), Article 9(2)(j)
  - common law duty of confidentiality:  set aside under section 251 NHS Act 2006, regulation 3(4) of the Health Service (Control of Patient Information) Regulations 2002
  - DPIA citation: done - not available
    citation: https://www.gov.uk/government/publications/coronavirus-covid-19-notification-of-data-controllers-to-share-information
  - other approval panels: N/A //e.g. DARS/IGARD; CAG; National Data Guardian; ICO.
  - oversight: OpenSAFELY oversight board; day to day approval process (NHSE and Datalab)
  - study requestors: N/A // e.g. SAGE, NHSE, PHE, other research groups, etc.
  - study reference ID: 20200301OS1 //e.g. [Date received + requester Org ID + number] (number if multiple from same requestors on same day; here OS = OpenSAFELY advisory group)
    citation: link to study protocol [here GitHub]
  - researchers: University of Oxford DataLab / London School of Hygiene and Tropical Medicine EHR Group
  - contract type: NHS England (honorary); Data Access Agreement //could have option of "None - delivered by OS contract holders"
  - code of conduct: N/A //not done this yet [option ideally, "signed"]
  - ethics panel: London School of Hygiene and Tropical Medicine
    reference: (21863)
    citation: not available
  - ethics panel: Health Research Authority
    reference: 20/LO/0651
    citation: https://drive.google.com/file/d/1fDYWJuglIbqE_LFN63wyBd9G1HBYf6ck/view //ideally put on website if no external URL
  - privacy notice citation: https://www.england.nhs.uk/contact-us/privacy-notice/how-we-use-your-information/covid-19-response/coronavirus-covid-19-research-platform/
  - data retention period (pseudonymised): Sept 30 2020 - this is a property of COPI in our case and got extended
  - data retention period (de-identified): 2 years

sebbacon commented 4 years ago

Looks great, some quick thoughts:

amirmehrkar commented 4 years ago

  1. Yes, Study Reference ID for us to track at OS level in anticipation of many coming through a more formal tracking process

  2. Yes we could publish the DPIA as a citation - the privacy statement is a must however. The DPIA would possibly at times require some redactions for certain risks we want to withhold or commercial sensitivities. We are likely to find most of the content in a good privacy notice and good security section but it is a good summary. Let's put that in as a heading - I'll amend.

  3. With whom (i.e. did we) have a data sharing agreement: data providers often want a data sharing agreement as a governance step, so for the RF study, where ONS was the only external data provider, I've recorded our data sharing agreement with them.

sebbacon commented 3 years ago

WIP notes

governance:

WHICH ARE REQUIRED?

sebbacon commented 3 years ago

Oversight gov, project gov, dataset gov

  1. Is asking and checking all the other properties of the project
  2. Is about ethics etc.
  3. Is about DSA etc.

sebbacon commented 3 years ago

@ghickman just came across this issue which is relevant to your thinking about onboarding data models.

We've discussed in passing before: fundamentally it implies there will be a link between user -> DSA -> permitted underlying tables.

Currently every user has one DSA, which is with NHSE, who give permission to all underlying tables, but this will change.
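
One way to picture the user -> DSA -> permitted tables link is the sketch below. It is purely illustrative: the class names, the table names and the `can_query` helper are hypothetical and not an existing OpenSAFELY data model.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DataSharingAgreement:
    provider: str                # e.g. "NHSE"
    permitted_tables: frozenset  # tables this DSA gives access to


@dataclass
class User:
    name: str
    agreements: list = field(default_factory=list)


def permitted_tables(user):
    """Union of the tables permitted by every DSA the user holds."""
    tables = set()
    for dsa in user.agreements:
        tables |= dsa.permitted_tables
    return tables


def can_query(user, table):
    return table in permitted_tables(user)


# Today's situation: a single DSA with NHSE (table names here are made up).
nhse = DataSharingAgreement("NHSE", frozenset({"gp_records", "ons_deaths"}))
researcher = User("example researcher", [nhse])
```

If users later hold several DSAs, only the `agreements` list changes; the permission check stays the same.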