Add IG permissions to project.yaml

sebbacon commented 4 years ago

Example IG approvals:

    # relevant IG approvals for the project; delete sections as not applicable (values as examples only)
    governance_process={
        {
            panel:     "SAGE",
            reference: "2020-04-01",
            citation:  "https://www.gov.uk/government/groups/scientific-advisory-group-for-emergencies-sage-coronavirus-covid-19-response/minutes/",
        },
        {
            panel:     "Institutional Ethics",
            reference: "UoX-1234-56",
            citation:  "https://www.foo.ac.uk/somefile.pdf",
        },
        {
            panel:     "NHS Digital / IGARD",
            reference: "DARS-123-4567-8900-v1.1",
            citation:  "https://digital.nhs.uk/binaries/content/assets/website-assets/corporate-information/corporate-information-and-documents/igard/igard-minutes---28th-may-2020-final.pdf.pdf",
        },
        {
            panel:     "HRA / CAG",
            reference: "99/CAG/1234",
            patient_dissent_override: "yes: CAG approval",
            citation:  "https://www.hra.nhs.uk/documents/1715/CAG_Meeting_-_07_February_2019.pdf",
        },
    }

evansd commented 4 years ago

I guess one question is whether this all goes into the study definition itself. Given that we've already hit cases where we want a single repo but multiple study definitions I suspect the answer is no. But it would be nice to have some conceptually neat answer as to what belongs in the study definition and what belongs in the config.

sebbacon commented 4 years ago

Here is a straw man:


name: Factors associated with Covid-19 in hospital admissions in England

study:
  # In the following actions, `id` is used like a Makefile target,
  # i.e. if it exists the step is skipped.
  #
  # Arguments provided from elsewhere in the framework
  actions:
    - name: Asthma
      id: asthma_study_definition
      runner:
        operation: generate_cohort
        version: 0.12.6

    - name: COPD
      id: copd_study_definition
      runner:
        operation: generate_cohort
        version: 0.12.6

    - name: Postprocess files
      id: postprocess
      needs: [asthma_study_definition, copd_study_definition]
      runner:
        operation: process-stata
        version: 0.3
        args: analysis/postprocess.do
      outputs:
        - path: intermediate.dta
          for_publication: false

    - name: Create crosstabs
      id: crosstabs
      needs: postprocess
      runner:
        operation: process-stata
        version: 0.3
      script: analysis/crosstabs.do
      outputs:
        - path: crosstabs.log
          for_publication: false
        - path: crosstabs_output.txt
          for_publication: true

    - name: Create regression
      needs: postprocess
      runner:
        operation: process-python
        version: 0.5
      script: analysis/regression.py
      outputs:
        - path: regression.log
          for_publication: false
        - path: regression_output.md
          for_publication: true
        - path: km_curve.png
          for_publication: true

# Relevant IG approvals for the project
governance:
  - panel: SAGE
    reference: "2020-03-01"
    citation: "https://www.gov.uk/government/groups/scientific-advisory-group-for-emergencies-sage-coronavirus-covid-19-response/minutes/"
  - panel: Institutional Ethics
    reference: "xyz123"
    citation: "https://ox.ac.uk/blah"

organisations:
  - name: University of Oxford
  - name: London School of Hygiene and Tropical Medicine

principal_investigator:
  - name: Ben Goldacre
    email: ben.goldacre@phc.ox.ac.uk

These settings would be used by the job-runner framework to construct and run docker commands (in combination with arguments passed via the REST job server), and would redirect output to predictable folders.

The outputs are used like Makefile targets, but also for tooling around (future) output egress functionality.
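
A minimal sketch of the Makefile-like behaviour described above, assuming an action is skipped only when every output it declares already exists under a known output folder. The function name `actions_to_run`, the list-of-dicts layout and the skip rule are assumptions for illustration, not the job-runner's actual API.

```python
import os
import tempfile


def actions_to_run(actions, output_dir):
    """Return ids of actions whose declared outputs are not all present.

    `actions` is a list of dicts shaped like the straw man above; the
    layout and the skip rule are assumptions made for this sketch.
    """
    pending = []
    for action in actions:
        paths = [output["path"] for output in action.get("outputs", [])]
        # An action with no declared outputs can never be skipped.
        done = bool(paths) and all(
            os.path.exists(os.path.join(output_dir, path)) for path in paths
        )
        if not done:
            pending.append(action["id"])
    return pending


actions = [
    {"id": "postprocess", "outputs": [{"path": "intermediate.dta"}]},
    {
        "id": "crosstabs",
        "outputs": [{"path": "crosstabs.log"}, {"path": "crosstabs_output.txt"}],
    },
]

with tempfile.TemporaryDirectory() as output_dir:
    # Pretend the postprocess step has already produced its output.
    open(os.path.join(output_dir, "intermediate.dta"), "w").close()
    pending = actions_to_run(actions, output_dir)
```

Here only `crosstabs` would be scheduled, because `intermediate.dta` is already present.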

amirmehrkar commented 4 years ago

Threads in my thinking:

Note: The DPIA will define: the legal bases under GDPR/DPA18 used to process data and how the Common Law Duty of Confidence is addressed; who can access; method of processing and access; type of output. [At this stage the Data Processing Agreement is dealt with; most privacy notices will be too, unless new datasets come on board whose data providers need their privacy notices updating.]

Approval and Prioritisation - by our governance process (currently in development; PI Ben and Liam; will also have NHSE involvement)

IS STUDY IN SCOPE for OpenSAFELY?

1. Does the study fit the overall PURPOSE aligned with the DPIA?
   1a. Legal bases and setting aside of common law here, but it might be possible these will differ or change over time (e.g. for common law: S.251; COPI; not required if data sufficiently anonymous), or some datasets we link to in future may be based on consent to share (e.g. PHR data).
   1b. Was another panel or body involved in providing approval to share data? e.g. NHS Digital DAR/IGARD; CAG? Reference?
2. Is there ethics approval? Body? Reference?
3. Are data sharing agreements in place for the dataset? With whom (e.g. NHS Digital; ONS; etc.)?
4. Does the dataset support the study definition requirements?
5. Are privacy notices adequate or do they need updating?
6. Who will conduct the study operationally (incumbent team members or new ones to onboard)?
   6b. If new team members: ensure they hold signed/dated honorary contracts with NHS England and have signed the Data Access Agreements.
7. Have researchers signed the OpenSAFELY code of conduct (to be discussed/determined), e.g. open sharing of data; study protocols; codelists; approach to press release?

DECISION CHECKPOINT: approval and then prioritisation (requires a framework)

This will weigh up the backgrounds (organisations) of the requestors; perceived merit of the research; cost? etc.

Initiation of Research

Set a start date and establish access mechanisms (VPNs; passwords; 2FA) as required.
Consider an end date for access, depending on the number of studies the researcher is involved in.

During Research

Consideration given to review of access audit logs

Pre-publication

Review of data/paper by governance panel including Data Controller

sebbacon commented 4 years ago

Thanks Amir. For our currently-approved studies, could you have a go at listing the relevant approvals, references and perhaps URLs when you get a chance?

My straw man would start like this:

governance:
  - panel: SAGE
    reference: "2020-03-01"
    citation: "https://www.gov.uk/government/groups/scientific-advisory-group-for-emergencies-sage-coronavirus-covid-19-response/minutes/"
  - panel: Institutional Ethics
    reference: "xyz123"
    citation: "https://ox.ac.uk/blah"

evansd commented 4 years ago

I think broadly speaking all the metadata stuff looks sensible. The only thing I'd add would be a top-level format_version: 1.0 or something like that. Just so we can safely make changes to the config format and still interpret (and even programmatically rewrite) older config files.
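
As a rough illustration of why a top-level `format_version` helps, here is a minimal sketch of a reader that refuses configs it does not understand. The set of supported versions and the function name `parse_format_version` are hypothetical. One caveat: an unquoted `1.10` in YAML is parsed as the float `1.1`, so quoting the value is probably safer.

```python
SUPPORTED_MAJOR_VERSIONS = {1}  # hypothetical: major versions this reader understands


def parse_format_version(config):
    """Return (major, minor) from a parsed config dict, rejecting unknown formats."""
    raw = str(config.get("format_version", ""))
    if not raw:
        raise ValueError("config is missing a top-level format_version")
    major_text, _, minor_text = raw.partition(".")
    major, minor = int(major_text), int(minor_text or 0)
    if major not in SUPPORTED_MAJOR_VERSIONS:
        raise ValueError(f"unsupported format_version: {raw}")
    return major, minor


version = parse_format_version({"format_version": "1.0", "actions": {}})
```

A tool that rewrites older files could branch on the returned tuple before touching anything else in the config.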

For actions there are various bikesheddy changes I'd make but all up for discussion. In general, where we can mirror the syntax of GitHub Actions I think we might as well do that. The changes are:

Make actions a dict with the top-level keys as job ids, rather than a list where each member has an id attribute. This is what GA does and it's slightly tidier I think.
Specify the version of the study definition file inside the study definition itself, rather than in the actions.
Rather than runner, operation and args, specify a run argument as a shell invocation. I think it's really helpful to have something that can potentially be used verbatim to run locally.
Follow GA in using runs-on to specify the Docker image used. This would obviously be one from a list of pre-built images that we supply. The default would be an opensafely-base image which contains the cohortextractor and probably Python and some other bits and bobs.

Rewritten example would look like:

actions:
  generate_asthma_cohort:
    name: Asthma
    run: cohortextractor extract analysis/asthma_study_definition.yaml
    outputs:
      - output/asthma_cohort.csv

  generate_copd_cohort:
    name: COPD
    run: cohortextractor extract analysis/copd_study_definition.yaml
    outputs:
      - output/copd_cohort.csv

  postprocess:
    name: Postprocess files
    needs: [generate_asthma_cohort, generate_copd_cohort]
    runs-on: opensafely-stata-16
    run: stata analysis/postprocess.do
    outputs:
      - path: intermediate.dta
        for_publication: false
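
With actions keyed by id, checking that every `needs` entry points at a defined action and working out a run order is fairly direct. The sketch below assumes the config has already been parsed into a dict and uses the standard library `graphlib` module (Python 3.9+); `execution_order` is a hypothetical helper, not part of any existing tool.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def execution_order(actions):
    """Check each `needs` entry names a defined action, then order the actions."""
    graph = {}
    for action_id, action in actions.items():
        needs = action.get("needs", [])
        unknown = [name for name in needs if name not in actions]
        if unknown:
            raise ValueError(f"{action_id} needs undefined actions: {unknown}")
        graph[action_id] = set(needs)
    # static_order() yields dependencies before the actions that need them.
    return list(TopologicalSorter(graph).static_order())


actions = {
    "generate_asthma_cohort": {
        "run": "cohortextractor extract analysis/asthma_study_definition.yaml",
    },
    "generate_copd_cohort": {
        "run": "cohortextractor extract analysis/copd_study_definition.yaml",
    },
    "postprocess": {
        "needs": ["generate_asthma_cohort", "generate_copd_cohort"],
        "run": "stata analysis/postprocess.do",
    },
}

order = execution_order(actions)
```

A `needs` entry that names an id not present in `actions` raises a `ValueError` rather than being silently ignored.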

sebbacon commented 4 years ago

Thanks! This is really useful

> Specify the version of the study definition file inside the study definition itself, rather than in the actions.

This makes sense for the cohortextractor tool, but not so much for arbitrary Python or Stata scripts?

evansd commented 4 years ago

I think there are two ways we could do this. One is to have the version in the Docker image name (e.g. opensafely-stata-16), which would be set by the runs-on attribute. The other would be to specify it as part of the run attribute, e.g. python3.7 my_script.py, and build multiple versions of Python into a single Docker image.
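
To make the first option concrete, here is a minimal sketch of turning one action into a `docker run` invocation, with the image taken from `runs-on` and the shell string from `run`. The helper `docker_command` is hypothetical, and a real runner would presumably also need volume mounts and other flags that are left out here.

```python
import shlex

DEFAULT_IMAGE = "opensafely-base"  # default image name suggested earlier in the thread


def docker_command(action):
    """Build an argv list for `docker run` from a single action dict."""
    image = action.get("runs-on", DEFAULT_IMAGE)
    # shlex.split keeps quoted arguments in the `run` string together.
    return ["docker", "run", "--rm", image] + shlex.split(action["run"])


command = docker_command(
    {"runs-on": "opensafely-stata-16", "run": "stata analysis/postprocess.do"}
)
```

Under the second option the image would stay fixed and the version would live entirely in the `run` string, so the same helper would work unchanged.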

amirmehrkar commented 4 years ago

> My straw man would start like this:

Example for the risk factors study. I suspect we should ask the authors of the study protocols to fill this out. I could do this for ICS once we think it works (and obviously evolve it with time if needed).

governance:
  - data controller: NHS England
  - data providers and datasets: NHS England (CPNS); GP surgeries supplied by TPP (GP data); ONS (ONS death) //do you want this broken down?
  - data processor: TPP //(and / or other future EHR vendors)
  - data sharing agreement (between NHSE and other): ONS (ONS Death)
  - legal basis (GDPR/DPA 2018): Article 6(1)(e); Article 9(2)(h), Article 9(2)(i), Article 9(2)(j)
  - common law duty of confidentiality: set aside under section 251 NHS Act 2006, regulation 3(4) of the Health Service (Control of Patient Information) Regulations 2002
  - DPIA citation: done - not available
    citation: https://www.gov.uk/government/publications/coronavirus-covid-19-notification-of-data-controllers-to-share-information
  - other approval panels: N/A //e.g. DARS/IGARD; CAG; National Data Guardian; ICO.
  - oversight: OpenSAFELY oversight board; day to day approval process (NHSE and Datalab)
  - study requestors: N/A // e.g. SAGE, NHSE, PHE, other research groups, etc.
  - study reference ID: 20200301OS1 //e.g. [Date received + requester Org ID + number], number if multiple from same requestors on same day; here OS = OpenSAFELY advisory group
    citation: link to study protocol [here GitHub]
  - researchers: University of Oxford DataLab / London School of Hygiene and Tropical Medicine EHR Group
  - contract type: NHS England (honorary); Data Access Agreement //could have option of "None - delivered by OS contract holders"
  - code of conduct: N/A //not done this yet [option ideally, "signed"]
  - ethics panel: London School of Hygiene and Tropical Medicine
    reference: (21863)
    citation: not available
  - ethics panel: Health Research Authority
    reference: 20/LO/0651
    citation: https://drive.google.com/file/d/1fDYWJuglIbqE_LFN63wyBd9G1HBYf6ck/view //ideally put on website if no external URL
  - privacy notice citation: https://www.england.nhs.uk/contact-us/privacy-notice/how-we-use-your-information/covid-19-response/coronavirus-covid-19-research-platform/
  - data retention period (pseudonymised): Sept 30 2020 - this is a property of COPI in our case and got extended
  - data retention period (de-identified): 2 years

sebbacon commented 4 years ago

Looks great, some quick thoughts:

Whose is the study reference ID? That's one we've made up right?
Would we provide DPIAs or are these typically secret for whatever reason?
What does " - data sharing agreement: ONS" mean?

amirmehrkar commented 4 years ago

Yes, the Study Reference ID is for us to track studies at OS level, in anticipation of many coming through a more formal tracking process.
Yes, we could publish the DPIA as a citation - the privacy statement is a must, however. The DPIA would possibly at times require some redactions for certain risks we want to withhold or for commercial sensitivities. We are likely to find most of the content in a good privacy notice and good security section, but it is a good summary. Let's put that in as a heading - I'll amend.
It means with whom we have (i.e. did we have) a data sharing agreement. Data providers often want a data sharing agreement as a governance step, so in the RF study, as ONS was the only external data provider we used, I've included us having the data sharing agreement with them.

sebbacon commented 3 years ago

WIP notes

governance:

"OUR" data controller: NHS England (always the same). In the future there might be more than one data controller. Property of the project.
data providers and datasets: sources (ONS, etc)
data processor: always TPP
data sharing agreement: we have no agreement with GPs. Every dataset that comes in should have a DSA, with exception of GPs.
every dataset has to have an originating controller
every DSA must be between "OUR" controller and another controller. So for ONS death: data controller is ONS, ISARIC it's university of oxford, etc.
legal basis for the dataset (GDPR/DPA 2018): Article 6(1)(e); Article 9(2)(h), Article 9(2)(i), Article 9(2)(j). Two needed: "personal" (art 6) and "sensitive" (art 9); happens to be the same for all datasets at the moment.
common law duty of confidentiality: property of a dataset. Set aside by COPI regulation 3 under section 251 NHS Act 2006 (regulation 3(4) of the Health Service (Control of Patient Information) Regulations 2002). This is important. When you share data there are 2 aspects - this is the case-law version, about demonstrable harm. There is another power which can be used, which is CAG approval.
DPIA citation: done - not shareable. A document link. No requirement to share it. A property of the entire project but could apply at the dataset level if you wanted. In our case, of the OpenSAFELY project.
citation: https://www.gov.uk/government/publications/coronavirus-covid-19-notification-of-data-controllers-to-share-information
other approval panels: N/A //e.g. DARS/IGARD; CAG; National Data Guardian; ICO. This is not required, just for info; for example, part of getting a DSA might include going through particular panels.
Platform oversight: board includes NHSE and checks and approves individual project. Would be some kind of approval document that we will grant a project. not legal requirement. good practice. OpenSAFELY advisory group. Property of the project.
study requestors: N/A // e.g. SAGE, NHSE, PHE, other research groups, etc. Just for info. Reason why we're doing it etc.
study reference ID: 20200301OS1 //e.g. [Date received + requester Org ID + number], number if multiple from same requestors on same day; here OS = OpenSAFELY advisory group. Link to an official registry entry for protocol or similar AND a reference we generate?
citation: link to study protocol [here GitHub]
researchers: University of Oxford DataLab / London School of Hygiene and Tropical Medicine EHR Group
contract type: NHS England (honorary); Data Access Agreement //could have option of "None - delivered by OS contract holders". Property of the project. At the moment they're all NHSE or None. in the future it may be that they sign a DSA agreement with NHSE that makes them a data controller in common with NHSE. So this is kind of "who gets sued" - MISS OUT FOR NOW
code of conduct: N/A //not done this yet [option ideally, "signed"]
ethics panel: London School of Hygiene and Tropical Medicine - per project. We've used the same ethics approval for all OpenSAFELY so far. The following two for everything. reference: (21863) citation: not available
ethics panel: Health Research Authority reference: 20/LO/0651 citation: https://drive.google.com/file/d/1fDYWJuglIbqE_LFN63wyBd9G1HBYf6ck/view //ideally put on website if no external URL
privacy notice citation: REQUIRED in GDPR, per-data-controller https://www.england.nhs.uk/contact-us/privacy-notice/how-we-use-your-information/covid-19-response/coronavirus-covid-19-research-platform/ and possibly per-dataset (not required, but good practice)
data retention period (pseudonymised): Sept 30 2020 property of the platform AND datasets/DSA. (this derived from the COPI notice)
data retention period (de-identified): 2 years property of the DSA

WHICH ARE REQUIRED?

sebbacon commented 3 years ago

Oversight gov, project gov, dataset gov

1) is asking and checking all the other properties of the project; 2) is about ethics etc.; 3) is about DSA etc.

sebbacon commented 3 years ago

@ghickman just came across this issue which is relevant to your thinking about onboarding data models.

We've discussed in passing before: fundamentally it implies there will be a link between user -> DSA -> permitted underlying tables.

Currently every user has one DSA, which is with NHSE, who give permission to all underlying tables, but this will change.
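
A minimal sketch of that user -> DSA -> permitted tables link, assuming simple in-memory mappings. All user, agreement and table names below are invented for illustration, and `permitted_tables` is a hypothetical helper.

```python
# Hypothetical data: every user, agreement and table name here is invented.
DSA_TABLES = {
    "nhse": {"gp_records", "ons_deaths"},
    "example_dsa": {"ons_deaths"},
}
USER_DSAS = {
    "alice": {"nhse"},
    "bob": {"example_dsa"},
}


def permitted_tables(user):
    """Return the union of tables covered by every DSA the user holds."""
    tables = set()
    for dsa in USER_DSAS.get(user, set()):
        tables |= DSA_TABLES.get(dsa, set())
    return tables


bob_tables = permitted_tables("bob")
```

The current situation corresponds to every user mapping to a single NHSE agreement that covers every table; the mapping only becomes interesting once users hold different agreements.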

opensafely-core / cohort-extractor

Add IG permissions to project.yaml #172