opensafely-core / cohort-extractor

Cohort extractor tool which can generate dummy data, or real data against OpenSAFELY-compliant research databases

Support TPP and EMIS backends in same study #465

Open inglesp opened 3 years ago

inglesp commented 3 years ago

This issue is about supporting both TPP and EMIS in a single study, and not (say) also supporting an ONS backend.

Status: brain dump.

Goals

  1. A study author should be able to specify whether their study runs against either or both backends
  2. A study author should be able to write a single study definition that runs on both backends
  3. It should be clear which backend the data came from
  4. There should be a way to combine non-disclosive data from both backends

1. A study author should be able to specify whether their study runs against either or both backends

Options:

i. This could be specified in project.yaml
ii. The backend could be specified in the job server
iii. Some combination of i and ii
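To make option i concrete, here is one way a `backends` key could look in project.yaml. Note that the `backends` key is entirely hypothetical — it is not an existing cohort-extractor feature, just a sketch of what the option might mean:

```yaml
# Hypothetical project.yaml fragment: the "backends" key does not exist
# today; it sketches option i (declaring target backends in the study repo).
version: "3.0"

backends:
  - tpp
  - emis

actions:
  generate_cohort:
    run: cohortextractor:latest generate_cohort --study-definition study_definition
    outputs:
      highly_sensitive:
        cohort: output/input.csv
```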

2. A study author should be able to write a single study definition that runs on both backends

There are two kinds of difference between TPP and EMIS backends:

a. Some data is only available in one backend or the other
b. Some data is available in very different ways in each backend

a. Some data is only available in one backend or the other

For instance, SUS data is not yet in EMIS (I think).

Options:

i. In a study definition that runs on both backends (perhaps indicated in project.yaml), users cannot use variables that depend on data that is only available in one backend
ii. When data is not available in a backend, the implementation of variables that use that data returns NULLs
iii. Study authors produce two study definitions, one for each backend

I favour i. It would require study authors who need backend-specific data to maintain a separate study definition for just that backend, and to be responsible for working out how to combine the cohort it generates with the rest of the data.

b. Some data is available in very different ways in each backend

The only example I can think of is vaccine data.

See #464.

3. It should be clear which backend the data came from

This is to provide some kind of audit trail for sense checking (that's probably a bigger discussion), as well as to allow study authors to handle data from different backends differently (which would be required for some of the options in 2a above).

Options:

i. Filenames could contain the name of the backend
ii. There could be a column in input.csv files indicating the backend
iii. Both?

My preference is for i. To do this we'd have to make the backend available to scripts somehow (an environment variable?) and support patterns in filenames in project.yaml (eg my_file_{backend}.csv)
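A minimal sketch of option i, assuming the backend name is exposed to scripts via an environment variable. The variable name `OPENSAFELY_BACKEND` and the helper function are assumptions for illustration, not an agreed interface:

```python
import os


def expand_backend_pattern(pattern, default="tpp"):
    """Expand a {backend} placeholder in a filename pattern using an
    environment variable. The variable name OPENSAFELY_BACKEND is an
    assumption for this sketch, not an agreed interface."""
    backend = os.environ.get("OPENSAFELY_BACKEND", default)
    return pattern.format(backend=backend)


# e.g. with OPENSAFELY_BACKEND=emis set in the job's environment:
os.environ["OPENSAFELY_BACKEND"] = "emis"
print(expand_backend_pattern("my_file_{backend}.csv"))  # my_file_emis.csv
```

The same expansion could be applied to output paths declared in project.yaml, so `my_file_{backend}.csv` resolves differently per run.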

4. There should be a way to combine non-disclosive data from both backends

Assuming that non-disclosive data is in files named eg xxx_{backend}.csv, and these files are checked into git, we could have a step that depends on both xxx_emis.csv and xxx_tpp.csv to produce xxx_all.csv. (Or xxx_emis_tpp.csv?) Once both the input files are available, the jobserver could start a job to run this step on either backend.
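The combine step above could be a plain script along these lines. The filenames follow the `xxx_{backend}.csv` convention from the comment; the extra `backend` column is one (assumed) way of keeping provenance visible in the combined file:

```python
import csv


def combine_backend_files(paths_by_backend, out_path):
    """Concatenate per-backend CSV files (e.g. xxx_tpp.csv, xxx_emis.csv)
    into one file, prepending a 'backend' column to record each row's
    origin. Assumes all inputs share the same header row."""
    writer = None
    with open(out_path, "w", newline="") as out:
        for backend, path in sorted(paths_by_backend.items()):
            with open(path, newline="") as f:
                reader = csv.DictReader(f)
                if writer is None:
                    writer = csv.DictWriter(
                        out, fieldnames=["backend"] + reader.fieldnames
                    )
                    writer.writeheader()
                for row in reader:
                    writer.writerow({"backend": backend, **row})
```

Usage would be e.g. `combine_backend_files({"tpp": "xxx_tpp.csv", "emis": "xxx_emis.csv"}, "xxx_all.csv")`, run as a project.yaml action that depends on both input files.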

sebbacon commented 3 years ago
  1. A study author should be able to specify whether their study runs against either or both backends

I think the MVP is just doing it via the job server. It requires the least extra code and would be good enough; I think the only use case we know about now is "I want to run it on both backends ultimately, but while I'm testing it I'll just run it on one". That's more suited to a runtime choice than a formally encoded one.

  2. A study author should be able to write a single study definition that runs on both backends

a. Some data is only available in one backend or the other

One option is to output nulls and log to the metadata (some kind of per-column "not implemented" token)

This has the advantage of a user potentially not having to commit any new code if/when a backend supports a particular column; just running again.
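A minimal sketch of that idea, assuming the extractor tracks a per-column status alongside the output. The `NOT_IMPLEMENTED` token and the function shape are invented for illustration, not cohort-extractor's real API:

```python
NOT_IMPLEMENTED = "not_implemented"


def extract_column(backend, implementations, patient_ids):
    """Return (values, status) for one study-definition column.

    If the current backend has no implementation for the column, emit
    NULLs (None) and flag the column in the metadata via a status token,
    rather than failing the whole job. Illustrative structure only.
    """
    impl = implementations.get(backend)
    if impl is None:
        return [None] * len(patient_ids), NOT_IMPLEMENTED
    return [impl(pid) for pid in patient_ids], "ok"
```

So a column backed only by TPP data (say, a SUS-derived variable) would yield real values on TPP and a NULL column plus a `not_implemented` entry in the run's metadata on EMIS; once EMIS support lands, rerunning fills the column with no study-code change.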

  3. It should be clear which backend the data came from

Outputs live in folders with metadata. #432 proposes to make per-input metadata available in the runtime. This would seem the simplest way?

  4. There should be a way to combine non-disclosive data from both backends

This is how I imagined it. The only challenge is picking a server to run it on. The first implementation should probably be "TPP" because that already has all the output-reviewing tools installed?

HelenCEBM commented 3 years ago

I think most of this is working in some way so could use an update? Perhaps some instructions for researchers on how to do this.

sebbacon commented 3 years ago

Yes. We're not going to invest any more time in new EMIS backend features just yet, so we should document the current state-of-the-art; I'd suggest in the Team Manual as this isn't yet suitable for public consumption.

sebbacon commented 3 years ago

Is this something you'd be able to draft (with blanks for bits you're not sure about) @HelenCEBM?

HelenCEBM commented 3 years ago

Draft here