Open inglesp opened 3 years ago
- A study author should be able to specify whether their study runs against either or both backends
I think the MVP is just doing it via the job server. It requires the least extra code and would be good enough; I think the only use case we know about now is "I want to run it on both backends ultimately, but while I'm testing it I'll just run it on one", which is more suited to a runtime choice than a formally encoded choice.
- A study author should be able to write a single study definition that runs on both backends
a. Some data is only available in one backend or the other
One option is to output nulls and log to the metadata (some kind of per-column "not implemented" token)
This has the advantage of a user potentially not having to commit any new code if/when a backend supports a particular column; just running again.
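To make the "output nulls and log to the metadata" idea concrete, here is a minimal sketch. The `SUPPORTED_COLUMNS` mapping, the column names, the `not_implemented` token and the `extract` function are all hypothetical, invented for illustration; none of them exist in the current codebase.

```python
import csv
import io

# Hypothetical: which columns each backend can currently populate.
SUPPORTED_COLUMNS = {
    "tpp": {"patient_id", "age", "sus_admission_date"},
    "emis": {"patient_id", "age"},
}

# Hypothetical per-column token recorded in the metadata.
NOT_IMPLEMENTED = "not_implemented"


def extract(backend, columns, rows):
    """Return (csv_text, metadata) for one backend.

    Columns the backend cannot populate are written as empty values,
    and flagged per column in the metadata.
    """
    supported = SUPPORTED_COLUMNS[backend]
    metadata = {
        "backend": backend,
        "columns": {
            col: "ok" if col in supported else NOT_IMPLEMENTED
            for col in columns
        },
    }
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow(
            {col: row.get(col, "") if col in supported else "" for col in columns}
        )
    return buf.getvalue(), metadata
```

With this shape, once a backend gains support for a column only the mapping changes, so a study author could re-run without committing new code.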
- It should be clear which backend the data came from
Outputs live in folders with metadata. #432 proposes to make per-input metadata available in the runtime. This would seem the simplest way?
- There should be a way to combine non-disclosive data from both backends
This is how I imagined it. The only challenge is picking a server to run it on. The first implementation should probably be "TPP" because that already has all the output-reviewing tools installed?
I think most of this is working in some way so could use an update? Perhaps some instructions for researchers on how to do this.
Yes. We're not going to invest any more time in new EMIS backend features just yet, so we should document the current state-of-the-art; I'd suggest in the Team Manual as this isn't yet suitable for public consumption.
Is this something you'd be able to draft (with blanks for bits you're not sure about) @HelenCEBM?
This issue is about supporting both TPP and EMIS in a single study, and not (say) also supporting an ONS backend.
Status: brain dump.
Goals
1. A study author should be able to specify whether their study runs against either or both backends
Options:
i. This could be specified in `project.yaml`
ii. The backend could be specified in the job server
iii. Some combination of i and ii
2. A study author should be able to write a single study definition that runs on both backends
There are two kinds of difference between TPP and EMIS backends:
a. Some data is only available in one backend or the other
b. Some data is available in very different ways in each backend
a. Some data is only available in one backend or the other
For instance, SUS data is not yet in EMIS (I think).
Options:
i. In a study definition that runs on both backends (perhaps indicated in `project.yaml`) users cannot use variables that use data that is only available in one backend
ii. When data is not available in a backend, the implementation of variables that use that data returns NULLs
iii. Study authors produce two study definitions, one for each backend
I favour i, which would require study authors to have a separate study definition for just one backend, and for them to be responsible for working out how to combine the cohort that this generates with the rest of the data.
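As a rough illustration of option i, a check like the following could reject a dual-backend study definition up front. The `BACKEND_DATA_SOURCES` mapping, the variable names and `check_study_definition` are hypothetical; the real list of data sources per backend would need to come from the backends themselves.

```python
# Hypothetical: data sources available in each backend.
BACKEND_DATA_SOURCES = {
    "tpp": {"primary_care", "sus"},
    "emis": {"primary_care"},
}


class UnsupportedVariableError(ValueError):
    """Raised when a variable needs data that a target backend lacks."""


def check_study_definition(variables, backends):
    """Validate that every variable can be computed on every target backend.

    `variables` maps a variable name to the data source it requires.
    """
    problems = {}
    for name, source in variables.items():
        missing = sorted(
            b for b in backends if source not in BACKEND_DATA_SOURCES[b]
        )
        if missing:
            problems[name] = missing
    if problems:
        details = "; ".join(
            f"{name} is unavailable on {', '.join(missing)}"
            for name, missing in sorted(problems.items())
        )
        raise UnsupportedVariableError(details)
```

A study that needs SUS data would then pass when targeting TPP alone and fail early, with a clear message, when targeting both backends.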
b. Some data is available in very different ways in each backend
The only example I can think of is vaccine data.
See #464.
3. It should be clear which backend the data came from
This is to provide some kind of audit trail for sense checking (that's probably a bigger discussion), as well as to allow study authors to handle data from different backends differently (which would be required for some of the options in 2a above).
Options:
i. Filenames could contain the name of the backend
ii. There could be a column in `input.csv` files indicating the backend
iii. Both?
My preference is for i. To do this we'd have to make the backend available to scripts somehow (an environment variable?) and support patterns in filenames in `project.yaml` (eg `my_file_{backend}.csv`)
4. There should be a way to combine non-disclosive data from both backends
Assuming that non-disclosive data is in files named eg `xxx_{backend}.csv`, and these files are checked into git, we could have a step that depends on both `xxx_emis.csv` and `xxx_tpp.csv` to produce `xxx_all.csv`. (Or `xxx_emis_tpp.csv`?) Once both the input files are available, the jobserver could start a job to run this step on either backend.
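A sketch of what such a combining step might look like, assuming the backend name reaches scripts through an environment variable. The variable name `BACKEND`, the functions `expand_pattern` and `combine`, and the choice to append a `backend` column to the combined output are all assumptions for illustration, not existing behaviour.

```python
import csv
import os
from pathlib import Path

BACKENDS = ("emis", "tpp")


def expand_pattern(pattern, backend=None):
    """Fill in `{backend}` in a filename pattern.

    Falls back to a hypothetical BACKEND environment variable.
    """
    backend = backend or os.environ["BACKEND"]
    return pattern.format(backend=backend)


def combine(pattern, directory, output_name):
    """Concatenate the per-backend files into one file, tagging each row."""
    directory = Path(directory)
    fieldnames = None
    combined = []
    for backend in BACKENDS:
        path = directory / expand_pattern(pattern, backend)
        with path.open(newline="") as f:
            reader = csv.DictReader(f)
            header = list(reader.fieldnames or [])
            if fieldnames is None:
                fieldnames = header
            elif header != fieldnames:
                raise ValueError(f"{path.name} has different columns: {header}")
            for row in reader:
                combined.append({**row, "backend": backend})
    out = directory / output_name
    with out.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames + ["backend"])
        writer.writeheader()
        writer.writerows(combined)
    return out
```

For example, `combine("xxx_{backend}.csv", "output", "xxx_all.csv")` would read `output/xxx_emis.csv` and `output/xxx_tpp.csv` and write `output/xxx_all.csv`. Keeping the backend as a column in the combined file preserves the audit trail from goal 3 after the filenames no longer carry it.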