Description of Issue

We have struggled over the past six months with creating and maintaining reliable test data for development purposes. This was especially highlighted during donor-submission-dashboard feature development.
Test data is important for several reasons:
Verifying bugs during in-progress feature development
Regression testing during releases
Covering the types of data we need to test against during development in the QA environment
Starting from a clean slate each time is an unrealistic testing method for these scenarios, since we will not be starting with a clean slate when we move into production. That's why we need test data that works and that we can manipulate. This has been working well for the clinical program in DASH-CA.
Factors that have come up
The RDPC API QA data is deleted on a rolling 2-week window, so the data in RDPC API QA can differ from day to day.
We have been using the RDPC API PROD in the QA environment to facilitate development and testing. There are environment details to discuss here.
When a change is made in RDPC API QA that affects the platform as a downstream service (namely the donor-dashboard-aggregator), we had to wait for the change to reach production to accurately test it. This is because RDPC API QA does not have reliable data to test with.
Developers port-forward to databases through Kubernetes (e.g. connecting to clinical). This is not a read-only operation and could result in destructive actions if a developer is not careful. There should be a safer mechanism for doing this.
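One concrete shape for the safer mechanism above is to make the forwarded connection read-only at the database level, so accidental destructive actions fail fast. A minimal sketch of the principle using sqlite (the real clinical database, driver, and names here are stand-ins, not the actual setup):

```python
import os
import sqlite3
import tempfile

# Stand-in for the clinical database (hypothetical: in practice you would
# port-forward through Kubernetes and use the real driver's read-only option).
path = os.path.join(tempfile.mkdtemp(), "clinical.db")
rw = sqlite3.connect(path)
rw.execute("CREATE TABLE donors (id TEXT, program TEXT)")
rw.execute("INSERT INTO donors VALUES ('DO1', 'TEST-QA')")
rw.commit()
rw.close()

# Open the same database read-only via sqlite's URI syntax:
# reads succeed, any write raises immediately.
ro = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
rows = ro.execute("SELECT id FROM donors").fetchall()

blocked = False
try:
    ro.execute("DELETE FROM donors")  # a destructive action
except sqlite3.OperationalError:
    blocked = True  # the read-only connection refused the write
```

Most production databases support the same idea through a read-only role or session setting, which would let developers keep port-forwarding without risking writes.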
A set of test datasets, sandboxed for QA use, will be created under two test projects.
The first, TEST-QA, will be for the whole team to use.
The second, ROSI-RU, will be a curated dataset for Rosi to use, potentially also for demos. This dataset should remain clean.
Provide analysis TSVs for Alex to run in Model T for both programs.
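Generating the per-program analysis TSVs with a small script, rather than by hand, would keep them consistent between TEST-QA and ROSI-RU. A sketch with hypothetical column names (the actual Model T schema should come from Alex):

```python
import csv
import io

# Hypothetical columns -- the real Model T analysis TSV schema
# will come from the pipeline's documentation, not this sketch.
rows = [
    {"program_id": "TEST-QA", "donor_id": "DO1", "analysis_type": "sequencing"},
    {"program_id": "ROSI-RU", "donor_id": "DO2", "analysis_type": "sequencing"},
]

buf = io.StringIO()
writer = csv.DictWriter(
    buf,
    fieldnames=["program_id", "donor_id", "analysis_type"],
    delimiter="\t",
)
writer.writeheader()
writer.writerows(rows)
tsv = buf.getvalue()  # in practice, write one file per program
```
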
Intermediate-song
We should not be importing data into QA, as is happening right now with qa-intermediate-song. Don't use prod data in QA. QA should be clean.
We need a less manual process for importing legacy data. The way it works right now is not a long-term solution.
----- check with Christina on scale of the imports
----- spend time to develop an automated method that Hardeep can use?
Proposed Solutions:
Discussion with Dusan/Alex/Jon, Feb 16, 2021
RDPC-Platform Test Data
-- https://github.com/icgc-argo/workflow-roadmap/issues/99 -- https://github.com/icgc-argo/workflow-roadmap/issues/100