populationgenomics / metamist

Sample level metadata system

Modelling deletes & updates in metamist #854

Open EddieLF opened 3 months ago

EddieLF commented 3 months ago

Problem Statement

Data managers should have the ability to update and delete records from Metamist. Currently, there is limited ability to update records, and where there is, it's mostly confined to meta fields. There is no ability to delete records without accessing the Cloud VM running Metamist and executing SQL statements, which can only be done by certain software team members. This is an issue because the software team does not have full visibility over data management issues and requirements, which lie outside of their scope of responsibility anyway.

Solution Proposition

The development of a "Deletes" module for Metamist is proposed in this issue. A separate module for "Updates" is also discussed; however, it is not the primary target of this issue. An "Updates" module is likely to be closely related to a "Deletes" module in terms of the logic and infrastructure required, so it may naturally extend the scope of this issue as development and implementation proceed.

Explicitly, the proposed module should take simple inputs, e.g. record IDs, and handle the complex logic and dependency resolution behind the scenes: collecting the records to delete, actioning those deletes in the correct order, and logging the results transparently.

At first, the scope of this module should be limited to test datasets. The inherent dangers of releasing API endpoints that can delete records pose too great a risk to production data to allow the module to be used on production datasets, at least until the permissions structure is more strongly defined and tested.
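As a very rough sketch of the kind of interface this implies (the function name, parameters, and result shape below are hypothetical, not existing metamist code):

from dataclasses import dataclass, field


@dataclass
class DeleteResult:
    # Transparent log of what a delete call removed: table name -> primary keys
    deleted: dict[str, list[int]] = field(default_factory=dict)


def delete_records(project: str, table: str, record_ids: list[int]) -> DeleteResult:
    """Hypothetical entry point: given simple inputs (a project, a table, and
    record IDs), collect the dependent rows behind the scenes, delete them in
    FK-safe order, and return a log of everything removed.
    Initially restricted to test datasets only."""
    raise NotImplementedError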

Discussion

Incongruities between data managers & software team

Data managers & other analysts currently have no autonomy to delete records in the database without direct intervention from members of the software team. However, within the software team, there is understandably less visibility and knowledge of the ad hoc data management issues originating in the analysis teams. This has led to a system where the knowledge of what needs to be deleted and the permissions to action those deletes are siloed between the teams in a very inefficient way.

Metamist assigns users roles which determine their access to datasets and their read/write permissions of the data within. There are currently no roles or modules that allow for data deletion via Metamist, and very few that allow for updating records. Where update modules do exist, they are generally confined to updating fields within the same table.

The largest barrier to developing and implementing an updating/deleting module lies within the Metamist schema and data model. Metamist is built on a hierarchical ordering of tables based on the number of foreign key dependencies of each table. If changes to database records don't take into account the schema, or the order in which the tables are affected, the changes will not succeed. Additionally, the Metamist tables are set up such that when two tables have a relationship, it is modelled by a third link table that stores the primary keys of both tables as foreign keys. For example, the assay table and the sequencing_group table link their records through the sequencing_group_assay table.
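To illustrate the link-table pattern, a minimal sketch of what sequencing_group_assay could look like; the column names follow the SELECT example further below, but the exact types and constraints are assumptions rather than a copy of the metamist schema:

# Illustrative DDL only: a link table holding the primary keys of both
# related tables as foreign keys.
SEQUENCING_GROUP_ASSAY_DDL = '''
CREATE TABLE sequencing_group_assay (
    sequencing_group_id INT NOT NULL,
    assay_id INT NOT NULL,
    FOREIGN KEY (sequencing_group_id) REFERENCES sequencing_group (id),
    FOREIGN KEY (assay_id) REFERENCES assay (id)
)
'''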

Foreign Key Dependencies in Metamist

From db/connect.py, we have the Metamist tables ordered by their foreign key dependencies. Tables near the top of the list are referenced by tables further down; the tables at the bottom reference others but are not referenced themselves.

TABLES_ORDERED_BY_FK_DEPS = [
    'project',
    'group',
    'analysis',
    'participant',
    'sample',
    'sequencing_group',
    'assay',
    'sequencing_group_assay',
    'analysis_sequencing_group',
    'analysis_sample',
    'assay_external_id',
    'participant_external_id',
    'sample_external_id',
    'sequencing_group_external_id',
    'family',
    'family_participant',
    'participant_phenotypes',
    'group_member',
    'cohort_template',
    'cohort',
    'cohort_sequencing_group',
    'analysis_cohort',
    'analysis_runner',
]

When deleting records, you must ensure that records are deleted from the tables in "bottom-up" order. Attempting to delete records from a table higher up in the FK_DEPS order while there are still dependent records in a table lower down in the list will result in FK constraint errors.

For example, we might have records in the cohort and cohort_sequencing_group tables:

cohort
    id
    1

cohort_sequencing_group
    cohort_id    sequencing_group_id
    1            1
    1            2

If we attempt to delete from cohort where id=1, it will fail, because the records in the cohort_sequencing_group table hold foreign key references to it. So, we must first delete the records from the cohort_sequencing_group table, and only then can we delete the record from the cohort table.

Following this dependency logic, we could quite freely delete entries from analysis_cohort with little regard for any other records in the database. However, as in the example above, if we want to delete records from cohort, we must first handle the deletes in the tables below cohort, i.e. analysis_cohort and cohort_sequencing_group. Therefore, the delete module should begin with endpoints that allow us to delete records from the lower-order tables, like cohort_sequencing_group, analysis_cohort, participant_phenotypes, etc., before the logic is expanded to handle deletes from the tables further up the list.

Given the above, the logical way to scope this delete module is to iteratively create endpoints to delete records from tables, starting with the tables at the bottom of the FK_DEPS ordered list, before working our way up the list.
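As a concrete sketch of that bottom-up ordering for the cohort example above, assuming an async connection in the style of the encode/databases library (the cohort_id column on analysis_cohort is assumed; transactions and error handling are omitted):

from databases import Database


async def delete_cohort(db: Database, cohort_id: int) -> None:
    # Delete from the dependent link tables first: analysis_cohort and
    # cohort_sequencing_group sit below cohort in TABLES_ORDERED_BY_FK_DEPS.
    await db.execute(
        'DELETE FROM analysis_cohort WHERE cohort_id = :id', {'id': cohort_id}
    )
    await db.execute(
        'DELETE FROM cohort_sequencing_group WHERE cohort_id = :id', {'id': cohort_id}
    )
    # Only then can the cohort row itself be deleted without FK constraint errors.
    await db.execute('DELETE FROM cohort WHERE id = :id', {'id': cohort_id})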

Deleting from tables with multiple FK dependencies

To expand the module and allow for deletes in the tables with more FK dependencies, we want to collect and then delete records in a cascading & hierarchical manner. How could this be implemented?

For example, if we wanted to delete a participant and all their related data, we could design a query that collects all other records associated with that participant's ID. The related records could then be individually deleted from the tables in the order defined by the FK_DEPS list. Such a query might look like this:

SELECT 
    p.project as current_project_id,
    p.id as participant_id,
    p.external_id as participant_external_id,
    s.id as sample_id,
    s.external_id as sample_external_id,
    sg.id as sequencing_group_id,
    sga.assay_id
FROM participant p 
JOIN sample s ON s.participant_id = p.id
JOIN sequencing_group sg ON sg.sample_id = s.id
JOIN sequencing_group_assay sga ON sga.sequencing_group_id = sg.id
WHERE p.id=1;

From this SELECT, we've collected all samples, sequencing groups, and assays associated with the participant. Based on the FK_DEPS ordering, we would first delete from cohort_sequencing_group using the sequencing_group_id values returned by the above SELECT. Next, we can delete from participant_phenotypes, followed by family_participant, and so on, working our way up the list. Finally, we can delete the records from the participant table itself, because all associated records in the other tables have been deleted.
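A rough sketch of how that cascade could be orchestrated, again assuming an encode/databases-style async connection. The set of tables handled here is deliberately incomplete: a real implementation would also need to clear the *_external_id tables, family_participant, participant_phenotypes, and the analysis link tables before their parent rows, per the FK_DEPS list.

from databases import Database


async def delete_participant_cascade(db: Database, participant_id: int) -> None:
    # Collect the dependent record IDs with the SELECT shown above.
    rows = await db.fetch_all(
        '''
        SELECT s.id AS sample_id, sg.id AS sequencing_group_id, sga.assay_id
        FROM participant p
        JOIN sample s ON s.participant_id = p.id
        JOIN sequencing_group sg ON sg.sample_id = s.id
        JOIN sequencing_group_assay sga ON sga.sequencing_group_id = sg.id
        WHERE p.id = :pid
        ''',
        {'pid': participant_id},
    )
    sample_ids = {r['sample_id'] for r in rows}
    sg_ids = {r['sequencing_group_id'] for r in rows}
    assay_ids = {r['assay_id'] for r in rows}

    async with db.transaction():
        # Bottom-up: link tables and dependent tables first ...
        await db.execute_many(
            'DELETE FROM cohort_sequencing_group WHERE sequencing_group_id = :id',
            [{'id': i} for i in sg_ids],
        )
        await db.execute_many(
            'DELETE FROM sequencing_group_assay WHERE sequencing_group_id = :id',
            [{'id': i} for i in sg_ids],
        )
        await db.execute_many('DELETE FROM assay WHERE id = :id', [{'id': i} for i in assay_ids])
        await db.execute_many('DELETE FROM sequencing_group WHERE id = :id', [{'id': i} for i in sg_ids])
        await db.execute_many('DELETE FROM sample WHERE id = :id', [{'id': i} for i in sample_ids])
        # ... and the participant row last.
        await db.execute('DELETE FROM participant WHERE id = :pid', {'pid': participant_id})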

Deleting Analysis Files

Another massive consideration here is the deletion of analyses and files external to Metamist. The "Deletes" module should have the capability to log and delete files from the GCP buckets, as well as from Metamist. If we are instructed to delete all data for a given participant, we must ensure that any analyses created from this participant’s data are also deleted. At present, this would exclude “joint analyses”, solely due to the difficulty in removing specific individuals or samples from some jointly created analysis.

Metamist tracks most analysis files generated by the pipelines through the analysis records. The Deletes module would need to collect analysis outputs and then delete them from cloud storage - but this alone is not sufficient. This is because not every single file in GCP has been logged in Metamist. Thus, the module will also need to be able to scan through a project's buckets, collect the URIs of files which relate to the participants, samples, or sequencing groups being deleted, and queue the files for deletion.

For example: the Align pipeline stage produces CRAM files alongside a Metamist analysis record, which serves to associate the sequencing group with the CRAM file via the analysis output field. In contrast, the Somalier pipeline stage produces .cram.somalier files, but no associated analysis record is created to track them. So, if we only search for files via analysis records in Metamist, we will miss files which only exist in cloud storage and not in Metamist, like the somalier file. This poses a challenge when ensuring all data is deleted for a given individual, sample, or sequencing group. Thankfully, the .cram.somalier file can be found at the same location as the CRAM, and it also has the sequencing group ID as its filename, so it could easily be found and added to the list of files to delete. This is true for most files created by the production pipelines; however, the module should still perform an exhaustive search across all bucket paths to find any files that should be queued for deletion.
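A minimal sketch of what that bucket sweep could look like, assuming the standard google-cloud-storage client; the bucket handling and the simple "ID appears in the path" matching rule are placeholders for whatever convention the module settles on:

from google.cloud import storage


def find_files_to_delete(bucket_name: str, sequencing_group_ids: set[str]) -> list[str]:
    # List every object in the project's bucket and queue any path that
    # mentions one of the sequencing group IDs being deleted. This catches
    # files like <sequencing_group_id>.cram.somalier that have no
    # corresponding analysis record in Metamist.
    client = storage.Client()
    to_delete: list[str] = []
    for blob in client.list_blobs(bucket_name):
        if any(sg_id in blob.name for sg_id in sequencing_group_ids):
            to_delete.append(f'gs://{bucket_name}/{blob.name}')
    return to_delete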

dancoates commented 3 months ago

Thank you for writing this up! It would be good to have a bit more info around the scenarios where deletion and updates are required. Reading between the lines it sounds like there might be a couple of use-cases?

We could reduce the risk of deletions by performing soft deletes or replacing deletes with "archiving" in some cases, where archived records are excluded from everything by default but can be unarchived if needed. That would make it possible to roll out to more people safely. But it sounds like there is also a use-case where hard deletions are needed due to participants opting out or similar withdrawals. If that is the case then I think it might be better to consider those separate features as the needs would be quite different.

EddieLF commented 3 months ago

@dancoates some situations I have come across or can imagine coming across that involve data deletion:

Some situations where updates might be required:

Re "soft deletes" - I agree with using soft deletes for most cases. If we can archive metadata records such that they can't be accessed by analysts, then this effectively removes them from the database without committing to a hard delete. As you mention, in some cases hard deletes would be necessary, such as if consent were withdrawn and we became required to destroy the associated data.

cassimons commented 3 months ago

Looks great.

I will note that even in the event of consent being withdrawn, soft deletes will be appropriate for many metadata types. Generally, we do not need to scrub all traces of a sample from our records; rather, we need to ensure the sample is not used for future analysis. We will probably want to hard delete some primary data and selected metadata fields, but the aim is generally not to erase all evidence of existence.

nevoodoo commented 3 months ago

Nice work @EddieLF and @amyminiter this looks great : ) Having had to run the SQL queries for the RD team, it definitely comes across as a pain point for yous, so this would def be a good solution to implement. In particular, having sufficient logging and tracking of the deletes will be crucial so it is great that you have acknowledged this.

illusional commented 3 months ago

With system-versioned tables, we always sort of have a soft-delete.

I'm super fine with deletes, but we struggled with what it means when someone wants to delete a participant:

We've historically struggled to model this delete process, which is why it's been tricky to create a single catch-all endpoint to run it.

A couple of dev notes here:

dancoates commented 3 months ago

System-versioned tables are definitely good for peace of mind, but I don't think we can really say that we have soft deletes until we build in the features to allow viewing deleted items and un-deleting them through the API or UI. I think it could be quite tricky to track deleted relations?

Also agree that it's hard to define where the bounds of a delete are. I think we'd need to come up with a solid, as-comprehensive-as-possible list of use cases and define exactly what would be deleted in each of them. Hopefully some patterns would emerge from doing that exercise.