Open EddieLF opened 5 months ago
Thank you for writing this up! It would be good to have a bit more info around the scenarios where deletion and updates are required. Reading between the lines it sounds like there might be a couple of use-cases?
We could reduce the risk of deletions by performing soft deletes or replacing deletes with "archiving" in some cases, where archived records are excluded from everything by default but can be unarchived if needed. That would make it possible to roll out to more people safely. But it sounds like there is also a use-case where hard deletions are needed due to participants opting out or similar withdrawals. If that is the case then I think it might be better to consider those separate features as the needs would be quite different.
@dancoates some situations I have come across or can imagine coming across that involve data deletion:
Some situations where updates might be required:
Re "soft deletes" - I agree with using soft deletes for most cases. If we can archive metadata records such that they can't be accessed by analysts, then this effectively removes them from the database without committing to a hard delete. As you mention, in some cases hard deletes would be necessary, such as if consent were withdrawn and we became required to destroy the associated data.
Looks great.
I will note that even in the event of consent being withdrawn, soft deletes will be appropriate for many metadata types. Generally, we do not need to scrub all traces of a sample from our records; rather, we need to ensure the sample is not used for future analysis. We will probably want to hard delete some primary data and selected metadata fields, but the aim is generally not to erase all evidence of existence.
Nice work @EddieLF and @amyminiter this looks great : ) Having had to run the SQL queries for the RD team, it definitely comes across as a pain point for yous, so this would def be a good solution to implement. In particular, having sufficient logging and tracking of the deletes will be crucial so it is great that you have acknowledged this.
With system-versioned tables, we always sort of have a soft-delete.
I'm super fine with deletes, but we struggled with what it means when someone wants to delete a participant:
We've historically struggled to model this delete process, which is why it's been tricky to create one catch-all endpoint to run this.
A couple of dev notes here:
System-versioned tables are definitely good for peace of mind, but I don't think we can really say that we have soft deletes until we build in the features to allow viewing deleted items and un-deleting them through the API or UI. I think it could be quite tricky to track deleted relations?
Also agree that it's hard to define where the bounds of a delete are. I think we'd need to come up with a really solid and as comprehensive as possible list of use cases and define exactly what would be deleted in each of them. Hopefully some patterns would emerge from doing that exercise
Problem Statement
Data managers should have the ability to update and delete records from Metamist. Currently, there is limited ability to update records, and where there is, it's mostly confined to meta fields. There is no ability to delete records without accessing the Cloud VM running Metamist and executing SQL statements, which can only be done by certain software team members. This is an issue because the software team does not have full visibility over data management issues and requirements, which lie outside of their scope of responsibility anyway.
Solution Proposition
The development of a "Deletes" module for Metamist is proposed in this issue. A separate module for "Updates" is also discussed; however, this is not the primary target of this issue. An "Updates" module is likely to be closely related to a "Deletes" module in terms of the logic used and infrastructure required, so this may naturally extend the scope of this issue throughout its development and implementation.
Explicitly, the module proposed would consist of:
The module should take simple inputs, e.g. record IDs, and then handle the complex logic and dependency resolution behind the scenes, collecting records to delete, then actioning these deletes in the correct order, and logging the results transparently.
At first, the scope of this module should be limited to test datasets. The inherent dangers of releasing API endpoints that can delete records pose too great a risk to production data to allow the module to be used on production datasets, until such a time as the permissions structure is more strongly defined and tested.
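The "simple inputs in, ordered deletes out" behaviour described above could be sketched roughly as below. This is only an illustration: `FK_DELETE_ORDER` here is a hypothetical child-first ordering (the reverse of the FK_DEPS hierarchy discussed later), and `delete_fn` stands in for whatever actually issues the `DELETE` statements.

```python
from dataclasses import dataclass, field

# Hypothetical child-first delete order: link/child tables come before
# the tables they reference, so no delete violates an FK constraint.
FK_DELETE_ORDER = [
    "cohort_sequencing_group",
    "participant_phenotypes",
    "family_participant",
    "sample",
    "participant",
]

@dataclass
class DeletePlan:
    """Collects (table, id) targets, then actions them in FK-safe order."""

    targets: dict = field(default_factory=dict)  # table -> set of record ids
    log: list = field(default_factory=list)      # every delete actioned

    def collect(self, table: str, record_ids):
        """Queue records for deletion (dependency resolution would add to this)."""
        self.targets.setdefault(table, set()).update(record_ids)

    def execute(self, delete_fn):
        """Delete from child tables before parents, logging each delete."""
        for table in FK_DELETE_ORDER:
            for record_id in sorted(self.targets.get(table, ())):
                delete_fn(table, record_id)
                self.log.append((table, record_id))
        return self.log
```

A real endpoint would populate `collect` by querying the schema for all dependent records of the input IDs, and `delete_fn` would run parameterised `DELETE` statements; the retained `log` is what makes the deletes transparent and auditable.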
Discussion
Incongruities between data managers & software team
Data managers and other analysts currently have no autonomy to delete records in the database without direct intervention from members of the software team. However, within the software team there is understandably lower visibility of ad hoc data management issues originating in the analysis teams. This has led to a system where the knowledge of what needs to be deleted and the permissions to action the deletes are siloed between the teams in a very inefficient way.
Metamist assigns users roles which determine their access to datasets and their read/write permissions on the data within. There are currently no roles or modules that allow for data deletion via Metamist, and very few that allow for updating records. Where update modules do exist, they are generally confined to updating fields within the same table. For example:
- A `participant`'s `reported_sex` can be updated.
- `sample` `meta` fields can be upserted.
- A `participant`'s `family` cannot be updated.
- The `participant` that a `sample` belongs to cannot be updated.

The largest barrier to developing and implementing an updating/deleting module is the Metamist schema and data model. Metamist is built on a hierarchical ordering of tables based on the number of foreign key dependencies of each table. If changes to database records don't take into account the schema, or the order in which the tables are affected, the changes will not succeed. Additionally, the way the Metamist tables are set up is such that when two tables have a relationship, this is modelled by a third table that stores the primary keys of each table as foreign keys. For example, the `assay` table and the `sequencing_group` table link their records through the `sequencing_group_assay` table.

Foreign Key Dependencies in Metamist
From `db/connect.py`, we have the Metamist tables ordered by number of foreign key dependencies. The tables with the most dependencies are listed at the top, and those with the least are listed at the bottom.

When deleting records, you must ensure that records are deleted from the tables in "bottom-up" order. Attempting to delete records from a table higher up in the FK_DEPS order while there are still records in a table lower down the list will produce FK constraint errors.
For example, suppose we have records in the `cohort` and `cohort_sequencing_group` tables. If we attempt to delete from `cohort` where `id = 1`, we will be unable to, due to the FK dependency the `cohort_sequencing_group` records have on this record. So, we must first delete the records from the `cohort_sequencing_group` table, and only then can we delete the record from the `cohort` table.

Following this dependency logic, we could quite freely delete entries from `analysis_cohort` with little regard for any other records in the database. However, as in the example above, to delete records from `cohort` we must first handle the deletes in the tables below `cohort`, i.e. `analysis_cohort` and `cohort_sequencing_group`. Therefore, the delete module should begin with endpoints that allow us to delete records from the lower-order tables, like `cohort_sequencing_group`, `analysis_cohort`, `participant_phenotypes`, etc., before the logic is expanded to handle deletes from the tables with more foreign key dependencies.

Given the above, the logical way to scope this delete module is to iteratively create endpoints to delete records from tables, starting with the tables at the bottom of the FK_DEPS ordered list and working our way up.
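The `cohort` example above can be reproduced with a minimal in-memory schema. This uses SQLite purely for illustration (Metamist itself runs on MariaDB, where FK enforcement is on by default for InnoDB tables):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs FK enforcement enabled explicitly
conn.execute("CREATE TABLE cohort (id INTEGER PRIMARY KEY)")
conn.execute(
    "CREATE TABLE cohort_sequencing_group ("
    "  cohort_id INTEGER REFERENCES cohort(id),"
    "  sequencing_group_id INTEGER"
    ")"
)
conn.execute("INSERT INTO cohort VALUES (1)")
conn.execute("INSERT INTO cohort_sequencing_group VALUES (1, 100), (1, 101)")

# Deleting the parent first fails while child rows still reference it.
try:
    conn.execute("DELETE FROM cohort WHERE id = 1")
except sqlite3.IntegrityError as e:
    print("blocked:", e)  # FOREIGN KEY constraint failed

# Bottom-up order succeeds: link table first, then the cohort record.
conn.execute("DELETE FROM cohort_sequencing_group WHERE cohort_id = 1")
conn.execute("DELETE FROM cohort WHERE id = 1")
```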
Deleting from tables with multiple FK dependencies
To expand the module and allow for deletes in the tables with more FK dependencies, we want to collect and then delete records in a cascading & hierarchical manner. How could this be implemented?
For example, if we wanted to delete a participant and all their related data, we could design a query that collects all other records associated with that participant's ID. Then, the related records could be individually deleted from the tables in the order defined by the FK_DEPS list. Such a query might look like this:
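A sketch of what that collection query could look like. The join-column names here (`participant_id`, `sample_id`, etc.) are assumptions based on the link-table pattern described above, not the exact Metamist schema:

```python
# Collect the sample, sequencing group, and assay IDs linked to one participant.
# LEFT JOINs keep the participant's rows even where a branch has no records.
COLLECT_RELATED_RECORDS = """
    SELECT s.id  AS sample_id,
           sg.id AS sequencing_group_id,
           a.id  AS assay_id
    FROM participant p
    LEFT JOIN sample s ON s.participant_id = p.id
    LEFT JOIN sequencing_group sg ON sg.sample_id = s.id
    LEFT JOIN sequencing_group_assay sga ON sga.sequencing_group_id = sg.id
    LEFT JOIN assay a ON a.id = sga.assay_id
    WHERE p.id = :participant_id
"""
```

The IDs returned by this query become the inputs to the per-table deletes described next.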
From this SELECT, we've collected all samples, sequencing groups, and assays associated with the participant. Based on the FK_DEPS ordering, we would want to first delete from `cohort_sequencing_group` using the `sequencing_group_id` values returned by the above SELECT. Next, we can delete from `participant_phenotypes`, followed by `family_participant`, and so on, working our way up to the `participant` table. Once there, we can delete the records from the `participant` table, because all associated records in the other tables have been deleted.

Deleting Analysis Files
Another major consideration here is the deletion of analyses and files external to Metamist. The "Deletes" module should have the capability to log and delete files from the GCP buckets, as well as from Metamist. If we are instructed to delete all data for a given participant, we must ensure that any analyses created from this participant's data are also deleted. At present, this would exclude "joint analyses", solely due to the difficulty of removing specific individuals or samples from a jointly created analysis.
Metamist tracks most analysis files generated by the pipelines through `analysis` records. The Deletes module would need to collect analysis outputs and then delete them from cloud storage, but this alone is not sufficient, because not every file in GCP has been logged in Metamist. Thus, the module will also need to be able to scan through a project's buckets, collect the URIs of files which relate to the participants, samples, or sequencing groups being deleted, and queue those files for deletion.

For example: the `Align` pipeline stage produces CRAM files alongside a Metamist `analysis` record, which serves to associate the sequencing group with the CRAM file via the `analysis` `output` field. In contrast, the `Somalier` pipeline stage produces `.cram.somalier` files, but no associated analysis record is created for them. So, if we only search for files via `analysis` records in Metamist, we will miss files which exist only in cloud storage and not in Metamist, like the somalier files. This poses a challenge when ensuring all data is deleted for a given individual, sample, or sequencing group. Thankfully, the `.cram.somalier` file can be found at the same location as the CRAM, and it has the sequencing group ID in its filename, so it can easily be found and added to the list of files to delete. This is true for most files created by the production pipelines; however, the module should still perform an exhaustive search across all bucket paths to find any files that should be queued for deletion.