overture-stack / SONG

Metadata management and automated validation system
https://www.overture.bio/products/song
GNU Affero General Public License v3.0

Epic - Separate Donor/Specimen/Sample model from Analysis model #863

Open joneubank opened 1 month ago

joneubank commented 1 month ago

Summary

Song should be usable for files that do not follow the Donor/Specimen/Sample model. The current Analysis data model requires these fields for every submitted analysis, forcing submitters whose data does not use them to fill them in anyway just to satisfy the software. Additionally, when multiple analyses are submitted for a single donor, the donor information must be repeated in each analysis. This duplicates data and invites input errors that leave one analysis inconsistent with another. Finally, Song's data model for Donors/Specimen/Samples is limited and cannot be customized to the data that different systems wish to collect.

Overture is moving towards separate services for tracking structured data (see Lectern and Lyric). These services provide the ability to track any data model for Donors, the registration of their Specimen and Samples, and any other clinical or phenotypic data that is relevant to a study. This frees up Song to focus on being the service that tracks Analysis metadata.

To connect the Analysis data with related structured data, a system can include a field in its Dynamic Schema that provides an ID linking the analysis to the data tracked in Lyric. This mapping becomes fully customizable through the Dynamic Schema definition. With this change, we will need to provide a mechanism for Song to query an external ID/data service and validate that the value provided in one of these fields is registered there.
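As a rough illustration of the linking field described above, here is a minimal Python sketch. It assumes a Dynamic Schema written in JSON Schema style; the property name `donorId`, the helper `get_link_value`, and the sample IDs are hypothetical and are not part of Song.

```python
from typing import Any

# Hypothetical Dynamic Schema fragment in JSON Schema style.
# "donorId" is an illustrative property name, not a Song field.
ANALYSIS_SCHEMA: dict[str, Any] = {
    "type": "object",
    "required": ["donorId"],
    "properties": {
        "donorId": {"type": "string"},
    },
}


def get_link_value(analysis: dict[str, Any], schema: dict[str, Any], field: str) -> str:
    """Return the linking ID held in `field`, or raise if it is absent or malformed."""
    if field not in schema.get("properties", {}):
        raise KeyError(f"schema does not define {field!r}")
    value = analysis.get(field)
    if value is None:
        raise ValueError(f"analysis is missing linking field {field!r}")
    if not isinstance(value, str):
        raise ValueError(f"linking field {field!r} must be a string")
    return value


# Two analyses for the same donor carry only the ID, not repeated donor details.
analysis_a = {"donorId": "DO-1", "files": ["a.bam"]}
analysis_b = {"donorId": "DO-1", "files": ["b.vcf"]}
```

Because only the ID is stored with each analysis, the donor record itself lives in one place, and the value returned here is what would be handed to the external registration check.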

Song as ID Service

One feature of Song that is lost with this change is the use of Song to generate system wide unique IDs for Donors, Specimen, and Samples. Song has previously had the option to work as an ID server, generating unique IDs for these entities the first time an analysis that referenced them was submitted. As these entities will no longer be a standard part of the Analysis model, there will be no mechanism for identifying Donors/Specimen/Samples, so they will no longer be tracked in Song's database and no Donor/Specimen/Sample IDs will be generated by Song.

Most cases where Song has been used do not require this feature; a separate ID service has been used to create system wide IDs. This is either because of a federated data management model that has multiple Song instances, or because clinical data records are being registered and tracked separately from the file data. Removing ID generation from Song will standardize this process for all future instances of Song.

Implementation Details

Updating the Data Model

To accomplish this change we need to remove these entities from several parts of the code base, and update some functionality of the Song Server.

Preserving Legacy Schemas

External Validation of Dynamic Schema Properties

In the current implementation, when Song is not used as an ID service, it can fetch system wide IDs for Donors/Specimen/Samples via an HTTP request to a configurable ID server. With these entities no longer being part of the coded data model, there is no longer a need to fetch IDs for them, because Song will not be recording them.

However, the external ID check was also used to ensure that entities had been registered with an ID system before Song accepted an Analysis registration. We want to re-implement this feature, but it will now need to be part of Dynamic Schemas.
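One possible shape for such a check is sketched below in Python. Everything in it is an assumption made for illustration: the rule format (schema property mapped to a URL template), the `{value}` placeholder, the example URL, and the convention that an HTTP 200 response means "registered". The HTTP call is injected as a function so the sketch does not depend on any particular client.

```python
from typing import Any, Callable
from urllib.parse import quote


class UnregisteredEntityError(Exception):
    """Raised when an external service does not recognize a linked ID."""


def build_check_url(template: str, value: str) -> str:
    """Fill the hypothetical {value} placeholder with a URL-encoded ID."""
    return template.replace("{value}", quote(value, safe=""))


def validate_external(
    analysis: dict[str, Any],
    rules: dict[str, str],
    http_status: Callable[[str], int],
) -> None:
    """Check each configured property against its external service.

    `rules` maps a schema property name to a URL template.
    `http_status` performs a GET and returns the status code.
    """
    for field, template in rules.items():
        value = analysis.get(field)
        if value is None:
            # Presence is left to ordinary schema validation.
            continue
        status = http_status(build_check_url(template, str(value)))
        if status != 200:
            raise UnregisteredEntityError(
                f"{field}={value!r} is not registered (HTTP {status})"
            )
```

A caller would run `validate_external` before accepting the analysis and reject the submission if `UnregisteredEntityError` is raised. Keeping the rules next to the schema definition would let each analysis type decide which properties, if any, need an external check.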

Additional context

The intention of this change is to simplify data management with Song, with two cases in particular:

  1. When Donor data is stored in another system, it is frequently duplicated into Song.
    1. This can cause replication errors where the data in Song and in the official clinical data source do not match. The inconsistency becomes a problem when researchers try to analyze the data.
    2. Duplicated data in two sources has to be carefully and consistently managed when combining the clinical data source with Song, for instance when packaging data into downloadable archives or when indexing data for search purposes. This is technical complexity that doesn't need to exist.
  2. Song was maintaining extra APIs for managing Donors/Specimen/Samples that were actually quite challenging (if not impossible) to use for modifying and correcting incorrectly submitted data. While losing this functionality is regrettable, it is beneficial to not have to maintain and fix it.
joneubank commented 3 weeks ago

Tracking changes for this feature in the branch: feat/refactor-analyis-data-model