va-big-data-genomics / trellis-mvp-functions

Trellis serverless data management framework for variant calling of VA MVP whole-genome sequencing data.

Represent data stores in metadata store #46

Open · pbilling opened this issue 1 year ago

pbilling commented 1 year ago

Trellis uses a Neo4j database to store metadata related to data objects and jobs. Data generated by quality control procedures such as FastQC, Flagstat, and Vcfstats is stored in a Postgres relational database. Right now, the process of loading data from these CSV result files into the relational database is not tracked in Neo4j.

The main downside is a lack of visibility: we have observed that not all results are being successfully loaded into the database, and because the load operations are not tracked, those failures are harder to diagnose and resolve.

pbilling commented 1 year ago

Current design

The standard way Trellis tracks provenance is through relationships borrowed from the PROV-O ontology: "GENERATED" and "WAS_USED_BY". The implementation looks like this:

```mermaid
graph TD
    a[Input] -- WAS_USED_BY --> b[Job]
    b -- GENERATED --> c[Output]
```
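For concreteness, here is a minimal sketch of how this pattern might be written to Neo4j with the official Python driver. The node labels, properties, and connection details are illustrative assumptions, not the actual Trellis schema:

```python
from neo4j import GraphDatabase

# Connection details are placeholders, not the actual Trellis deployment values.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Hypothetical node labels and properties; Trellis's real schema may differ.
CREATE_PROVENANCE = """
MERGE (input:Blob {uri: $input_uri})
MERGE (job:Job {id: $job_id})
MERGE (output:Blob {uri: $output_uri})
MERGE (input)-[:WAS_USED_BY]->(job)
MERGE (job)-[:GENERATED]->(output)
"""

with driver.session() as session:
    session.run(
        CREATE_PROVENANCE,
        input_uri="gs://example-bucket/sample.bam",
        job_id="fastqc-12345",
        output_uri="gs://example-bucket/sample_fastqc.zip",
    )
driver.close()
```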

In this example, "Input" and "Output" both refer to objects in Cloud Storage. Tracking the output of a job is relatively straightforward when the output is an object, because we can create a storage bucket trigger that is activated every time a new object is added.
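As a sketch of that trigger mechanism, a first-generation background Cloud Function can subscribe to object-finalize events on a bucket. The function name and body below are illustrative, not the actual Trellis code:

```python
def handle_new_object(event, context):
    """Background Cloud Function triggered by google.storage.object.finalize.

    `event` carries the Cloud Storage object metadata; `context` carries the
    event ID and timestamp. A Trellis-style function would turn this metadata
    into a node in the Neo4j store.
    """
    bucket = event["bucket"]
    name = event["name"]
    print(f"New object finalized: gs://{bucket}/{name}")
    # ...create or update the corresponding metadata node here...
```

Deployed (1st gen) with something like: `gcloud functions deploy handle_new_object --runtime python39 --trigger-resource BUCKET_NAME --trigger-event google.storage.object.finalize`.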

Ideally, I would like to preserve the same design pattern for adding records to the metadata store by replacing the trigger that is activated by changes to Cloud Storage with one that is activated by changes to the Postgres database. On the implementation side, though, I think this will be significantly more challenging: Google has native methods for creating Cloud Storage triggers, but I don't think the same is true for Cloud SQL.

pbilling commented 1 year ago

Looks like there is a pre-GA offering to create function triggers based on Firestore changes, but I don't use Cloud Firestore.

pbilling commented 1 year ago

This StackOverflow question seems to align with what I want. The top suggestion is to create a Stackdriver (now Cloud Logging) sink for Cloud SQL operations and then parse the log entries from there. The downside is receiving all the database updates, but given that I am pretty much only doing write operations, I think this is OK.
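As a sketch of that approach, assuming the sink routes Cloud SQL log entries to a Pub/Sub topic, a background Cloud Function could decode each message and inspect it. The field names are assumptions about the LogEntry format, not verified against actual Cloud SQL entries:

```python
import base64
import json

def handle_cloudsql_log(event, context):
    """Background Cloud Function triggered by Pub/Sub messages from a log sink.

    Each message's `data` field is a base64-encoded LogEntry in JSON form.
    """
    entry = json.loads(base64.b64decode(event["data"]).decode("utf-8"))

    # Which fields identify a Postgres write depends on the Cloud SQL log
    # format; this is a placeholder check, not a verified filter.
    payload = entry.get("protoPayload") or entry.get("textPayload", "")
    print(f"Cloud SQL log entry received: {payload}")
    # ...update the Neo4j metadata store to record the load operation here...
```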

pbilling commented 1 year ago

These look like the appropriate docs for creating a Cloud Logging sink. I don't want to just add it manually, though; I want to use Terraform so it's integrated with the rest of the infrastructure configuration.

pbilling commented 1 year ago

Terraform has multiple Google logging sink resources, but this looks like the most appropriate one: google_logging_project_sink.
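To illustrate the two pieces a sink needs (a filter and a destination), here is the equivalent operation through the google-cloud-logging Python client. The filter expression, topic, and project names are placeholders, and in practice this would be declared in Terraform as a google_logging_project_sink resource rather than created in code:

```python
from google.cloud import logging

client = logging.Client(project="my-project")  # placeholder project ID

# Placeholder filter: restrict the sink to Cloud SQL log entries.
log_filter = 'resource.type="cloudsql_database"'

# Placeholder destination: a Pub/Sub topic the Cloud Function subscribes to.
destination = "pubsub.googleapis.com/projects/my-project/topics/cloudsql-logs"

sink = client.sink("cloudsql-write-sink", filter_=log_filter, destination=destination)
if not sink.exists():
    sink.create()
```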

I notice that in the current Terraform setup, logging sinks are created using modules. I don't like modules because I don't understand them and they seem too complicated.