pbilling opened 1 year ago
The standard method Trellis uses for tracking provenance is through relationships borrowed from the PROV-O ontology: "GENERATED" and "WAS_USED_BY". Implementation looks like this:
```mermaid
graph TD
a[Input] -- WAS_USED_BY --> b[Job]
b -- GENERATED --> c[Output]
```
In this example, "Input" and "Output" both refer to objects in Cloud Storage. Tracking the output of a job is relatively straightforward when the output is an object, because we can create a storage bucket trigger that is activated every time a new object is added.
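To make the pattern concrete, here is a minimal sketch of what such a storage-triggered handler could look like. This is an assumption-laden illustration, not Trellis's actual code: the node labels (`Blob`, `Job`), property names, the hardcoded `job-123` ID, and the function names are all hypothetical, and a real handler would send the Cypher statements to Neo4j rather than return them.

```python
# Hypothetical sketch of the pattern described above: a Cloud Storage
# trigger fires when a new object is added, and the handler records
# PROV-O-style relationships in Neo4j. Labels, properties, and the
# job ID are placeholders, not Trellis's actual schema.

def provenance_queries(bucket: str, name: str, job_id: str) -> list:
    """Build Cypher statements linking a blob to the job that used it."""
    blob_uri = f"gs://{bucket}/{name}"
    return [
        # (Input) -[:WAS_USED_BY]-> (Job)
        f"MERGE (b:Blob {{uri: '{blob_uri}'}}) "
        f"MERGE (j:Job {{id: '{job_id}'}}) "
        f"MERGE (b)-[:WAS_USED_BY]->(j)",
        # (Job) -[:GENERATED]-> (Output) would follow the same shape.
    ]

def on_object_finalized(event: dict, context=None) -> list:
    """Entry point shaped like a GCS-triggered Cloud Function.

    `event` carries the new object's metadata; a real implementation
    would run the queries against Neo4j instead of returning them.
    """
    # "job-123" stands in for however the owning job is resolved.
    return provenance_queries(event["bucket"], event["name"], job_id="job-123")
```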
Ideally, I would like to preserve the same design pattern for adding objects to the metadata store by replacing the trigger that is activated by changes to Cloud Storage with one that is activated by changes to the Postgres database. On the implementation side, I think this will be significantly more challenging: Google has native methods for creating Cloud Storage triggers, but I don't think the same is true for Cloud SQL.
Looks like there is a pre-GA offering to create function triggers based on Firestore changes, but I don't use Cloud Firestore.
This StackOverflow question seems to align with what I want. The top suggestion is to create a Stackdriver (now Cloud Logging) sink for Cloud SQL operations and then parse the exported log entries from there. The downside is receiving every database update, but given that I am pretty much only doing write operations, I think this is OK.
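On the parsing side, a sink routed to Pub/Sub delivers each LogEntry as a base64-encoded JSON payload, so the consuming function mostly just decodes and filters. A minimal sketch under those assumptions (the `resource.type` and `protoPayload.methodName` fields follow Cloud Logging's LogEntry format, but exactly which entries arrive depends on the sink filter):

```python
import base64
import json

# Sketch of a Pub/Sub-triggered function that filters Cloud SQL log
# entries exported by a logging sink. The exact entries received
# depend on the sink's filter; this only shows the decoding shape.

def extract_cloudsql_operation(event: dict):
    """Return the Cloud SQL method name from a sink-exported log entry,
    or None if the entry is not a Cloud SQL database log."""
    entry = json.loads(base64.b64decode(event["data"]))
    if entry.get("resource", {}).get("type") != "cloudsql_database":
        return None  # ignore anything else the filter let through
    return entry.get("protoPayload", {}).get("methodName")
```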
These look like the appropriate docs for creating a Cloud Logging sink. I don't want to just add it manually, though; I want to use Terraform to integrate it with the infrastructure configuration.
Terraform has multiple Google logging sink resources, but this looks like the most appropriate one: google_logging_project_sink.
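A rough sketch of what that resource could look like when routing Cloud SQL log entries to a Pub/Sub topic. The names ("cloudsql-ops", "cloudsql-write-sink") are placeholders, and the filter is only a starting point that would need to be narrowed to the relevant write operations:

```hcl
# Hypothetical sketch: export Cloud SQL log entries to a Pub/Sub topic,
# which could in turn trigger a function. All names are placeholders.
resource "google_pubsub_topic" "cloudsql_ops" {
  name = "cloudsql-ops"
}

resource "google_logging_project_sink" "cloudsql_write_sink" {
  name        = "cloudsql-write-sink"
  destination = "pubsub.googleapis.com/${google_pubsub_topic.cloudsql_ops.id}"
  filter      = "resource.type=\"cloudsql_database\""

  # Gives the sink its own service account; that identity still needs
  # roles/pubsub.publisher on the topic for entries to be delivered.
  unique_writer_identity = true
}
```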
I notice that in the current Terraform setup, logging sinks are created using modules. I don't like modules because I don't understand them and they seem too complicated.
Trellis uses a Neo4j database to store metadata related to data objects and jobs. Data generated by quality control procedures such as FastQC, Flagstat, and Vcfstats are stored in a Postgres relational database. Right now, the process of loading data from these CSV result files into the relational database is not tracked in Neo4j. This is a particular issue because we have observed that not all results are being successfully loaded into the database, and the lack of visibility makes that problem harder to diagnose and resolve.