Open pbilling opened 1 year ago
Basic graph model:
```mermaid
graph TD
    dsub[Dsub Job] -- GENERATED --> qaTable[Quality Assessment Table]
    qaTable -- INSERTED_INTO --> dbTable[Database Table]
    db[Database] -- HAS_TABLE --> dbTable
```
Need to generate two parameterized database queries.
Cypher query to create Database Table & Database nodes:
```cypher
MERGE (db:Database {name: $database_name})
ON CREATE SET
    db.private_ip_address = $private_ip,
    db.vcpus_count = $vcpus,
    db.memory_gb = $memory,
    db.storage_gb = $storage,
    db.software = $software,
    db.version = $version,
    db.zone = $zone,
    db.network = $network
MERGE (db)-[rel:HAS_TABLE]->(dbTable:DatabaseTable {name: "fastqc"})
ON CREATE SET
    dbTable.schema = $table_schema
RETURN db, rel, dbTable
```
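As a minimal sketch of the client side, the parameter map for the query above could be assembled and validated before it is handed to the Neo4j driver's `$`-placeholder substitution. The `build_params` helper and the `config` dict are illustrative assumptions, not actual Trellis code; only the parameter names come from the query itself.

```python
# Sketch only: assemble the parameter dict that a Neo4j driver would
# substitute for the $-placeholders in the MERGE query above.
# build_params and the config dict are hypothetical, not Trellis code.

REQUIRED_PARAMS = [
    "database_name", "private_ip", "vcpus", "memory", "storage",
    "software", "version", "zone", "network", "table_schema",
]

def build_params(config: dict) -> dict:
    """Validate and extract the query parameters from an instance config."""
    missing = [key for key in REQUIRED_PARAMS if key not in config]
    if missing:
        raise ValueError(f"Missing query parameters: {missing}")
    return {key: config[key] for key in REQUIRED_PARAMS}
```

With the official Python driver, the resulting dict would then be passed alongside the query text (e.g. `session.run(query, params)`), so the database handles the substitution rather than string formatting.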
The challenge in connecting the QA table to the database table is that there is no obvious mapping. The Cloud SQL logs aren't going to know anything about the source file the data was read from. Maybe I can connect the logs to the Trellis job, then connect the job to the input, and then that input to the database table.
First, I need to see what the logs for CloudSQL operations look like.
Cloud SQL logs don't appear to show anything related to database operations. The logs just show automatic housekeeping entries: "automatic analyze of table" and "automatic vacuum of table".
An alternative solution is to just send the DatabaseQueryRequests from the postgres-insert-data Trellis function as the operations are performed. I think this is fine, and it has the advantage that the function knows both ends of the operation: the source object and the database/table.
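As a sketch of that alternative, the postgres-insert-data function could build a query-request payload after each successful insert, since it holds both the source object path and the destination table. The message fields, the `Blob` label, and the relationship pattern below are illustrative assumptions about the Trellis message schema, not the real one; only `HAS_TABLE` and `INSERTED_INTO` come from the graph model above.

```python
import json

def build_query_request(source_object: str, database: str, table: str) -> bytes:
    """Build an illustrative DatabaseQueryRequest-style payload linking the
    GCS source object to the destination database table.

    The Blob label, field names, and message layout are assumptions for
    this sketch, not the actual Trellis schema.
    """
    message = {
        "resource": "query",
        "query": (
            "MATCH (obj:Blob {path: $source_object}) "
            "MATCH (db:Database {name: $database})"
            "-[:HAS_TABLE]->(tbl:DatabaseTable {name: $table}) "
            "MERGE (obj)-[:INSERTED_INTO]->(tbl) "
            "RETURN obj, tbl"
        ),
        "parameters": {
            "source_object": source_object,
            "database": database,
            "table": table,
        },
    }
    # Pub/Sub message bodies are bytes, so serialize before publishing.
    return json.dumps(message).encode("utf-8")
```

The function would publish this payload to the database-query topic immediately after the Postgres insert succeeds, so the provenance edge is only created for rows that actually landed in the table.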
I also realized that the implementation of the postgres-insert-data function includes a design pattern that runs counter to the rest of Trellis. Inserting QC data into the Postgres database is handled by that function, but the operation is tracked in Neo4j as a Job node. In every other instance, a Job node refers to an operation performed on a virtual machine using a Docker image and a scheduler. Operations performed by Trellis functions are not tracked in the database because they aren't considered "jobs", since they aren't transforming data. Here, the data isn't being transformed, but it is being loaded into the database; so should this be considered a Job, should it get a new label, or should it just be an invisible internal Trellis operation?
Sample-level quality control data (FastQC, Flagstat, Vcfstats) is stored in Cloud SQL database tables. Right now it's not easy to trace the provenance of data from GCS to Cloud SQL, and it's not easy to know where these tables live. I think it would be good to represent these data stores in the graph.