va-big-data-genomics / trellis-mvp-data-modelling

Resources used for interacting with and updating the Trellis MVP graph database model.
MIT License

Create graph nodes to represent database tables #1

Open pbilling opened 1 year ago

pbilling commented 1 year ago

Sample-level quality control data (FastQC, Flagstat, Vcfstats) is stored in CloudSQL database tables. It's not easy to trace the provenance of data from GCS -> CloudSQL right now, and it's also not easy to know where these tables live. I think it would be good to represent these data stores in the graph.

pbilling commented 1 year ago

Basic graph model:

graph TD
    dsub[Dsub Job] -- GENERATED --> qaTable[Quality Assessment Table]
    qaTable -- INSERTED_INTO --> dbTable[Database Table]
    db[Database] -- HAS_TABLE --> dbTable

pbilling commented 1 year ago

Need to generate two parameterized database queries:

pbilling commented 1 year ago

Cypher query to create Database Table & Database nodes:

MERGE (db:Database {name: ${database_name}})
ON CREATE SET
  db.private_ip_address = ${private_ip},
  db.vcpus_count = ${vcpus},
  db.memory_gb = ${memory},
  db.storage_gb = ${storage},
  db.software = ${software},
  db.version = ${version},
  db.zone = ${zone},
  db.network = ${network}
MERGE (db)-[rel:HAS_TABLE]->(dbTable:DatabaseTable {name: "fastqc"})
ON CREATE SET
  dbTable.schema = ${table_schema}
RETURN db, rel, dbTable
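
The second query would presumably create the INSERTED_INTO relationship between the quality-assessment output and the database table, per the graph model above. A rough sketch, assuming the QA output is modelled as a :Blob node identified by a uri property (that label and the parameter names are assumptions, not settled schema):

// Sketch only: link a QA output object to the table its rows were loaded into.
// The :Blob label and the ${blob_uri}/${table_name} parameters are assumptions.
MATCH (qa:Blob {uri: ${blob_uri}})
MATCH (dbTable:DatabaseTable {name: ${table_name}})
MERGE (qa)-[rel:INSERTED_INTO]->(dbTable)
RETURN qa, rel, dbTable
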
pbilling commented 1 year ago

The challenge in connecting the QA table to the DB table is that there is no obvious mapping. The logs from Cloud SQL aren't going to know anything about the source file that the data was read from. Maybe I can connect the logs to the Trellis job, then connect the job to the input... and then that input to the database table.

First, I need to see what the logs for CloudSQL operations look like.
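
If that chain exists in the graph, provenance could then be read back with a single path query. A rough sketch, where the :Blob and :Job labels and the INPUT_TO relationship type are assumed names rather than settled schema:

// Sketch only: trace a database table back to the source objects its rows came from.
// Labels and relationship types here (Blob, Job, INPUT_TO) are assumed, not confirmed.
MATCH path = (source:Blob)-[:INPUT_TO]->(job:Job)-[:GENERATED]->(qa:Blob)
             -[:INSERTED_INTO]->(dbTable:DatabaseTable {name: "fastqc"})
RETURN path
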

pbilling commented 1 year ago

CloudSQL logs don't appear to show anything related to database operations. They just show automatic housekeeping messages: "automatic analyze of table" and "automatic vacuum of table".

pbilling commented 1 year ago

An alternative solution is to just send the DatabaseQueryRequests from the postgres-insert-data Trellis function as the operations are performed. I think this is fine, and it has the advantage that the function knows both ends of the operation: the source object and the database/table.

pbilling commented 1 year ago

I also realized that the implementation of the postgres-insert-data function includes a design pattern that runs counter to the rest of Trellis. Inserting QC data into the Postgres database is handled by that function, but the operation is tracked in Neo4j as a Job node. In every other instance, a Job node refers to an operation that was performed on a virtual machine using a Docker image and a scheduler. Operations that are performed by Trellis functions are not tracked in the database because they aren't considered "jobs", since they aren't transforming data. Here, the data isn't being transformed but it is being loaded into the database; so, should this be considered a Job, should it have a new label, or should it just be an invisible internal Trellis operation?
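
If it gets a new label, a sketch might look like the following: the load operation stays visible in the graph but is distinct from VM-based Job nodes. All of the names here (:DatabaseImport, INPUT_TO, LOADED_INTO, and the properties) are hypothetical, just to make the option concrete:

// Hypothetical sketch of the "new label" option: track the load as its own node type,
// distinct from VM-based Job nodes. All names below are illustrative, not settled schema.
MATCH (blob:Blob {uri: ${blob_uri}})
MATCH (dbTable:DatabaseTable {name: ${table_name}})
MERGE (op:DatabaseImport {id: ${operation_id}})
ON CREATE SET
  op.function = "postgres-insert-data",
  op.created = timestamp()
MERGE (blob)-[:INPUT_TO]->(op)
MERGE (op)-[:LOADED_INTO]->(dbTable)
RETURN blob, op, dbTable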