va-big-data-genomics / trellis-mvp-functions

Trellis serverless data management framework for variant calling of VA MVP whole-genome sequencing data.

Ubam nodes not related to FastqToUbam jobs #26

Closed pbilling closed 2 years ago

pbilling commented 2 years ago

Some Ubam nodes (n=4,719) are not related to the FastqToUbam jobs that generated them.

// Cypher query to count disconnected Ubams
MATCH (u:Ubam)
WHERE NOT (u)<-[:GENERATED]-(:FastqToUbam:Job)
RETURN COUNT(u)
COUNT(u)
4,719
pbilling commented 2 years ago

Each FastqToUbam job has a property named "output_UBAM" with the URI for the expected output (e.g. "gs://bucket/path"). For FastqToUbam jobs not related to a Ubam, I could query for those missing Ubams and connect them.

Ubams do not have a URI property so I could either:

  1. Create a query that dynamically generates the URI, which will probably be computationally expensive and slow.
  2. Create a batch query to generate URIs for all Ubams and use the apoc.periodic.iterate() function to apply it. This also improves the usability of the database for the future.

So, I will add a URI property to all the Ubams.
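
For comparison, option 1 might look roughly like the sketch below. It assumes the bucket and path properties on each Ubam combine into the same URI format stored in output_UBAM. Because the URI is computed at query time for every candidate Ubam, the comparison cannot be backed by an index, which is the likely source of the cost.

```cypher
// Sketch of option 1: build the Ubam URI on the fly (no stored uri property)
MATCH (job:FastqToUbam)
WHERE NOT (job)-[:GENERATED]->(:Ubam)
MATCH (u:Ubam)
WHERE job.output_UBAM = 'gs://' + u.bucket + '/' + u.path
RETURN COUNT(job)
```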

pbilling commented 2 years ago

Adding URI property to Ubams

// Cypher query
// The inner statement receives each u from the outer query, so it can SET directly
CALL apoc.periodic.iterate(
    "MATCH (u:Ubam) WHERE NOT EXISTS(u.uri) RETURN u",
    "SET u.uri = 'gs://' + u.bucket + '/' + u.path",
    {batchSize: 200, parallel: true})
// Add index on Ubam URI property (run as a separate statement)
CREATE INDEX ON :Ubam(uri)
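
A quick sanity check after the batch update could look like the following sketch. It assumes every Ubam has both bucket and path set; a Ubam missing either would end up with a null uri and show up in this count.

```cypher
// Count Ubams still missing a uri; expected to be 0 once the update has finished
MATCH (u:Ubam)
WHERE u.uri IS NULL
RETURN COUNT(u)
```
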
pbilling commented 2 years ago

Estimate number of relationships that will be added

Based on the number of disconnected Ubams we identified, we expect about 4,700 relationships to be added. But that query started from the (:Ubam) end of the relationship, so let's also run a query from the (:FastqToUbam) end.

// Cypher query
MATCH (job:FastqToUbam)
WHERE NOT (job)-[:GENERATED]->(:Ubam)
WITH job
MATCH (u:Ubam)
WHERE job.output_UBAM = u.uri
RETURN COUNT(job)
Count
4,599

The result is not exactly the same (4,599 vs. 4,719, a difference of 120), so there may be some other issues as well, but the count is close enough that I'm going to proceed and worry about the small proportion of unaccounted Ubams later.
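
One way to look at the unaccounted Ubams is to list disconnected Ubams whose uri matches no job's output_UBAM. This is only a sketch: it reuses the labels and properties from the queries above, and the LIMIT of 25 is an arbitrary sample size.

```cypher
// Disconnected Ubams whose uri matches no FastqToUbam job's output_UBAM
MATCH (u:Ubam)
WHERE NOT (u)<-[:GENERATED]-(:FastqToUbam:Job)
OPTIONAL MATCH (job:FastqToUbam)
WHERE job.output_UBAM = u.uri
WITH u, job
WHERE job IS NULL
RETURN u.uri
LIMIT 25
```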

pbilling commented 2 years ago

Create relationships between FastqToUbam and Ubam nodes

// Cypher query
MATCH (job:FastqToUbam), (u:Ubam) 
WHERE NOT (job)-[:GENERATED]->(:Ubam) 
AND NOT (u)<-[:GENERATED]-() 
AND job.output_UBAM = u.uri
WITH job, u
MERGE (job)-[:GENERATED]->(u)
RETURN COUNT(job)

If this query were taking a long time, we could also apply it in batches using apoc.periodic.iterate(), as we did previously.
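
A batched version might look like the sketch below. The batch size of 200 is carried over from the earlier update, and parallel is set to false here as an assumption: creating a relationship locks both of its end nodes, so parallel batches can run into deadlocks.

```cypher
// Sketch: create the missing relationships in batches
CALL apoc.periodic.iterate(
    "MATCH (job:FastqToUbam) WHERE NOT (job)-[:GENERATED]->(:Ubam)
     MATCH (u:Ubam {uri: job.output_UBAM}) WHERE NOT (u)<-[:GENERATED]-()
     RETURN job, u",
    "MERGE (job)-[:GENERATED]->(u)",
    {batchSize: 200, parallel: false})
```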

pbilling commented 2 years ago

Check that relationships have been added

// Cypher query
MATCH (u:Ubam)
WHERE NOT (u)<-[:GENERATED]-(:FastqToUbam)
RETURN COUNT(u)
COUNT(u)
120

We see that the number has dropped from 4,719 to 120, which is great, but some Ubams are still not connected to FastqToUbam nodes. Further examination suggests this is a separate issue of (:Job) and (:FastqToUbam) nodes not being properly merged, so that we end up with a pattern that looks like this: (:Job)-[:GENERATED]->(:Ubam)<-[:GENERATED]-(:FastqToUbam). I'll address this in a separate issue.
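
To see what the remaining Ubams are attached to, a diagnostic along these lines could help. It is a sketch that groups them by the labels of whichever node has a GENERATED relationship to them; Ubams with no generator at all appear with a null label list.

```cypher
// Group the remaining disconnected Ubams by the labels of the node that generated them
MATCH (u:Ubam)
WHERE NOT (u)<-[:GENERATED]-(:FastqToUbam)
OPTIONAL MATCH (n)-[:GENERATED]->(u)
RETURN labels(n) AS generator_labels, COUNT(DISTINCT u) AS ubams
ORDER BY ubams DESC
```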