pbilling closed this issue 2 years ago
Each FastqToUbam job has a property named "output_UBAM" with the URI for the expected output (e.g. "gs://bucket/path"). For FastqToUbam jobs not related to a Ubam, I could query for those missing Ubams and connect them.
Ubams do not have a URI property so I could either:
apoc.periodic.iterate() function to apply it. This also improves the usability of the database for the future. So, I will add a URI property to all the Ubams.
Adding URI property to Ubams
// Cypher query
CALL apoc.periodic.iterate(
"MATCH (u:Ubam) WHERE NOT EXISTS(u.uri) RETURN u",
"MATCH (u) SET u.uri = 'gs://' + u.bucket + '/' + u.path",
{batchSize: 200, parallel:true})
// Add index on Ubam URI property
CREATE INDEX ON :Ubam(uri)
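As a sanity check after the update, a query along these lines could confirm that every Ubam now has a URI. This is a sketch that reuses the same EXISTS() syntax as the update above. In Cypher, concatenating a string with null yields null, and setting a property to null leaves it unset, so any Ubam missing a bucket or path value would still show up here.

```cypher
// Cypher query (sketch)
// Count Ubam nodes that still lack a uri property after the update.
// A non-zero count would point to nodes with a null bucket or path.
MATCH (u:Ubam)
WHERE NOT EXISTS(u.uri)
RETURN COUNT(u)
```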
Estimate number of relationships that will be added
Based on how many disconnected Ubams we identified, we expect about 4,700 relationships to be added. But that query was run from the (:Ubam) end of the relationship, so let's also run a query from the (:FastqToUbam) end.
// Cypher query
MATCH (job:FastqToUbam)
WHERE NOT (job)-[:GENERATED]->(:Ubam)
WITH job
MATCH (u:Ubam)
WHERE job.output_UBAM = u.uri
RETURN COUNT(job)
Count
4,599
The result is not exactly the same (4,599 versus the 4,719 disconnected Ubams), so there may be some other issues as well, but the count is close enough that I'm going to proceed and worry about the small proportion of unaccounted-for Ubams later.
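To look at the unaccounted-for Ubams directly, one option is to list disconnected Ubams whose URI matches no job's output_UBAM value. This is only a sketch: it assumes the uri property has already been set on the Ubams, and the LIMIT of 25 is an arbitrary choice for inspection.

```cypher
// Cypher query (sketch)
// List disconnected Ubams for which no FastqToUbam job
// reports a matching output_UBAM value.
MATCH (u:Ubam)
WHERE NOT (u)<-[:GENERATED]-(:FastqToUbam)
OPTIONAL MATCH (job:FastqToUbam)
WHERE job.output_UBAM = u.uri
WITH u, COUNT(job) AS matching_jobs
WHERE matching_jobs = 0
RETURN u.uri
LIMIT 25
```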
Create relationships between FastqToUbam and Ubam nodes
// Cypher query
MATCH (job:FastqToUbam), (u:Ubam)
WHERE NOT (job)-[:GENERATED]->(:Ubam)
AND NOT (u)<-[:GENERATED]-()
AND job.output_UBAM = u.uri
WITH job, u
MERGE (job)-[:GENERATED]->(u)
RETURN COUNT(job)
If this query were taking a long time, we could also apply it using apoc.periodic.iterate(), as we did previously.
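A batched version could be sketched with apoc.periodic.iterate(), the same procedure used earlier to add the URI property. This assumes an APOC version in which variables returned by the first statement (here, job) are bound in the second statement. The batch size is carried over from the earlier call rather than tuned. Unlike the property update, parallel is set to false here, because creating relationships locks the nodes at both ends and parallel batches can contend for those locks.

```cypher
// Cypher query (sketch)
CALL apoc.periodic.iterate(
// Outer statement: jobs that have not generated any Ubam yet
"MATCH (job:FastqToUbam) WHERE NOT (job)-[:GENERATED]->(:Ubam) RETURN job",
// Inner statement: find the matching unconnected Ubam by URI and relate it
"MATCH (u:Ubam) WHERE u.uri = job.output_UBAM AND NOT (u)<-[:GENERATED]-() MERGE (job)-[:GENERATED]->(u)",
{batchSize: 200, parallel:false})
```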
Check that relationships have been added
// Cypher query
MATCH (u:Ubam)
WHERE NOT (u)<-[:GENERATED]-(:FastqToUbam)
RETURN COUNT(u)
COUNT(u)
120
We see that the number has dropped from 4,719 to 120, which is great, but there are some Ubams which are still not connected to FastqToUbam nodes. Further examination suggests this is a separate issue of (:Job) and (:FastqToUbam) nodes not being properly merged, so that we end up with a pattern that looks like this: (:Job)-[:GENERATED]->(:Ubam)<-[:GENERATED]-(:FastqToUbam). I'll address this in a separate issue.
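One way to examine the remaining Ubams is to group them by the labels of whatever node did generate them. This is a sketch, not necessarily the query used for the examination above. The aliases src, generator_labels and ubams are illustrative names. Ubams with no incoming GENERATED relationship at all would appear with a null label list.

```cypher
// Cypher query (sketch)
// For Ubams still lacking a FastqToUbam parent, show which
// node labels (if any) are on the other end of GENERATED.
MATCH (u:Ubam)
WHERE NOT (u)<-[:GENERATED]-(:FastqToUbam)
OPTIONAL MATCH (src)-[:GENERATED]->(u)
RETURN labels(src) AS generator_labels, COUNT(DISTINCT u) AS ubams
```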
Some Ubam nodes (n=4,719) are not related to the FastqToUbam jobs that generated them.