va-big-data-genomics / trellis-mvp-functions

Trellis serverless data management framework for variant calling of VA MVP whole-genome sequencing data.

Newly delivered samples are not being added to database #41

Closed: pbilling closed this issue 1 year ago

pbilling commented 1 year ago

The last sample added to the database was created in June 2022.

pbilling commented 1 year ago

One issue is that Personalis created a new "va_mvp_phase3" directory to deliver new samples and has also changed the file naming conventions, without providing any notification. These changes occurred in December of 2022.

pbilling commented 1 year ago

Manually checking cloud storage, it looks like the last genome from batch 2 was delivered in June 2022, so the new directory structure & naming conventions seem to be the sole culprit here. I'll need to deploy a hotfix so that Trellis recognizes the new pattern.

pbilling commented 1 year ago

New object pattern examples:

Looks like only the fastqs have changed.

pbilling commented 1 year ago

Added new match patterns in config/phase3/from-personalis/create-node-config.py in the NodeKinds class.

pbilling commented 1 year ago

I also need to update the label_functions that extract values from the object name. The existing approach uses positional helper methods (mate_pair_name_0, read_group_name_1). These seem fragile and unnecessary; I think I can just extract the values using the regex groups.
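A minimal sketch of the difference, assuming a made-up phase 3 object name; the real Personalis naming convention and the actual Trellis helpers are not shown in this thread, so `FASTQ_NAME`, `labels_from_name`, and `mate_pair_by_position` are hypothetical. It contrasts a positional helper with named regex groups:

```python
import re

# Hypothetical pattern; the real match patterns live in create-node-config.py.
FASTQ_NAME = re.compile(
    r"va_mvp_phase3/(?P<plate>[^/]+)/(?P<sample>[^/]+)/FASTQ/"
    r"(?P=sample)_L(?P<lane>\d+)_R(?P<matePair>[12])\.fastq\.gz"
)


def labels_from_name(object_name):
    """Return every named group as a dict, or None if the name does not match."""
    match = FASTQ_NAME.fullmatch(object_name)
    return match.groupdict() if match else None


def mate_pair_by_position(object_name):
    """Hypothetical positional helper: takes the last '_' field of the basename."""
    stem = object_name.rsplit("/", 1)[-1].split(".")[0]
    return stem.split("_")[-1].lstrip("R")


name = "va_mvp_phase3/PLATE02/SHIP002/FASTQ/SHIP002_L003_R2.fastq.gz"
print(labels_from_name(name))
print(mate_pair_by_position(name))
```

If the fields in a name are reordered, the positional helper returns the wrong field without complaint, while the pattern simply fails to match and returns None, which is easier to notice.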

pbilling commented 1 year ago

Changes have been incorporated into hotfix-1.2.9

Workflow:

pbilling commented 1 year ago

Verified that Fastq, JSON, and checksum nodes have all been added to the database and are being related by...

But FastqToUbam jobs are not being triggered.

pbilling commented 1 year ago

Troubleshooting why LaunchFastqToUbam trigger is not being activated:

Under the old naming conventions I parsed the read group from a value in the object name. In the new names that value has been replaced by a lane index, so fastqs are now missing the 'readGroup' property.

pbilling commented 1 year ago

Also, I'm realizing I need to update the get_fastq_metadata() function with logic to parse phase2 and phase3 Fastqs, not just the new phase3 ones.

pbilling commented 1 year ago

Added logic to get Fastq properties from fullmatch.groupdict() in get_fastq_metadata(): https://github.com/StanfordBioinformatics/trellis-mvp-functions/commit/7d7d03070231802847285dda686e000287acd373.
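For illustration only, here is a sketch of how a function like get_fastq_metadata() could handle both phases by trying each pattern and reading properties from `fullmatch.groupdict()`. The patterns, the `phase` key, and the name `get_fastq_metadata_sketch` are assumptions; the linked commit is the authoritative version.

```python
import re

# Hypothetical patterns for both delivery phases (not the real ones).
FASTQ_PATTERNS = {
    "phase2": re.compile(
        r"va_mvp_phase2/(?P<plate>[^/]+)/(?P<sample>[^/]+)/FASTQ/"
        r"(?P=sample)_(?P<readGroup>\d+)_R(?P<matePair>[12])\.fastq\.gz"
    ),
    "phase3": re.compile(
        r"va_mvp_phase3/(?P<plate>[^/]+)/(?P<sample>[^/]+)/FASTQ/"
        r"(?P=sample)_L(?P<lane>\d+)_R(?P<matePair>[12])\.fastq\.gz"
    ),
}

# Groups that should be stored as integers rather than strings.
INT_FIELDS = {"readGroup", "lane", "matePair"}


def get_fastq_metadata_sketch(object_name):
    """Try each phase pattern in turn and build properties from its named groups."""
    for phase, pattern in FASTQ_PATTERNS.items():
        match = pattern.fullmatch(object_name)
        if match is None:
            continue
        metadata = {
            key: int(value) if key in INT_FIELDS else value
            for key, value in match.groupdict().items()
        }
        metadata["phase"] = phase
        return metadata
    raise ValueError(f"Unrecognized Fastq object name: {object_name!r}")
```

In this sketch a phase 3 name yields a `lane` property but no `readGroup`, so anything downstream that requires `readGroup` would still need its own handling.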

I tried writing a local test for this behavior, but it was turning out to be a pain and I'm going to deprecate these methods soon anyway, so I am just going to test interactively in the test environment.

pbilling commented 1 year ago

I'm testing "adding" Fastqs by updating their metadata values. This way I can signal to Trellis that the state of an object has changed without having to actually move or change any object data.
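A rough sketch of that kind of metadata-only update with the google-cloud-storage client, assuming the Trellis functions react to the bucket's metadata-update events. The metadata key `trellis-retrigger` and the function names are made up for illustration:

```python
from datetime import datetime, timezone


def touch_object_metadata(blob, key="trellis-retrigger"):
    """Rewrite one custom metadata key on a blob without touching its data."""
    metadata = dict(blob.metadata or {})
    metadata[key] = datetime.now(timezone.utc).isoformat()
    blob.metadata = metadata
    blob.patch()  # sends only the changed properties; object bytes stay as-is
    return metadata


def main(bucket_name, object_name):
    # Imported here so the helper above can be exercised without the library.
    from google.cloud import storage

    client = storage.Client()
    blob = client.bucket(bucket_name).get_blob(object_name)
    if blob is None:
        raise FileNotFoundError(f"gs://{bucket_name}/{object_name}")
    return touch_object_metadata(blob)
```

Writing a fresh timestamp means the value differs on every call, so repeated runs are not sent as identical no-op updates.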

pbilling commented 1 year ago

Steps I use to validate the addition of an object to the database:

  1. Check the GCP Error Reporting service to make sure there are no function errors
  2. Check the function logs using "Logs Explorer" in "Operations Logging" to identify any aberrant behavior
  3. Check the database for the object node

Usually I sort of work backwards from 3 towards 1 since if there is an issue it will likely manifest in the end product (database node) and then 1 & 2 can be used for debugging.

Simple query to get logs for Cloud Functions:

resource.type="cloud_function"
severity=(DEFAULT OR DEBUG OR INFO OR NOTICE OR WARNING OR ERROR OR CRITICAL OR ALERT OR EMERGENCY)
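The query above matches every severity, which gets noisy. As a sketch, a small helper can build a narrower filter string to paste into Logs Explorer; `build_function_log_filter` and the example function name are hypothetical, while `resource.labels.function_name` and `severity>=` are standard Logging query language fields:

```python
SEVERITIES = [
    "DEFAULT", "DEBUG", "INFO", "NOTICE", "WARNING",
    "ERROR", "CRITICAL", "ALERT", "EMERGENCY",
]


def build_function_log_filter(function_name=None, min_severity="DEFAULT"):
    """Build a Logs Explorer filter; clauses on separate lines are ANDed."""
    if min_severity not in SEVERITIES:
        raise ValueError(f"Unknown severity: {min_severity!r}")
    clauses = ['resource.type="cloud_function"']
    if function_name:
        clauses.append(f'resource.labels.function_name="{function_name}"')
    if min_severity != "DEFAULT":
        clauses.append(f"severity>={min_severity}")
    return "\n".join(clauses)


# Example function name is made up for illustration.
print(build_function_log_filter("trellis-create-blob-node", "WARNING"))
```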

pbilling commented 1 year ago

Trellis seems to be ingesting and processing phase 3 data properly. The only issue I noticed is that two "Genome" nodes were generated. This shouldn't break anything and is probably an artifact of me adding these nodes, deleting them, and then adding them again with a second update. Trellis still shouldn't generate duplicates, but I'm not going to spend time debugging now since the v1.3 update will deprecate these methods.

Most recent hotfix commit: https://github.com/StanfordBioinformatics/trellis-mvp-functions/commit/42b384eee3c1ab6ff696eee9cac2e96c452cde95.