va-big-data-genomics / trellis-mvp-functions

Trellis serverless data management framework for variant calling of VA MVP whole-genome sequencing data.

Write standard job launcher function #36

Open pbilling opened 1 year ago

pbilling commented 1 year ago

The current method for adding a new bioinformatics (or other) task to Trellis is to create a new Cloud Function specifically tailored to launching jobs for that task (e.g. "samtools flagstat"). Limitations of generating separate functions for each task include:

A better approach could be to write a single job launcher function and use a YAML configuration file to define the parameters of all the supported tasks. Benefits:

pbilling commented 1 year ago

When every job had its own launcher function, the job was determined by the pubsub topic that the database query result was published to. The topic was defined as part of the database query. How will I choose the job if all query results are routed through the same function?

I could update the QueryResponse classes to also include a field with the task to be launched.
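
A minimal sketch of how that could work, assuming the response carries a task name and the launcher looks it up in a task config. The QueryResponse class below is a simplified stand-in, not the trellisdata class; the field name job_request is borrowed from the error message later in this thread, and the TASK_CONFIGS keys, image names, and select_task() helper are hypothetical:

```python
from dataclasses import dataclass, field

# Hypothetical task configs, shaped like the result of parsing a YAML file.
# Task names, keys, and image names are illustrative, not the Trellis schema.
TASK_CONFIGS = {
    "fastq-to-ubam": {"image": "example/gatk", "command": "echo fastq-to-ubam"},
    "flagstat": {"image": "example/samtools", "command": "echo flagstat"},
}


@dataclass
class QueryResponse:
    """Simplified stand-in for a query response; only the task field matters here."""
    query_name: str
    job_request: str  # name of the task to launch (assumed field)
    nodes: list = field(default_factory=list)


def select_task(response: QueryResponse, configs: dict) -> dict:
    """Return the config for the task named in the response."""
    try:
        return configs[response.job_request]
    except KeyError:
        raise ValueError(f"Unsupported task: {response.job_request!r}") from None


response = QueryResponse(query_name="example-query", job_request="flagstat")
task = select_task(response, TASK_CONFIGS)
```

With this layout, a single function can serve every task because the routing decision moves from the pubsub topic into the message itself.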

pbilling commented 1 year ago

Another challenge: How do I specify output URIs? These paths involve multiple variables including ones from the Trellis config, JobLauncher config, and variables defined at runtime (task-id).

pbilling commented 1 year ago

The solution I'm settling on includes only input-specific variables in the task template. Other values (from the Trellis config or defined at runtime) will be applied uniformly to all job outputs, regardless of task. This changes the ordering of the output elements, but not the content.

An example in which the plate and sample values have been rearranged. The {OUT_BUCKET}, {task.name}, and {job_id} values will be defined in the job_launcher function, while the remaining values come from a task template:

Old: gs://{OUT_BUCKET}/{plate}/{sample}/{task.name}/{job_id}/output/{sample}_{read_group}.ubam

New: gs://{OUT_BUCKET}/{task.name}/{job_id}/output/{plate}/{sample}/{sample}_{read_group}.ubam

Here, the plate and sample values have been moved to the end because they (and read_group) are all obtained from the properties of one of the input objects using the string ".format()" method. The structure and types of these input-derived properties may also change from task to task. For instance, if a task requires combining inputs from multiple samples, then it doesn't make sense to put them in a path with a single set of plate/sample values.

Conversely, the {OUT_BUCKET}, {task.name}, and {job_id} values can be applied uniformly to all jobs, and their structure will be defined in the job_launcher function. So the new structure of the outputs reflects a functional organization of values based on how Trellis organizes tasks.
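
A rough sketch of that split, assuming the launcher builds the uniform prefix and the task template supplies only the input-derived suffix. The build_output_uri() helper, its parameter names, and the example values are hypothetical:

```python
def build_output_uri(out_bucket, task_name, job_id, template_suffix, input_props):
    """Join the launcher-defined prefix with a task-specific suffix."""
    # Prefix: applied uniformly to every job by the launcher.
    prefix = f"gs://{out_bucket}/{task_name}/{job_id}/output"
    # Suffix: filled from the properties of an input object via str.format().
    suffix = template_suffix.format(**input_props)
    return f"{prefix}/{suffix}"


uri = build_output_uri(
    out_bucket="example-out-bucket",
    task_name="fastq-to-ubam",
    job_id="job-123",
    template_suffix="{plate}/{sample}/{sample}_{read_group}.ubam",
    input_props={"plate": "plate-1", "sample": "sample-1", "read_group": 0},
)
print(uri)
# gs://example-out-bucket/fastq-to-ubam/job-123/output/plate-1/sample-1/sample-1_0.ubam
```

A task that combines several samples could supply a different suffix template without any change to the prefix logic.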

pbilling commented 1 year ago

Implement job launcher function and write tests

pbilling commented 1 year ago

Methods for populating dsub values from template implemented in 2b2754fc.
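
For illustration only, and not the code in that commit: a sketch of how dsub arguments might be populated from a task template. The template keys (image, command, inputs, outputs, envs) and the dsub_args_from_template() helper are assumptions; the --image, --command, --input, --output, and --env flags are standard dsub options:

```python
def dsub_args_from_template(template, input_props, output_prefix):
    """Build a list of dsub CLI arguments from a hypothetical task template."""
    args = ["--image", template["image"], "--command", template["command"]]
    # Input URIs are filled from properties of the input object.
    for name, value in template.get("inputs", {}).items():
        args += ["--input", f"{name}={value.format(**input_props)}"]
    # Output suffixes are appended to the prefix defined by the launcher.
    for name, value in template.get("outputs", {}).items():
        suffix = value.format(**input_props)
        args += ["--output", f"{name}={output_prefix}/{suffix}"]
    # Environment variables are also filled from input properties.
    for name, value in template.get("envs", {}).items():
        args += ["--env", f"{name}={value.format(**input_props)}"]
    return args


example_template = {
    "image": "example/image",
    "command": "run.sh",
    "inputs": {"FASTQ": "gs://{bucket}/{path}"},
    "outputs": {"UBAM": "{sample}.ubam"},
    "envs": {"SAMPLE": "{sample}"},
}
example_props = {"bucket": "in-bucket", "path": "reads.fastq", "sample": "s1"}
example_args = dsub_args_from_template(
    example_template, example_props, "gs://out-bucket/fastq-to-ubam/job-123/output"
)
```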

pbilling commented 1 year ago

Task: Create a Cloud Build trigger for Job Launcher function

pbilling commented 1 year ago

Task: Run integration test

Cloud Functions logs query:

resource.type="cloud_function"
severity=(DEFAULT OR INFO OR NOTICE OR WARNING OR ERROR OR CRITICAL OR ALERT OR EMERGENCY)

pbilling commented 1 year ago

Forgot to implement the parse_node_inputs() and parse_relationship_inputs() functions. The parse_inputs() function was originally designed to perform job-specific QA on the inputs, which is an idea I like, but it's not clear how to do that in a generic manner.

I can pretty easily check that the node labels and relationship type are correct, but more than that would require a pretty significant update to the job config template. I think I'll just keep it simple for now.
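
A sketch of what the simple version of that check might look like, assuming nodes and relationships arrive as plain dicts. The dict layout, the "expected" keys, the example label and type names, and the check_inputs() helper are all hypothetical, not the trellisdata API:

```python
def check_inputs(start, relationship, end, expected):
    """Return a list of mismatches between the inputs and the expected pattern."""
    errors = []
    if expected["start_label"] not in start["labels"]:
        errors.append(f"start node is missing label {expected['start_label']!r}")
    if relationship["type"] != expected["relationship_type"]:
        errors.append(f"relationship type is not {expected['relationship_type']!r}")
    if expected["end_label"] not in end["labels"]:
        errors.append(f"end node is missing label {expected['end_label']!r}")
    return errors


expected_pattern = {
    "start_label": "Fastq",
    "relationship_type": "HAS_MATE_PAIR",
    "end_label": "Fastq",
}
problems = check_inputs(
    start={"labels": ["Blob", "Fastq"]},
    relationship={"type": "HAS_MATE_PAIR"},
    end={"labels": ["Blob", "Bam"]},
    expected=expected_pattern,
)
```

An empty list means the inputs match the pattern; anything deeper, such as property-level checks, would need more detail in the job config template.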

pbilling commented 1 year ago

AttributeError: 'QueryResponseReader' object has no attribute 'job_request'

Probably need to upload new version of trellisdata package.

pbilling commented 1 year ago

Task: Update trellisdata package

pbilling commented 1 year ago

Task: Update trellisdata tests to check new job request features

pbilling commented 1 year ago

For testing, I'm using the sample name "test-sample-v1-3" in GCP Secret Manager in the test project.

pbilling commented 1 year ago

Status update: successfully launched first fastq-to-ubam job