theiagen / public_health_bioinformatics

Bioinformatics workflows for genomic characterization, submission preparation, and genomic epidemiology of pathogens of public health concern.
GNU General Public License v3.0
34 stars 16 forks source link

pangolin TMPDIR add and CI updates & improvements #327

Closed kapsakcj closed 5 months ago

kapsakcj commented 5 months ago

Will fill this out later, just want to get the CI running to test these changes

This PR closes #324

πŸ—‘οΈ This dev branch should be deleted after merging to main.

:brain: Aim, Context and Functionality

This PR is spun out of the CI failures we were seeing with cromwell test workflows over in PR #305 just prior to the v1.3.0 release.

TL;DR is that Pangolin, and in particular scorpio which runs as part of pangolin was being fed super-long paths from cromwell. The python multiprocessing package which scorpio uses struggles with these long paths, leading to failures that we had not seen before.

So this PR serves 2 purposes:

  1. Update the Pangolin task to use TMPDIR=/tmp which leads to shorter paths, and force the usage of pangolin --tempdir /tmp option so that temporary file paths are shorter.
  2. Update our CI to account for the above changes AND also upgrade other CI components for general improvement (better alignment with Cromwell version that runs on Terra on GCP, squash warnings about node.js, update cromwell CI output file checks)

FOR THE FUTURE: we may want to switch the CI, specifically the CI that tests workflows via running cromwell, to run inside of a docker container instead of using a conda environment. This way we can stay in closer alignment with the version of Cromwell that is used on Terra on GCP. The published github releases lag behind the versions of cromwell used in Terra on GCP. So we could in theory run those workflows inside of this docker image: broadinstitute/cromwell:latest https://hub.docker.com/r/broadinstitute/cromwell/tags as it will mirror the version currently being used in Terra on GCP.

Not ideal to have a key dependency to the CI environment like this change unexpectedly, but if we want to reproduce/replicate cromwell behavior on Terra on GCP, it would be good for us to switch to using docker for running cromwell

:hammer_and_wrench: Impacted Workflows/Tasks & Changes Being Made

Changes being made

Impacted workflows

This will affect the behavior of the workflow(s) even if users don’t change any workflow inputs relative to the last version: No

Running this workflow on different occasions could result in different results, e.g. due to use of a live database, "latest" docker image, or stochastic data processing: No

:clipboard: Workflow/Task Step Changes

πŸ”„ Data Processing

Docker/software or software versions changed: No

Databases or database versions changed: No

Data processing/commands changed: Yes - temorary directory used by pangolin is now hardcoded to /tmp. Do not expect results to change, just the destination for reading/writing intermediate files used by pangolin.

File processing changed: No

Compute resources changed: No

➑️ Inputs

N/A

⬅️ Outputs

phb_version (output string of versioning task) now states PHB v1.3.0-main so that user on the main branch know that they are using main but the version of main downstream of v1.3.0 release. This change was also made on purpose to trigger TheiaProk CI workflows.

:test_tube: Testing

Test Dataset

Will test with SARS-CoV-2 samples on various TheiaCov workflows.

Commandline Testing with MiniWDL or Cromwell (optional)

did not test locally, mainly want to see how these workflow changes perform in Terra and in the CI workflows

Terra Testing

Will update with tests soon.

Suggested Scenarios for Reviewer to Test

Test any TheiaCov workflows with sars-cov-2 samples as it will run the pangolin task.

Theiagen Version Release Testing (optional)

:microscope: Final Developer Checklist

🎯 Reviewer Checklist

πŸ—‚οΈ Associated Documentation (to be completed by Theiagen developer)

kapsakcj commented 5 months ago

βœ… 's across the board πŸ˜„

sage-wright commented 5 months ago

Tested Pangolin and found no changes here. Excited to merge this one.

kapsakcj commented 5 months ago

Tests launched in Terra:

EDIT: all workflows ran successfully and produced outputs as expected βœ