theiagen / public_health_bioinformatics

Bioinformatics workflows for genomic characterization, submission preparation, and genomic epidemiology of pathogens of public health concern.
GNU General Public License v3.0
37 stars 17 forks source link

Upgrade to nextclade v3 & update default dataset tags #375

Closed kapsakcj closed 6 months ago

kapsakcj commented 7 months ago

Still have lots of work to do, but wanted to at least start a draft PR so folks can see progress.

This PR closes #349

🗑️ This dev branch should be deleted after merging to main. (subject to change)

:brain: Aim, Context and Functionality

This PR updates all TheiaCov workflows to use nextclade v3.

A new task was added for nextclade v3 specifically.

CI has undergone major updates to account for these changes AND to re-enable the theiacov_illumina_pe and theiacov_illumina_se CI workflows from running (they were disabled back in July 2023)

:hammer_and_wrench: Impacted Workflows/Tasks & Changes Being Made

All TheiaCov workflows are impacted across all organisms (sars-cov-2, Mpox, Flu, RSV-A, RSV-B) that we support and run through nextclade

This will affect the behavior of the workflow(s) even if users don’t change any workflow inputs relative to the last version : **Yes***

Running this workflow on different occasions could result in different results, e.g. due to use of a live database, "latest" docker image, or stochastic data processing : Yes

:clipboard: Workflow/Task Step Changes

🔄 Data Processing

Docker/software or software versions changed: upgraded to us-docker.pkg.dev/general-theiagen/nextstrain/nextclade:3.3.1 which was originally copied from Nextstrain's image on dockerhub

Databases or database versions changed: All nextclade_dataset_tags have been updated for all organisms

Data processing/commands changed: Details on changes from v2 ➡️ v3 can be found here: https://docs.nextstrain.org/projects/nextclade/en/stable/user/migration-v3.html#dataset-file-format-and-dataset-names-have-changed

_The dataset files qc.jsonprimers.csv and virus_properties.json are now merged into a new file pathogen.json._

_Dataset names have changed. There is no longer a separation to namereference and other attributes. The datasets are now uniquely identified by a path-like name, which corresponds to the path of the dataset in the [data repo](https://github.com/nextstrain/nextclade_data/tree/master/data)._

File processing changed: The nextclade_output_parsing has not changed. The output TSV of nextclade v3 was readly parsed by the existing WDL task

Compute resources changed: No

➡️ Inputs

⬅️ Outputs

:test_tube: Testing

Test Dataset

I am testing in Terra using our validation datasets that we used to validate new versions of PHB.

Commandline Testing with MiniWDL or Cromwell (optional)

Not going to show cmdline testing, everything will be tested in Terra

Terra Testing

Suggested Scenarios for Reviewer to Test

Need to test all organisms to ensure functionality exists for all organisms.

It seems a bit extreme, but I'd like to test all impacted workflows since I'd rather catch bugs now, instead of the 11th hour before the v2.0.0 release. SO I've created a terra workspace specifically for this testing https://app.terra.bio/#workspaces/theiagen-validations/PHB_Validation_nextcladeV3testing/job_history

Theiagen Version Release Testing (optional)

:microscope: Final Developer Checklist

🎯 Reviewer Checklist

🗂️ Associated Documentation (to be completed by Theiagen developer)

kapsakcj commented 6 months ago

I created a clone of the PHB_Validation_TEMPLATE workspace specifically for testing these changes: https://app.terra.bio/#workspaces/theiagen-validations/PHB_Validation_nextcladeV3testing/job_history

kapsakcj commented 6 months ago

Marking as ready for review to get the review process started.

I'm working on pulling a couple RSV-A and RSV-B genomes for testing (since the 2 I tested with didn't so well - failed to align well with the nextclade references) so I will test these and post a workflow link when it's completed.

2 RSV-A and 2 RSV-B samples tested via TheiaCoV_FASTA here: https://app.terra.bio/#workspaces/theiagen-validations/PHB_Validation_nextcladeV3testing/job_history/1da2324d-436e-4b60-a79e-04bceb62a6f7

Other things to note about this PR

sage-wright commented 6 months ago

Conflicts need resolution, but proceeding with tests anyways:

Failures happening because data got moved 😭

kevinlibuit commented 6 months ago

Details for days! That was a beast to review. Thanks for all of the documentation and specific workspace for validation runs. I've combed through all of your runs and feel confident in these code changes; also launched a few runs myself within the same workspace for sanity's sake.

Just two points to address:

  1. Made a few comments regarding input name changes/removals that seem a bit out of sync with respect to nextclade v3 migration guidance -- and because this will only impact optional parameters, I think we can verify things in running a single workflow rather than needing to re-run the whole testing dataset.
  2. I don't see a functional test for workflows/phylogenetics/wf_nextclade_addToRefTree.wdl. It's the same underlying changes being made so I don't anticipate any functional difference, but I would like to see a successful fun before merging this PR.

Once these are resolved we can move forward with a merge!

kapsakcj commented 6 months ago

Thanks for the thorough review, kevin.

_Made a few comments regarding input name changes/removals that seem a bit out of sync with respect to nextclade v3 migration guidance -- and because this will only impact optional parameters, I think we can verify things in running a single workflow rather than needing to re-run the whole testing dataset. I don't see a functional test for workflows/phylogenetics/wf_nextcladeaddToRefTree.wdl. It's the same underlying changes being made so I don't anticipate any functional difference, but I would like to see a successful fun before merging this PR.

I'll update CI first, then...

I'll launch a function test of the nextclade_addToRefTree workflow as well as TheiaCoV_FASTA_PHB on the various organisms in our validation dataset to ensure everything runs as expected.

EDIT:

kapsakcj commented 6 months ago

OK everything should be resolved, I think we are ready for approval and merge

kapsakcj commented 6 months ago

So FYI the nextclade run --input-pcr-primers option is available in v3.1.0 and above: https://github.com/nextstrain/nextclade/issues/1432#issuecomment-2037700068

Should we add it back to the WDL task or leave as is? I don't know of any of our users that utilize this option, so I'm tempted to leave it out.

kevinlibuit commented 6 months ago

So FYI the nextclade run --input-pcr-primers option is available in v3.1.0 and above: nextstrain/nextclade#1432 (comment)

Should we add it back to the WDL task or leave as is? I don't know of any of our users that utilize this option, so I'm tempted to leave it out.

That's fair. Let's get this closed and avoid scope creep; changes made in this PR address #349 for PHB users. Can you open a new issue for us to add that option back and we can discuss internally when to prioritize things