theiagen / public_health_bioinformatics

Bioinformatics workflows for genomic characterization, submission preparation, and genomic epidemiology of pathogens of public health concern.
GNU General Public License v3.0
34 stars 16 forks source link

Samples_To_Ref_Tree_PHB: changed "organism" input to "nextclade_dataset_name" #303

Closed jrotieno closed 6 months ago

jrotieno commented 6 months ago

This PR closes #<302>.

๐Ÿ—‘๏ธ This dev branch should be deleted after merging to main.

:brain: Aim, Context and Functionality

This change is to avoid confusion between Theiagen's organism input for the various workflows and what was named as organism input in the Samples_To_Ref_Tree_PHB workflow, but is ideally the nextclade_dataset_name. For example, flu is accepted as organism input for the TheiaCoV workflows, but what is expected as the organism input in Samples_To_Ref_Tree_PHB is flu_yam_ha, or flu_h1n1pdm_ha, etc, depending on the subtype and segment.

:hammer_and_wrench: Impacted Workflows/Tasks & Changes Being Made

This will affect the behavior of the workflow(s) even if users donโ€™t change any workflow inputs relative to the last version : No

Running this workflow on different occasions could result in different results, e.g. due to use of a live database, "latest" docker image, or stochastic data processing : No

:clipboard: Workflow/Task Step Changes

๐Ÿ”„ Data Processing

Docker/software or software versions changed: No

Databases or database versions changed: No

Data processing/commands changed: No

File processing changed: No

Compute resources changed: No

โžก๏ธ Inputs

Input name changed from organism to nextclade_dataset_name

โฌ…๏ธ Outputs

No Changes

:test_tube: Testing

Test Dataset

A multifasta dataset comprising of nextclade pathogens/datasets, i.e. seasonal iInfluenza A H1N1 and H3N2, Influenza B Yamagata and Victoria, RSV-A and RSV-B, MPXV and SARS-CoV-2.

Commandline Testing with MiniWDL or Cromwell (optional)

Not undertaken/not necessary.

Terra Testing

All samples successful as expected. https://app.terra.bio/#workspaces/cdph-terrabio-taborda-manual/Global_tree_testing/job_history/7c66ad72-086f-4f72-b825-3cc6e71527d1

Suggested Scenarios for Reviewer to Test

Test with inaccurate nextclade_dataset_name!

Theiagen Version Release Testing (optional)

:microscope: Final Developer Checklist

๐ŸŽฏ Reviewer Checklist

๐Ÿ—‚๏ธ Associated Documentation (to be completed by Theiagen developer)

cimendes commented 6 months ago

In TheiaCoV the nextclade_dataset_name refers to the Nextclade organism dataset, which is a controlled vocabulary. I'm afraid that if we name this input dataset_name it's going to cause confusion if one is used to running TheiaCoV, which seems likely.

I don't have a better name for the input organism so I'm going to wait for the next sync with the team to discuss this a bit further. :) Sorry @jrotieno! Right now I don't have a better alternative.

Code changes look solid ๐Ÿ‘๐Ÿป

jrotieno commented 6 months ago

Good point, the simple solution is to call it nextclade_dataset_name for uniformity. Let me do that.

cimendes commented 6 months ago

โœ… Testing 10 Flu HA samples in https://app.terra.bio/#workspaces/theiagen-validations/Theiagen_Mendes_Sandbox/job_history/945bfc27-c46a-4d6d-858a-429947399cc5