theiagen / public_health_bioinformatics

Bioinformatics workflows for genomic characterization, submission preparation, and genomic epidemiology of pathogens of public health concern.
GNU General Public License v3.0
34 stars 16 forks source link

[feature request] automatically convert spaces to dashes in workflows that accept strings #326

Closed kapsakcj closed 1 month ago

kapsakcj commented 5 months ago

:cool:

:pushpin: Explain the Request

It would save us all a bunch of time if our workflows like Augur_Prep, kSNP3, and others that accept input strings (like build_name for Augur_prep) are tolerant to spaces in them.

When users provide strings with spaces in them, e.g. my augur build , then the workflow fails due to the output files not being recognized by cromwell.

This will save us and our users much time in troubleshooting failed workflows due to pesky spaces.

A simple sed line to replace spaces with dashes - or underscores _ on the respective variable should do the trick. Likely will go at the top of some command blocks in various WDL tasks.

Off the top of my head I would like to address these workflows, though there may be more:

:books: Context

:chart_with_upwards_trend: Desired Behavior

:information_source: Additional Information

kapsakcj commented 2 months ago

Other String inputs that I think could benefit from this change:

AndrewLangvt commented 2 months ago

@kapsakcj here's all of the ones I was able to find. When you have a moment, would you mind selecting which ones you think truly need this fix? My gut is it's everything in the first section and (maybe) a subset of the second. Obviously, all of the corresponding tasks are where the work would actually need to happen, but these should serve as reasonable "entry points" to identify the workflows needing attention.

The below seem less likely candidates, but adding them here for review regardless:

kapsakcj commented 2 months ago

Nullarbor likely not necessary to update. I don't believe we support that workflow anymore

kapsakcj commented 2 months ago

You can skip all those in the 2nd/less-likely list. Those need to be a specific syntax & known before running the workflow in order for them to work properly.

GSURIs will never ever have spaces so they should fail if there's a typo.

expected_taxon in the context of NCBI amrfinderplus must be strings from this list here https://github.com/ncbi/amr/wiki/Running-AMRFinderPlus#--organism-option , so it should fail if there are spaces in the string. They HAVE to match for example Acinetobacter_baumannii

expected_taxon is also used to trigger organism-specific tasks in the merlin_magic subworkflow, so they will have to match the strings in the conditionals here: https://github.com/theiagen/public_health_bioinformatics/blob/64401efb8b4bd6f2cf4832e9d431354f9b88e12b/workflows/utilities/wf_merlin_magic.wdl#L92

AndrewLangvt commented 2 months ago

Ok, thanks much. So, now we're looking at 16 workflows in total that need modification. Thanks for the input here @kapsakcj

AndrewLangvt commented 1 month ago

Tested all workflows via miniwdl check and also ran through terra with the below results. All passing.