theiagen / public_health_bioinformatics

Bioinformatics workflows for genomic characterization, submission preparation, and genomic epidemiology of pathogens of public health concern.
GNU General Public License v3.0

[TheiaCoV_Fasta_Batch] Substitute FASTA concatenating task to ensure proper sample_id propagation #274

Closed cimendes closed 10 months ago

cimendes commented 10 months ago

Closes #261

:hammer_and_wrench: Changes Being Made

This PR introduces a new task to concatenate FASTA files, in which the array of samplenames is passed along with the FASTA files for correct sample_id propagation. This is important for cases where the FASTA header doesn't match the assigned sample_id, such as with GISAID FASTA files.

Impacted Workflows/Tasks

TheiaCoV_FASTA_Batch

:brain: Context and Rationale

None to be considered.

:clipboard: Workflow/Task Steps

The new cat_files_fasta task has been integrated into the TheiaCoV_Fasta_Batch workflow. It is not used by any other workflows.
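For reviewers who want to see the idea outside of WDL, here is a rough Python sketch of what a task like cat_files_fasta needs to do. It is not the code in this PR: the function name `cat_fasta_with_sample_ids` and its parameters are hypothetical, and it assumes each input FASTA holds a single sequence. Each header is replaced with the matching sample_id, and every line is rewritten with its own newline so that a file lacking a trailing newline cannot fuse its last sequence line with the next header.

```python
from typing import Sequence


def cat_fasta_with_sample_ids(
    fasta_paths: Sequence[str],
    sample_ids: Sequence[str],
    out_path: str,
) -> None:
    """Concatenate single-record FASTA files, renaming each header.

    Hypothetical helper for illustration only; assumes one sequence per file
    and that fasta_paths and sample_ids are in the same order.
    """
    if len(fasta_paths) != len(sample_ids):
        raise ValueError("fasta_paths and sample_ids must have the same length")

    with open(out_path, "w") as out:
        for path, sample_id in zip(fasta_paths, sample_ids):
            with open(path) as handle:
                for raw_line in handle:
                    # Normalise line endings and skip blank lines.
                    line = raw_line.rstrip("\r\n")
                    if not line:
                        continue
                    if line.startswith(">"):
                        # Replace the original header (e.g. a GISAID-style
                        # name) with the assigned sample_id.
                        out.write(f">{sample_id}\n")
                    else:
                        # Always terminate sequence lines explicitly, even if
                        # the input file had no trailing newline.
                        out.write(line + "\n")
```

Passing two parallel arrays (files and samplenames) keeps the call simple, at the cost of relying on both arrays being in the same order.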

Inputs

None added.

Outputs

No outputs were altered.

Impacted Outputs

No outputs were altered.

:test_tube: Testing

Underway!

Locally

image

Terra

https://app.terra.bio/#workspaces/theiagen-validations/Theiagen_Mendes_Sandbox/job_history/ad02cccf-63e7-4bc6-9eb6-c4b430113260

Scenarios for Reviewer to Test

:microscope: Quality checks

Pull Request (PR) checklist:

cimendes commented 10 months ago

This works great! One thing we might want to consider is removing the long Array[Pair[String, File]] object, since it's no longer necessary, and just passing in the array of samplenames instead. That would simplify the code a bit and I think it would be worth it.

Once that's done, I'll approve & merge. Well done! ⭐

All done! :D And thank you for finding and squishing the newline bug! 😍