theiagen / public_health_bioinformatics

Bioinformatics workflows for genomic characterization, submission preparation, and genomic epidemiology of pathogens of public health concern.
GNU General Public License v3.0

[New Utility] Workflow to rename FASTQ files (non-destructive) #267

Closed cimendes closed 10 months ago

cimendes commented 10 months ago

Closes #266

:hammer_and_wrench: Changes Being Made

This PR implements a new utility workflow as requested in #266. The workflow takes a single read file or a pair of read files (FASTQ), compressed or uncompressed, and returns renamed, compressed FASTQ file(s) for submission to GISAID.

Impacted Workflows/Tasks

None, this is a new implementation

:brain: Context and Rationale

Requested by SFPHL

:clipboard: Workflow/Task Steps

This is a sample-level workflow. If a reverse read (read2) is provided, the files are renamed using the new_filename input, as <new_filename>_R1.fastq.gz and <new_filename>_R2.fastq.gz. If only read1 is provided, the file is renamed to <new_filename>.fastq.gz. Uncompressed input files are compressed automatically by the workflow.
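As a rough illustration of the single-end case described above (rename, and compress only when needed), a task could look something like the sketch below. This is not the code in this PR: the task name `rename_SE_files_sketch`, the `docker` input, and the `ubuntu:22.04` image are assumptions, as is the choice to detect compression with `gzip -t` rather than by file extension.

```wdl
version 1.0

# Hypothetical sketch of a single-end rename task; not the PR's actual code.
task rename_SE_files_sketch {
  input {
    File read1
    String new_filename
    # Assumed image; any image providing bash and gzip should work
    String docker = "ubuntu:22.04"
  }
  command <<<
    set -euo pipefail
    # gzip -t succeeds only if the input is a valid gzip stream
    if gzip -t "~{read1}" 2>/dev/null; then
      # Already compressed: copy under the new name, leaving the input untouched
      cp "~{read1}" "~{new_filename}.fastq.gz"
    else
      # Uncompressed: write a compressed copy under the new name
      gzip -c "~{read1}" > "~{new_filename}.fastq.gz"
    fi
  >>>
  output {
    File read1_renamed = "~{new_filename}.fastq.gz"
  }
  runtime {
    docker: docker
  }
}
```

Because the input is copied or re-compressed into a new file rather than moved, the original read file is left as it was, which matches the "non-destructive" intent in the PR title.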

Inputs

  input {
    File read1  
    File? read2
    String new_filename
  }

read1: Mandatory input; Forward-facing or single-end reads, compressed or uncompressed

read2: Optional input; Reverse-facing reads, compressed or uncompressed

new_filename: Mandatory input; String with new name for read files

Outputs

  output {
    String rename_fastq_files_version = version_capture.phb_version
    String rename_fastq_files_analysis_date = version_capture.date
    File read1_renamed = select_first([rename_PE_files.read1_renamed, rename_SE_files.read1_renamed])
    File? read2_renamed = rename_PE_files.read2_renamed
  }

rename_fastq_files_version: version of PHB used to run this workflow

rename_fastq_files_analysis_date: date of renaming

read1_renamed: Forward-facing or single-end reads. Always present

read2_renamed: Reverse-facing reads. Only present if read2 is provided
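The `select_first` in the output block works because anything produced inside a WDL `if` block is optional when seen from outside that block. The toy workflow below shows the pattern with strings only, so it runs without any tasks or files. The workflow name `optional_output_demo` and the `read2_name` input are hypothetical stand-ins, and this is only a guess at how the real workflow is wired.

```wdl
version 1.0

# Names-only illustration of the conditional + select_first pattern.
# Hypothetical workflow; it handles strings, not real FASTQ files.
workflow optional_output_demo {
  input {
    String new_filename
    # Stand-in for the optional reverse read
    String? read2_name
  }

  # Paired-end branch: both values are String? outside this block
  if (defined(read2_name)) {
    String pe_read1 = new_filename + "_R1.fastq.gz"
    String pe_read2 = new_filename + "_R2.fastq.gz"
  }

  # Single-end branch: mutually exclusive with the block above
  if (!defined(read2_name)) {
    String se_read1 = new_filename + ".fastq.gz"
  }

  output {
    # Exactly one branch ran, so select_first always finds a defined value
    String read1_renamed = select_first([pe_read1, se_read1])
    # Stays undefined when the paired-end branch was skipped
    String? read2_renamed = pe_read2
  }
}
```

Since the two conditions are mutually exclusive and exhaustive, `read1_renamed` can be declared non-optional, while `read2_renamed` has to remain optional.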

Impacted Outputs

None, new workflow

:test_tube: Testing

Locally

Paired-end reads (compressed):

image

Single-end reads (compressed):

image

Single-end reads (uncompressed):

image

Terra

Underway

22 Samples PE, Compressed: https://app.terra.bio/#workspaces/theiagen-validations/Theiagen_Mendes_Sandbox/job_history/ff1cda0d-1c92-4fc0-8b5e-7001d520cfe6

Scenarios for Reviewer to Test

:microscope: Quality checks

Pull Request (PR) checklist:

cimendes commented 10 months ago

> Looks good, but would drop the single end version, since r2 is already optional and you're checking for its existence in the PE version of the task anyways.

The reason why I introduced the SE task was because I couldn't find a smart way to have the output be .fastq.gz or _1.fastq.gz for the forward read... Maybe we can think of something smart together? :)
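One possible way to fold both cases into a single task is to compute the forward-read suffix as a WDL declaration before the command runs, so the output expression can reuse it. The sketch below is only a suggestion under assumptions: the task name `rename_fastq_files_sketch`, the `docker` input and image, the `rename_one` shell helper, and the `_R1`/`_R2` notation (taken from the PR description) are all illustrative. It also relies on the execution engine treating a missing optional `File?` output as undefined.

```wdl
version 1.0

# Hypothetical unified task; names and docker image are assumptions.
task rename_fastq_files_sketch {
  input {
    File read1
    File? read2
    String new_filename
    String docker = "ubuntu:22.04"
  }

  # Suffix for the forward read: "_R1" for paired-end, empty for single-end
  String read1_suffix = if defined(read2) then "_R1" else ""

  command <<<
    set -euo pipefail

    # Copy a gzip file as-is, or compress a plain file, under a new name
    rename_one() {
      if gzip -t "$1" 2>/dev/null; then
        cp "$1" "$2"
      else
        gzip -c "$1" > "$2"
      fi
    }

    rename_one "~{read1}" "~{new_filename}~{read1_suffix}.fastq.gz"

    # An undefined optional File interpolates as an empty string
    if [[ -n "~{read2}" ]]; then
      rename_one "~{read2}" "~{new_filename}_R2.fastq.gz"
    fi
  >>>
  output {
    File read1_renamed = "~{new_filename}~{read1_suffix}.fastq.gz"
    # Left undefined when no reverse read was written
    File? read2_renamed = "~{new_filename}_R2.fastq.gz"
  }
  runtime {
    docker: docker
  }
}
```

With this shape the workflow would need only one call and no `select_first`, at the cost of putting the single-end/paired-end decision inside the task.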