naobservatory / mgs-workflow


Improve flexibility of input files #16

Open willbradshaw opened 1 month ago

willbradshaw commented 1 month ago

Like the v1 pipeline, this workflow assumes that the input FASTQ files are named PREFIX_1.fastq.gz and PREFIX_2.fastq.gz. This is convenient to write and configure but insufficiently flexible, especially since many input read files are suffixed R1 and R2 rather than _1 and _2.

nf-core pipelines require a samplesheet that fully specifies the path to each read file. This seems like overkill, but hopefully there's something intermediate that would work well for our use case.

jeffkaufman commented 1 month ago

Renaming input files to have the suffixes we expect when importing them seems fine? There are a lot of formats we could receive data in (interleaved fastq, bam, nanopore squiggles), and requiring a little munging before data goes into the pipeline doesn't seem too bad to me.
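For the R1/R2 case specifically, the munging could be as small as the sketch below. It assumes files ending in _R1.fastq.gz or _R2.fastq.gz should be exposed as _1.fastq.gz and _2.fastq.gz; normalized_name and link_normalized are hypothetical helper names, not part of the workflow, and symlinking rather than renaming is just one option.

```shell
#!/usr/bin/env bash
# Sketch only: both helpers are hypothetical, not part of mgs-workflow.

# Map NAME_R1.fastq.gz / NAME_R2.fastq.gz to NAME_1.fastq.gz / NAME_2.fastq.gz.
# Names already ending in _1 or _2 (or not matching at all) come back unchanged.
normalized_name() {
  printf '%s\n' "$1" | sed -E 's/_R?([12])\.fastq\.gz$/_\1.fastq.gz/'
}

# Symlink every *.fastq.gz in directory $1 into directory $2 under its
# normalized name. $1 should be an absolute path so the links resolve.
link_normalized() {
  local src=$1 dst=$2 f
  mkdir -p "$dst"
  for f in "$src"/*.fastq.gz; do
    [ -e "$f" ] || continue   # the glob matched nothing
    ln -s "$f" "$dst/$(normalized_name "${f##*/}")"
  done
}
```

Symlinks leave the original files untouched, so the import step can be rerun or undone by deleting the destination directory.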

willbradshaw commented 1 month ago

I think at a minimum we should be able to handle R1 and R2.

willbradshaw commented 1 week ago

This is not handled in the new pipeline version currently in the dev branch, so remains an active issue.

willbradshaw commented 1 week ago

Further discussion here: https://github.com/naobservatory/mgs-workflow/issues/24

mikemc commented 1 day ago

Note: The below works when running .command.sh from the command line in the work directory, but fails mysteriously when running with nextflow run. Possibly activating extended globbing breaks nextflow.

To handle either _R1.fastq.gz or _1.fastq.gz, we can enable extended globbing and use the ?(R) pattern in the CONCAT_GZIPPED script:

# Enable extended globbing so that ?(R) matches zero or one "R"
shopt -s extglob
# Get file paths from library IDs
r1=""
r2=""
for l in !{libraries.join(" ")}; do
  L1=$(ls ${read_dir}/*${l}*_?(R)1.fastq.gz)
  L2=$(ls ${read_dir}/*${l}*_?(R)2.fastq.gz)
  r1="${r1} ${L1}"
  r2="${r2} ${L2}"
done
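If extended globbing really is what breaks under nextflow run, one thing worth trying is two ordinary globs per read number, which need no shell options at all. A minimal sketch, not tested inside the pipeline: find_reads is a hypothetical helper name, and the directory, library ID, and read number are passed as arguments instead of coming from the Nextflow variables.

```shell
#!/usr/bin/env bash
# Sketch only: find_reads is a hypothetical helper, not part of mgs-workflow.

# Print the files in directory $1 for library ID $2 and read number $3 (1 or 2),
# accepting either the _R1/_R2 or the _1/_2 suffix convention.
find_reads() {
  local dir=$1 lib=$2 n=$3 f
  # Two plain globs instead of the extglob pattern _?(R)N
  for f in "$dir"/*"$lib"*_R"$n".fastq.gz "$dir"/*"$lib"*_"$n".fastq.gz; do
    # A glob with no matches is left as literal text, so skip non-existent paths
    if [ -e "$f" ]; then
      printf '%s\n' "$f"
    fi
  done
}
```

Called as find_reads "$read_dir" "$l" 1, it prints one path per line, or nothing when no file matches, so a missing library shows up as empty output and not as a non-zero exit from ls.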

Files output by Illumina software often have an additional field after the R#, e.g. {sample-name}_S3_L001_R1_001.fastq.gz. The following extended glob seems to work, but it looks a little ugly; a regex solution might be nicer.

L1=$(ls ${read_dir}/*${l}*_?(R)1?(_[[:digit:]][[:digit:]][[:digit:]]).fastq.gz)
L2=$(ls ${read_dir}/*${l}*_?(R)2?(_[[:digit:]][[:digit:]][[:digit:]]).fastq.gz)
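One way the regex version might look is sketched below, using grep -E on each file's basename so that no shell options are involved. The pattern is meant to accept the same suffixes as the extended glob above (_1, _R1, and either of those followed by an underscore and three digits). find_reads_regex is a hypothetical helper name, and this has not been run inside the pipeline.

```shell
#!/usr/bin/env bash
# Sketch only: find_reads_regex is a hypothetical helper, not part of mgs-workflow.

# Print the files in directory $1 whose names contain library ID $2 and end in
# _N, _RN, _N_ddd, or _RN_ddd before .fastq.gz, where N is the read number $3
# and ddd is exactly three digits.
find_reads_regex() {
  local dir=$1 lib=$2 n=$3 f
  for f in "$dir"/*"$lib"*.fastq.gz; do
    [ -e "$f" ] || continue   # the glob matched nothing
    # Test only the basename so characters in the directory path cannot interfere
    if printf '%s\n' "${f##*/}" | grep -Eq "_R?${n}(_[0-9]{3})?\.fastq\.gz\$"; then
      printf '%s\n' "$f"
    fi
  done
}
```

Because the pattern is anchored at the end of the name, the lane field (L001) and sample field (S3) in Illumina-style names do not affect which read number a file is assigned to.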