Open willbradshaw opened 1 month ago
Renaming input files to have the suffixes we expect when importing them seems fine? There are a lot of formats we could receive data in (interleaved fastq, bam, nanopore squiggles), and requiring a little munging before data goes into the pipeline doesn't seem too bad to me.
I think at a minimum we should be able to handle R1 and R2.
This is not handled in the new pipeline version currently in the dev branch, so remains an active issue.
Further discussion here: https://github.com/naobservatory/mgs-workflow/issues/24
Note: The below works when running `.command.sh` from the command line in the work directory, but fails mysteriously when running with `nextflow run`. Possibly activating extended globbing breaks nextflow.
To handle either `_R1.fastq.gz` or `_1.fastq.gz`, we can enable extended globbing and use the `?(R)` pattern in the CONCAT_GZIPPED script:
shopt -s extglob
# Get file paths from library IDs
r1=""
r2=""
for l in !{libraries.join(" ")}; do
L1=$(ls ${read_dir}/*${l}*_?(R)1.fastq.gz)
L2=$(ls ${read_dir}/*${l}*_?(R)2.fastq.gz)
r1="${r1} ${L1}"
r2="${r2} ${L2}"
done
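Since extended globbing seems to misbehave under `nextflow run`, one possible workaround is to skip `extglob` and list both suffix variants as ordinary globs. The sketch below is plain shell and has not been tested inside a Nextflow process. `match_reads` is a hypothetical helper name, not something that exists in the pipeline.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the pipeline): print the read files for one
# library ID and read number, accepting both _1/_2 and _R1/_R2 suffixes,
# without relying on extglob.
match_reads() {
    local read_dir="$1" lib="$2" read_num="$3" f
    # Two ordinary globs stand in for the extglob pattern _?(R)N.fastq.gz.
    for f in "${read_dir}"/*"${lib}"*_"${read_num}".fastq.gz \
             "${read_dir}"/*"${lib}"*_R"${read_num}".fastq.gz; do
        # A glob with no matches is left as a literal string, so skip
        # anything that is not an existing file.
        if [ -e "$f" ]; then
            printf '%s\n' "$f"
        fi
    done
}
```

Inside the loop over libraries this could be called as `L1=$(match_reads "${read_dir}" "${l}" 1)`, and likewise with `2` for the reverse reads.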
Files output by Illumina software often have an additional field after the R#, e.g. `{sample-name}_S3_L001_R1_001.fastq.gz`. The following extended glob seems to work, but it looks a little ugly, and a regex solution might be nicer.
L1=$(ls ${read_dir}/*${l}*_?(R)1?(_[[:digit:]][[:digit:]][[:digit:]]).fastq.gz)
L2=$(ls ${read_dir}/*${l}*_?(R)2?(_[[:digit:]][[:digit:]][[:digit:]]).fastq.gz)
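For comparison, a regex version of the same match might look like the sketch below. It uses `grep -E` on each file's basename, so it does not need `extglob`. `find_reads` is a hypothetical helper name, and the sketch assumes library IDs contain no regex metacharacters. It has not been tried inside a Nextflow process.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the pipeline): print the read files for one
# library ID and read number, matching by regex instead of an extended glob.
find_reads() {
    local read_dir="$1" lib="$2" read_num="$3" f pattern
    # Optional "R" before the read number and an optional Illumina-style
    # three-digit field (e.g. _001) after it.
    # Assumes the library ID contains no regex metacharacters.
    pattern="${lib}"'.*_R?'"${read_num}"'(_[0-9]{3})?\.fastq\.gz$'
    for f in "${read_dir}"/*.fastq.gz; do
        # Skip the literal pattern left behind when the glob matches nothing.
        [ -e "$f" ] || continue
        # Test the regex against the basename only.
        if printf '%s\n' "${f##*/}" | grep -Eq -- "$pattern"; then
            printf '%s\n' "$f"
        fi
    done
}
```

As with the glob version, the library ID can match anywhere in the file name, so `find_reads "${read_dir}" "${l}" 1` should select the same files as the `?(R)1?(_[[:digit:]][[:digit:]][[:digit:]])` pattern above.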
Like the v1 pipeline, this workflow assumes that the input FASTQ files are named `PREFIX_1.fastq.gz` and `PREFIX_2.fastq.gz`. This is convenient to write and configure but insufficiently flexible, especially since many input read files are suffixed `R1` and `R2` rather than `_1` and `_2`.
The nf-core pipeline requires a samplesheet that fully specifies the path to each read file. This seems like overkill, but hopefully there's something intermediate that would work well for our use case.