sample ids/file naming not fully interpreted for merging of paired end reads

cajwalsh commented 11 months ago

Possibly related to: #35

I had fastq files with the exact same names from two different sequencing runs (the first one had usable data but was considered a lower-quality/failed run, so they redid it with the exact same samples and names). Before putting files from both runs in the same directory, I needed to change the names of one set. I added "rerun" to the filename right before the extension.

An example of the reads belonging to one sample would be:

230929-MiFishU-COL-075-B3_S29_L001_R1_001.fastq.gz
230929-MiFishU-COL-075-B3_S29_L001_R2_001.fastq.gz
230929-MiFishU-COL-075-B3_S29_L001_R1_001_rerun.fastq.gz
230929-MiFishU-COL-075-B3_S29_L001_R2_001_rerun.fastq.gz

With this first set, AdapterRemoval failed as it was trying to merge runs from different places on the sequencer, and this was because eDNAFlow was pairing one of the files without "rerun" with one of the "rerun" files. This made me notice that eDNAFlow only looks at file names up to the R1 or R2 and not beyond that point. When I moved the "rerun" to just before L001, eDNAFlow paired the correct versions of each sample to merge. If this is not something that could or should be changed (to interpret all the way to the end of the file name, or at least to the extension), then adding a bit about it in the documentation or help would be useful.
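
To illustrate what "interpreting all the way to the end of the file name" could look like, here is a minimal Python sketch. It is not how eDNAFlow actually pairs files (the pipeline is written in Nextflow and I have not reproduced its logic); `pair_reads` and `READ_TOKEN` are hypothetical names, and the pattern assumes Illumina-style names where the read token appears as `_R1` or `_R2` followed by `_` or `.`.

```python
import re
from collections import defaultdict

# Assumed read token: "_R1" or "_R2" followed by "_" or "." (Illumina-style names).
READ_TOKEN = re.compile(r"_R([12])(?=[_.])")


def pair_reads(filenames):
    """Pair R1/R2 files using the whole file name, not just the part before R1/R2."""
    groups = defaultdict(dict)
    for name in filenames:
        match = READ_TOKEN.search(name)
        if match is None:
            continue  # skip files without a recognizable read token
        # The key keeps everything before AND after the read token, so a
        # suffix such as "_rerun" separates otherwise identical names.
        key = name[:match.start()] + "_R#" + name[match.end():]
        groups[key][match.group(1)] = name
    # Only return complete pairs (both R1 and R2 present), in sorted order.
    return sorted(
        (reads["1"], reads["2"])
        for reads in groups.values()
        if "1" in reads and "2" in reads
    )
```

Given the four example files above, this keeps the two original files together and the two "rerun" files together, because the text after the read token is part of the grouping key.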

mhoban commented 8 months ago

This is a good catch. I don't think it would be trivial to make the pipeline work with filenames like this, so I think I'll put a note in the documentation that you should keep your sequencing runs in separate directories (especially if they have largely overlapping filenames).
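
As a rough sketch of how one might check for overlapping filenames before combining runs (this is not part of the pipeline; `overlapping_names`, the directory arguments, and the default `*.fastq.gz` pattern are all assumptions for illustration):

```python
from pathlib import Path


def overlapping_names(run_a, run_b, pattern="*.fastq.gz"):
    """Return the sorted file names that appear in both run directories."""
    names_a = {path.name for path in Path(run_a).glob(pattern)}
    names_b = {path.name for path in Path(run_b).glob(pattern)}
    # Any name in both sets would collide if the runs shared one directory.
    return sorted(names_a & names_b)
```

An empty result means the two directories have no identically named files matching the pattern; a non-empty one lists the names that would clash.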

mhoban commented 8 months ago

Ok, I added a note to the readme. It's maybe not as prominent as it ought to be, but we can address that down the line.

mhoban / rainbow_bridge

sample ids/file naming not fully interpreted for merging of paired end reads #40