Certain ID schemes can become mixed up with barcodes during sorting

ebolyen commented 5 years ago

Bug Description

Discovered by Nick Youngblut on the forum.

If IDs follow a scheme that overlaps with barcode segments (e.g. "idX_Y" and just "idX"), the sort orders of the forward and reverse reads may no longer match if the barcode IDs differ between forward and reverse reads (as is the case when a fastq manifest format is used, or when the barcodes are bases, etc.).

Steps to reproduce the behavior

Consider the sorting on these IDs:

# file_name: number_of_sequences
./V130_166_L001_R1_001.fastq.gz: 295
./V130_2_167_L001_R1_001.fastq.gz: 6069
./V130_2_743_L001_R2_001.fastq.gz: 6069
./V130_742_L001_R2_001.fastq.gz: 295

This will spuriously pair up V130-forward and V130_2-reverse. Note that the sample ID is V130 NOT V130_166 as it might initially appear.
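The mismatch can be reproduced outside of DADA2 with a minimal sketch. It assumes the two directories are each sorted with a plain lexicographic string sort and then paired by position; whether the R side collates exactly this way is an assumption, and only the filenames come from the report above.

```python
# Filenames copied from the report above. Sorting each direction
# independently mimics two directories being listed and paired by position.
forward = sorted([
    "V130_2_167_L001_R1_001.fastq.gz",
    "V130_166_L001_R1_001.fastq.gz",
])
reverse = sorted([
    "V130_742_L001_R2_001.fastq.gz",
    "V130_2_743_L001_R2_001.fastq.gz",
])

# Pair by position, as a sort-then-zip approach would.
pairs = list(zip(forward, reverse))
for fwd, rev in pairs:
    print(fwd, "<->", rev)
```

In the forward list "V130_166" sorts before "V130_2_167" because "1" < "2", while in the reverse list "V130_2_743" sorts before "V130_742" because "2" < "7", so the first forward file (sample V130) lands next to the first reverse file (sample V130_2).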

This results in an error like so:

Error in (function (fn, fout, maxN = c(0, 0), truncQ = c(2, 2), truncLen = c(0,  :
  Mismatched forward and reverse sequence files: 6069, 295.

Expected behavior

For V130_166_L001_R1_001.fastq.gz to be paired with V130_742_L001_R2_001.fastq.gz and so on.

Questions

Is there a better way to provide paired-end data to DADA2? Right now we are providing two directories which DADA2 must be sorting and pairing.
If not, should this bug be moved upstream?
Can we confirm that the MANIFEST file in the data directory has the correct information, if so, could we use that instead of the filenames to determine pairing?

References

https://forum.qiime2.org/t/qiime-dada2-denoise-paired-naming-bug/6385

cc @benjjneb

ebolyen commented 5 years ago

Sorry, it looks like the inputs are actually vectors, so we can arrange them as needed in the script.

Oddant1 commented 4 years ago

I'm not sure I fully understand the issue. It looks/sounds to me like we have a situation along the lines of this:

There is one directory containing forward reads and one containing reverse reads. These directories are both sorted. The files within the directories are matched up pairwise under the assumption that the forward and reverse reads will have sorted in the same order. In this instance we have a forward read directory that goes 130 then 130_2 because 130_166 sorts above 130_2, but in our reverse read directory we have 130_2 then 130 because 130_2 sorts above 130_742?

If that's the case, ensuring we get proper sorting in all cases sounds like a real pain. I don't 100% know the naming conventions for these files, but I guess we could split on _ and use the resulting elements to aid in sorting? Like in this situation the 130_2 IDs would have one more section than the 130 IDs, so maybe we just put them at the bottom? And then if we also had 130_3 or 130_4 or whatever, we could do a secondary sort on that second element?
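A rough sketch of that split-on-underscore idea, in Python purely for illustration. The helper name `section_sort_key` is hypothetical, and the approach assumes every filename has the same number of trailing fields after the sample ID and that the barcode field itself contains no underscore.

```python
def section_sort_key(filename):
    """Hypothetical sort key: fewer '_' sections first, then compare sections."""
    parts = filename.split("_")
    # IDs with an extra section (e.g. V130_2) sort after shorter IDs (V130);
    # ties are broken by comparing the sections left to right as strings.
    return (len(parts), parts)


forward = sorted(
    ["V130_2_167_L001_R1_001.fastq.gz", "V130_166_L001_R1_001.fastq.gz"],
    key=section_sort_key,
)
reverse = sorted(
    ["V130_2_743_L001_R2_001.fastq.gz", "V130_742_L001_R2_001.fastq.gz"],
    key=section_sort_key,
)

pairs = list(zip(forward, reverse))
for fwd, rev in pairs:
    print(fwd, "<->", rev)
```

With this key both directions put the V130 file first and the V130_2 file second. Sections are compared as strings, so a numeric secondary sort (130_10 after 130_9) would need extra handling.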

ebolyen commented 4 years ago

You are understanding the problem correctly. What we need to do is find a reasonable way to provide the order to the R script. For instance passing and parsing our internal manifest in R (CSV parsing in R is a bit idiosyncratic, but it's definitely workable). Then using that to define the order/pass the reads into DADA2. Basically we need to move some business logic into R to make sure the samples are paired up correctly.
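The manifest-driven pairing could look roughly like the sketch below. It is written in Python rather than R, only to illustrate the logic. The manifest contents are made up from the filenames in this issue, the column names (sample-id, filename, direction) are an assumption about the MANIFEST layout, and `pair_from_manifest` is a hypothetical helper.

```python
import csv
import io

# Hypothetical manifest text; the column names are an assumption.
MANIFEST = """\
sample-id,filename,direction
V130,V130_166_L001_R1_001.fastq.gz,forward
V130_2,V130_2_167_L001_R1_001.fastq.gz,forward
V130_2,V130_2_743_L001_R2_001.fastq.gz,reverse
V130,V130_742_L001_R2_001.fastq.gz,reverse
"""


def pair_from_manifest(manifest_text):
    """Return (sample_id, forward_file, reverse_file) tuples ordered by sample ID."""
    by_sample = {}
    for row in csv.DictReader(io.StringIO(manifest_text)):
        # Group filenames by sample ID, keyed by read direction.
        by_sample.setdefault(row["sample-id"], {})[row["direction"]] = row["filename"]
    # A sample missing either direction raises KeyError here, which
    # surfaces an incomplete pair instead of silently mispairing it.
    return [
        (sample_id, files["forward"], files["reverse"])
        for sample_id, files in sorted(by_sample.items())
    ]


pairs = pair_from_manifest(MANIFEST)
for sample_id, fwd, rev in pairs:
    print(sample_id, fwd, rev)
```

Because pairing is keyed on the sample ID column, the barcode segment in the filename never takes part in the ordering.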

An alternative would be to rename the files so that only the sample IDs are present, preventing the barcode segment (the part after the last _ of the ID in your example, e.g. the 166 in V130_166) from breaking the ordering between forward and reverse.
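A sketch of recovering only the sample ID from each filename, which is the piece a rename would need. It assumes exactly four underscore-separated fields (barcode, lane, read direction, set number plus extension) follow the sample ID, as in the filenames in this issue; `sample_id_from_filename` is a hypothetical helper.

```python
def sample_id_from_filename(filename):
    """Hypothetical helper: strip the four trailing '_' fields to get the sample ID."""
    # Assumed layout: <sample-id>_<barcode>_<lane>_<read>_<set>.fastq.gz
    return filename.rsplit("_", 4)[0]


forward = sorted(
    ["V130_2_167_L001_R1_001.fastq.gz", "V130_166_L001_R1_001.fastq.gz"],
    key=sample_id_from_filename,
)
reverse = sorted(
    ["V130_2_743_L001_R2_001.fastq.gz", "V130_742_L001_R2_001.fastq.gz"],
    key=sample_id_from_filename,
)

for fwd, rev in zip(forward, reverse):
    # Each positional pair now shares a sample ID.
    print(sample_id_from_filename(fwd), fwd, rev)
```

Since the sample ID is taken from the right-hand side of the name, IDs that themselves contain underscores (such as V130_2) stay intact.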

Really we can do anything here between the R and Python boundary, we just need to do something...

qiime2 / q2-dada2

Certain ID schemes can become mixed up with barcodes during sorting #102