Closed. ebolyen closed this issue 1 year ago.
Sorry, it looks like the inputs are actually vectors, so we can arrange them as needed in the script.
I'm not sure I fully understand the issue. It looks/sounds to me like we have a situation along the lines of this:
There is one directory containing forward reads and one containing reverse reads. These directories are both sorted. The files within the directories are matched up pairwise under the assumption that the forward and reverse reads will have sorted in the same order. In this instance we have a forward read directory that goes 130 then 130_2 because 130_166 sorts above 130_2, but in our reverse read directory we have 130_2 then 130 because 130_2 sorts above 130_742?
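To make the mismatch concrete, here is a small Python sketch of the sorting described above. The two `V130` filenames come from the issue description; the `V130_2` filenames and their barcode segments (`167`, `743`) are made up for illustration, and `sample_id` is a hypothetical helper that assumes the `<sample-id>_<barcode-id>_<lane>_<read>_<set>.fastq.gz` naming scheme.

```python
# Forward and reverse filenames. The V130_2 names are hypothetical.
forward = [
    "V130_166_L001_R1_001.fastq.gz",
    "V130_2_167_L001_R1_001.fastq.gz",
]
reverse = [
    "V130_742_L001_R2_001.fastq.gz",
    "V130_2_743_L001_R2_001.fastq.gz",
]


def sample_id(filename):
    # Assumption: the sample ID is everything before the last four
    # "_"-separated fields (barcode, lane, read, set).
    return filename.rsplit("_", 4)[0]


# Sort each direction independently, then pair by position.
pairs = list(zip(sorted(forward), sorted(reverse)))

# Collect the pairs whose sample IDs disagree.
mismatched = [(f, r) for f, r in pairs if sample_id(f) != sample_id(r)]
```

With these names, the forward list keeps `V130` first, while the reverse list puts `V130_2` first, so every positional pair ends up in `mismatched`.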
If that's the case, ensuring we get proper sorting in all cases sounds like a real pain. I don't 100% know the naming conventions for these files, but I guess we could split on _ and use the resulting elements to aid in sorting? Like in this situation the 130_2 ids would have one more section than the 130 ids, so maybe we just put them at the bottom? And then if we also had 130_3 or 130_4 or whatever, we could do a secondary sort on that second element?
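One way the split-on-`_` idea could look, sketched in Python rather than R: sort each direction on the sample ID alone, so the barcode segment never influences the order. This assumes filenames follow `<sample-id>_<barcode-id>_<lane>_<read>_<set>.fastq.gz`; `pairing_key` and `pair_reads` are hypothetical names, not existing plugin functions.

```python
def pairing_key(filename):
    # Assumption: the last four "_"-separated fields are barcode,
    # lane, read and set; whatever precedes them is the sample ID.
    fields = filename.rsplit("_", 4)
    if len(fields) != 5:
        raise ValueError(f"Unexpected filename format: {filename}")
    return fields[0]


def pair_reads(forward, reverse):
    # Sort both directions on the sample ID only.
    fwd = sorted(forward, key=pairing_key)
    rev = sorted(reverse, key=pairing_key)
    fwd_ids = [pairing_key(f) for f in fwd]
    rev_ids = [pairing_key(r) for r in rev]
    # Fail loudly instead of silently pairing different samples.
    if fwd_ids != rev_ids:
        raise ValueError("Forward and reverse sample IDs do not match")
    return list(zip(fwd, rev))
```

For example, `pair_reads(["V130_2_167_L001_R1_001.fastq.gz", "V130_166_L001_R1_001.fastq.gz"], ["V130_742_L001_R2_001.fastq.gz", "V130_2_743_L001_R2_001.fastq.gz"])` pairs the two `V130` files together and the two `V130_2` files together (the `V130_2` names are made up for illustration).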
You are understanding the problem correctly. What we need to do is find a reasonable way to provide the order to the R script. For instance passing and parsing our internal manifest in R (CSV parsing in R is a bit idiosyncratic, but it's definitely workable). Then using that to define the order/pass the reads into DADA2. Basically we need to move some business logic into R to make sure the samples are paired up correctly.
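As a rough illustration of the manifest route, here is a Python sketch of the pairing logic (the real fix would need the equivalent in R). It assumes the manifest is a CSV with `sample-id`, `filename` and `direction` columns; the manifest contents, the `V130_2` filenames, and the function name `paired_order` are all made up for the example.

```python
import csv
import io

# Hypothetical manifest contents, assuming these three columns.
MANIFEST = """sample-id,filename,direction
V130,V130_166_L001_R1_001.fastq.gz,forward
V130,V130_742_L001_R2_001.fastq.gz,reverse
V130_2,V130_2_167_L001_R1_001.fastq.gz,forward
V130_2,V130_2_743_L001_R2_001.fastq.gz,reverse
"""


def paired_order(manifest_text):
    # Group filenames by sample ID, then by read direction.
    by_sample = {}
    for row in csv.DictReader(io.StringIO(manifest_text)):
        reads = by_sample.setdefault(row["sample-id"], {})
        reads[row["direction"]] = row["filename"]

    # Emit two lists in the same sample order, so position i in
    # one list always matches position i in the other.
    forward, reverse = [], []
    for sid in sorted(by_sample):
        reads = by_sample[sid]
        # A KeyError here means a sample lacks one direction.
        forward.append(reads["forward"])
        reverse.append(reads["reverse"])
    return forward, reverse
```

Because the order is derived from the `sample-id` column rather than from the filenames, the barcode segment cannot change which files get paired.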
An alternative would be to rename the files such that only the sample IDs are present, preventing the barcode segment (that second part of the _ in your example) from breaking the ordering between forward and reverse.
Really we can do anything here between the R and Python boundary, we just need to do something...
Bug Description
Discovered by Nick Youngblut on the forum.
If IDs have a scheme which overlaps with barcode segments (e.g. "idX_Y" and just "idX"), the sorting performed on the forward and reverse reads may no longer match IF the barcode IDs differ between forward and reverse reads (as is the case when a fastq manifest format is used, or in the case where the barcodes are bases, etc.).
Steps to reproduce the behavior
Consider the sorting on these IDs
This will spuriously pair up V130-forward and V130_2-reverse. Note that the sample ID is V130, NOT V130_166 as it might initially appear.
This results in an error like so:
Expected behavior
For V130_166_L001_R1_001.fastq.gz to be paired with V130_742_L001_R2_001.fastq.gz and so on.
Questions
References
cc @benjjneb