brendanwee opened this issue 3 years ago
Yes indeed, the problem is that you would have to look up and reconstruct the pairs instead of working through both files in step. This is not a problem for very small files, but for typical data it slows things down drastically. Hence the very first trimmers allowed it, but almost all current ones no longer do. It is probably better to re-order your files?
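For small inputs, the lookup-and-reconstruct approach described above is still workable as a one-off preprocessing step rather than something done on the fly during trimming. A minimal sketch in Python (a hypothetical helper, not part of Trimmomatic), which keeps only the reads present in both files and writes them out in a consistent order:

```python
# Sketch: re-pair two FASTQ files whose reads may be out of order or
# partially missing. Hypothetical helper, not part of Trimmomatic.
# Loads all of R1 into memory, which is exactly why trimmers avoid
# doing this on the fly for typical data sizes.

def read_fastq(path):
    """Yield (name, record) for each 4-line FASTQ record (assumed well-formed)."""
    with open(path) as fh:
        while True:
            lines = [fh.readline() for _ in range(4)]
            if not lines[0]:
                break
            name = lines[0].split()[0]  # drop the description after the space
            yield name, "".join(lines)

def repair_pairs(r1_path, r2_path, out1_path, out2_path):
    """Write only the reads common to both files, in R2's order."""
    r1 = dict(read_fastq(r1_path))  # name -> full 4-line record
    with open(out1_path, "w") as o1, open(out2_path, "w") as o2:
        for name, rec2 in read_fastq(r2_path):
            rec1 = r1.get(name)
            if rec1 is not None:  # keep only reads present in both files
                o1.write(rec1)
                o2.write(rec2)
```

The output files can then be fed to Trimmomatic's PE mode as usual.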
As I'm sure you appreciate, handling the various possible corrupted-input scenarios is a potential can of worms, and I'm not sure it makes a lot of sense to include it as part of the main workflow of Trimmomatic. It might make a useful accessory tool, if a reasonable but still useful scope can be determined.
Corruption recovery is particularly challenging if the files are compressed, so we assume the existing compression-specific recovery tools are used first in those cases. That simplifies the problem to handling corrupt or incomplete text-based files. We would need to handle at least the following scenarios:
Anything else that you can think of?
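As one concrete example of the incomplete-text-file scenario, a truncated download can be salvaged by keeping only the complete 4-line records and dropping the trailing partial one. A rough sketch, assuming uncompressed FASTQ with strict 4-line records (a hypothetical helper, not an existing Trimmomatic tool):

```python
# Sketch: truncate an uncompressed FASTQ to its last complete 4-line
# record, one way to handle an incomplete download before trimming.
# Hypothetical helper; assumes strict 4-line records.

def salvage_complete_records(in_path, out_path):
    """Copy whole records to out_path; return how many were kept."""
    kept = 0
    with open(in_path) as fh, open(out_path, "w") as out:
        while True:
            lines = [fh.readline() for _ in range(4)]
            if not all(lines):
                break  # EOF, or EOF hit mid-record: drop the partial record
            if not lines[3].endswith("\n"):
                break  # conservatively treat a cut-off last line as incomplete
            if not lines[0].startswith("@") or not lines[2].startswith("+"):
                break  # malformed record: stop rather than guess
            out.write("".join(lines))
            kept += 1
    return kept
```

This stops at the first sign of trouble rather than attempting to resynchronize, which keeps the scope small at the cost of discarding anything after the damage.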
BTW, if all you need is to determine the common subset of reads which exist between two paired read files (e.g. one of them is incomplete), Trimmomatic includes an undocumented tool "Pairomatic" for splitting these input files into paired and unpaired. It handles only the cases of missing records, not reordering (which is a bit more challenging):
java -cp
The specified delimiter is used to split the read name before matching the forward and reverse names. For modern Illumina naming, a space is fine; for old-style /1 and /2 names, use "/".
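The splitting rule described above can be illustrated with a small normalization function (a sketch under assumed behaviour, not Pairomatic's actual code):

```python
# Sketch: normalize a FASTQ header for pair matching by splitting at a
# delimiter and keeping only the part before it. Hypothetical helper
# illustrating the matching rule, not Pairomatic's implementation.

def canonical_name(header, delimiter=" "):
    """Strip the leading '@' and everything from the delimiter onward."""
    return header.lstrip("@").split(delimiter, 1)[0]
```

With the default space delimiter, a modern Illumina header like `@M00001:1:FLOW:1:1101:100:200 1:N:0:1` matches on the coordinate part before the space; with `"/"`, `@read1/1` and `@read1/2` both normalize to `read1` and so match each other.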
Haha, yes, it is certainly a winding road to head down. And practically speaking, it is a lot of work for what I believe is ultimately an uncommon case.
I agree that limiting the scope to text-based files makes the most sense. I am unable to come up with other scenarios right now. That seems like a complete list to me.
Interesting to hear about Pairomatic. We can give that a try, but our stakeholders are most worried about hitting "Invalid read record content", a case they have encountered in the past when downloading from NCBI. It doesn't sound like Pairomatic would be a suitable replacement for our legacy script.
We have a legacy script that is meant to handle some edge cases with corrupt data. I can't say how common these edge cases are, but it would be nice if Trimmomatic handled them.
The input files have two reads: 1628 and 1643
Normally read 1628 passes the filter requirements:
java -jar Trimmomatic-0.39/trimmomatic.jar PE ordered_R1_001.fastq ordered_R2_001.fastq ordered.R1.fastq ordered.R1.new.unp.fastq ordered.R2.fastq ordered.R2.new.unp.fastq ILLUMINACLIP:trimmomatic_0.38_adapters_ALL-PE.fa:1:30:10 LEADING:20 TRAILING:20 SLIDINGWINDOW:4:20 MINLEN:35
But if the reads are out of order, the retained reads end up in the unpaired files:
java -jar Trimmomatic-0.39/trimmomatic.jar PE unordered_R1_001.fastq unordered_R2_001.fastq unordered.R1.fastq unordered.R1.new.unp.fastq unordered.R2.fastq unordered.R2.new.unp.fastq ILLUMINACLIP:trimmomatic_0.38_adapters_ALL-PE.fa:1:30:10 LEADING:20 TRAILING:20 SLIDINGWINDOW:4:20 MINLEN:35
java -jar Trimmomatic-0.39/trimmomatic.jar PE error_R1_001.fastq ordered_R2_001.fastq error.R1.fastq error.R1.new.unp.fastq error.R2.fastq error.R2.new.unp.fastq ILLUMINACLIP:trimmomatic_0.38_adapters_ALL-PE.fa:1:30:10 LEADING:20 TRAILING:20 SLIDINGWINDOW:4:20 MINLEN:35
error_R1_001.fastq.gz ordered_R1_001.fastq.gz ordered_R2_001.fastq.gz unordered_R1_001.fastq.gz unordered_R2_001.fastq.gz