Feature Request: Reads out of order + Seq , qual line length mismatch

brendanwee commented 3 years ago

We have a legacy script that is meant to handle some edge cases with corrupt data. I can't say how common these edge cases are, but it would be nice if Trimmomatic would handle these cases.

When reads are out of order, Trimmomatic does not associate mate pairs correctly. Example below.

The input files have two reads: 1628 and 1643

Normally read 1628 passes the filter requirements:

java -jar Trimmomatic-0.39/trimmomatic.jar PE ordered_R1_001.fastq ordered_R2_001.fastq ordered.R1.fastq ordered.R1.new.unp.fastq ordered.R2.fastq ordered.R2.new.unp.fastq ILLUMINACLIP:trimmomatic_0.38_adapters_ALL-PE.fa:1:30:10 LEADING:20 TRAILING:20 SLIDINGWINDOW:4:20 MINLEN:35

Input Read Pairs: 2 Both Surviving: 1 (50.00%) Forward Only Surviving: 0 (0.00%) Reverse Only Surviving: 0 (0.00%) Dropped: 1 (50.00%)

But if the reads are out of order, the retained reads end up in the unpaired files:

java -jar Trimmomatic-0.39/trimmomatic.jar PE unordered_R1_001.fastq unordered_R2_001.fastq unordered.R1.fastq unordered.R1.new.unp.fastq unordered.R2.fastq unordered.R2.new.unp.fastq ILLUMINACLIP:trimmomatic_0.38_adapters_ALL-PE.fa:1:30:10 LEADING:20 TRAILING:20 SLIDINGWINDOW:4:20 MINLEN:35

Input Read Pairs: 2 Both Surviving: 0 (0.00%) Forward Only Surviving: 1 (50.00%) Reverse Only Surviving: 1 (50.00%) Dropped: 0 (0.00%)

When the sequence line and quality line are different lengths, Trimmomatic errors out. It would be nicer if it just removed the read.

java -jar Trimmomatic-0.39/trimmomatic.jar PE error_R1_001.fastq ordered_R2_001.fastq error.R1.fastq error.R1.new.unp.fastq error.R2.fastq error.R2.new.unp.fastq ILLUMINACLIP:trimmomatic_0.38_adapters_ALL-PE.fa:1:30:10 LEADING:20 TRAILING:20 SLIDINGWINDOW:4:20 MINLEN:35

Exception in thread "main" java.lang.RuntimeException: Sequence and quality length don't match: 'TTTAGCAGCCATTTTAGCTTTCTGCCGGATTTTTGCAACGATACCTTTCATGTGAACATTGTGAACATTAAACTGGCGAGCCAGTTTTTTACCCATATGACCAGTGGCCCGTGAGGCCAACACCATGGTGGTGGTTACATCTCGTATGCCG' vs '1>>11111BA1DGGG31BGGGEFADB0A0EEGHH0F2BB?CG/AEGHHFFGEGD2DF1GFAFB2AF1FHF1FFG10AE//>>CHHHHHGGFFG0FGFHBFGHGAFFBCFEE///<GEF/C<CF</GB0<</?//BFDCBHFHD0FCC01<'
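For reference, the behaviour being requested can be sketched as a pre-filter that runs before Trimmomatic. This is only a minimal illustration, not Trimmomatic code and not our legacy script: the function names are hypothetical, it assumes both files are still in the same order with an intact 4-line layout, and it drops the whole pair when either mate is bad so the two files stay in step.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str, str, str]  # (name, sequence, separator, quality)


def read_records(lines: Iterable[str]) -> Iterator[Record]:
    """Group lines into 4-line records; assumes the 4-line layout is intact."""
    stripped = (line.rstrip("\n") for line in lines)
    # Zipping the same iterator four times yields consecutive groups of four
    # lines and silently drops an incomplete trailing group.
    return zip(stripped, stripped, stripped, stripped)


def lengths_match(record: Record) -> bool:
    """True when the sequence and quality lines have the same length."""
    return len(record[1]) == len(record[3])


def drop_mismatched_pairs(
    r1_lines: Iterable[str], r2_lines: Iterable[str]
) -> Tuple[List[Tuple[Record, Record]], int]:
    """Keep only pairs in which both mates pass the length check."""
    kept, dropped = [], 0
    for rec1, rec2 in zip(read_records(r1_lines), read_records(r2_lines)):
        if lengths_match(rec1) and lengths_match(rec2):
            kept.append((rec1, rec2))
        else:
            dropped += 1
    return kept, dropped
```

In practice the two arguments would be open file handles for the R1 and R2 files, and the kept pairs would be written back out as two new FASTQ files.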

error_R1_001.fastq.gz ordered_R1_001.fastq.gz ordered_R2_001.fastq.gz unordered_R1_001.fastq.gz unordered_R2_001.fastq.gz

BjoernUsadel commented 3 years ago

Yes, indeed. The problem is that you would have to look up and reconstruct the pairs instead of simply streaming through both files in parallel. This is not a problem for very small files, but for typical data it slows things down drastically. Hence the very first trimmers allowed it, but almost all current ones no longer do. It is probably better to re-order your files?
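To illustrate what "look up and reconstruct the pairs" means, here is a minimal sketch of re-pairing by read name. It is not how Trimmomatic works; the function names are hypothetical, and it assumes the read name is the header text before the first space. Note that every reverse read has to be held in memory until its mate turns up, which is exactly what becomes expensive for typical data sizes.

```python
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, str, str, str]  # (name, sequence, separator, quality)


def read_key(header: str) -> str:
    """Drop the leading '@' and keep the text before the first space."""
    return header[1:].partition(" ")[0]


def repair_pairs(
    r1_records: Iterable[Record], r2_records: Iterable[Record]
) -> Tuple[List[Tuple[Record, Record]], List[Record], List[Record]]:
    """Match mates by name regardless of their order in the input."""
    # The whole reverse file is indexed in memory before matching starts.
    r2_by_name: Dict[str, Record] = {read_key(rec[0]): rec for rec in r2_records}
    paired, unpaired1 = [], []
    for rec in r1_records:
        mate = r2_by_name.pop(read_key(rec[0]), None)
        if mate is None:
            unpaired1.append(rec)
        else:
            paired.append((rec, mate))
    # Whatever is left in the index never found a forward mate.
    unpaired2 = list(r2_by_name.values())
    return paired, unpaired1, unpaired2
```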

TonyBolger commented 2 years ago

As I'm sure you appreciate, handling the various possible corrupted input scenarios is a potential can of worms, and I'm not sure it makes a lot of sense to include it as part of the main workflow of Trimmomatic. It might make a useful accessory tool, if a reasonable but still useful scope can be determined.

Corruption recovery is particularly challenging if the files are compressed - we assume the existing compression-specific recovery tools are used first in these cases. So we simplify to just handling corrupt / incomplete text-based files. We would need to handle at least the following scenarios:

Invalid read record structure - not the normal 4 line format per record. This also requires some kind of resync mechanism to recover later reads in the file.
Invalid read record content - e.g. invalid characters in the name/sequence/quality lines, mismatch between sequence / quality length etc.
Inconsistent record names between paired files, due to missing records, invalid records that got dropped etc. It's also possible that records got re-ordered somehow (forward / reverse data appended differently, maybe?)

Anything else that you can think of?
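As a rough idea of what the first two scenarios might involve, here is a minimal sketch of a record parser with a naive resync step. Everything in it is an assumption made for illustration: the function name, the accepted sequence alphabet (ACGTN only), and the resync rule, which is simply to skip one line and try again whenever the next four lines do not form a valid record.

```python
from typing import Iterable, List, Tuple

Record = Tuple[str, str, str, str]  # (name, sequence, separator, quality)

VALID_BASES = set("ACGTN")  # assumed alphabet; real data may need IUPAC codes


def parse_with_resync(lines: Iterable[str]) -> Tuple[List[Record], int]:
    """Return the valid records plus the number of lines that were skipped."""
    rows = [line.rstrip("\n") for line in lines]
    records: List[Record] = []
    skipped = 0
    i = 0
    while i < len(rows):
        window = rows[i:i + 4]
        # A quality line may itself start with '@', so the '+' separator two
        # lines later is also checked before a record start is accepted.
        is_record = (
            len(window) == 4
            and window[0].startswith("@")
            and window[2].startswith("+")
            and set(window[1]) <= VALID_BASES
            and len(window[1]) == len(window[3])
        )
        if is_record:
            records.append((window[0], window[1], window[2], window[3]))
            i += 4
        else:
            skipped += 1  # resync: move forward one line and try again
            i += 1
    return records, skipped
```

A bad record costs up to four skipped lines here, after which parsing picks up again at the next line that looks like a valid record start.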

BTW, if all you need is to determine the common subset of reads which exist between two paired read files (e.g. one of them is incomplete), Trimmomatic includes an undocumented tool "Pairomatic" for splitting these input files into paired and unpaired. It handles only the cases of missing records, not reordering (which is a bit more challenging):

java -cp <path to trimmomatic jar> org.usadellab.trimmomatic.Pairomatic -delim " " Input1.fq Input2.fq Out_1P.fq Out_1U.fq Out2P.fq Out2U.fq

The specified delimiter is used to split the read name before matching the forward and reverse names - for modern Illumina naming, space is fine. For old style /1 /2 names, use "/".
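The effect of the delimiter on name matching can be illustrated like this. This is not Pairomatic's actual code; the function name is hypothetical, and it assumes the part of the name before the first delimiter is what gets compared.

```python
def pairing_key(header: str, delim: str = " ") -> str:
    """Strip the leading '@' and keep the name up to the first delimiter."""
    name = header[1:] if header.startswith("@") else header
    return name.split(delim, 1)[0]


# Modern Illumina naming: the mate number follows a space.
forward = pairing_key("@M01:1:000:1:1101:1628:1 1:N:0:1", " ")
reverse = pairing_key("@M01:1:000:1:1101:1628:1 2:N:0:1", " ")

# Old-style naming: the mate number follows a slash.
old_forward = pairing_key("@HWUSI-EAS100R:6:73:941:1973#0/1", "/")
old_reverse = pairing_key("@HWUSI-EAS100R:6:73:941:1973#0/2", "/")
```

With the right delimiter, the forward and reverse keys of a pair come out identical, so they can be compared directly.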

brendanwee commented 2 years ago

Haha, yes, it is certainly a winding road to head down. And practically speaking, it is a lot of work for what is ultimately a case I do not believe to be very common.

I agree that limiting the scope to text-based files makes the most sense. I am unable to come up with other scenarios right now. That seems like a complete list to me.

Interesting to hear about Pairomatic. We can give that a try, but our stakeholders are most worried about receiving "Invalid read record content", a case they have encountered in the past when downloading from NCBI. Pairomatic doesn't sound like it would be a suitable replacement for our legacy script.

usadellab / Trimmomatic

Feature Request: Reads out of order + Seq , qual line length mismatch #13