lh3 / seqtk

Toolkit for processing sequences in FASTA/Q formats
MIT License
1.37k stars 308 forks source link

seqtk mergepe produces invalid FASTQ output on empty input sequences #109

Closed tskir closed 6 years ago

tskir commented 6 years ago

When one of the sequences to merge is empty (has zero length), mergepe produces output which isn't a valid FASTQ file. Example:

1.fq

@D00723:158:CAJ17ANXX:6:1103:3343:29327 1:N:0:GAGATTCC+CCTATCCT
GTGGGCACCTGTCATCCCAGCTACTTGGGAGGCTGAGTCAGGAGAATCACTGGAACCTAGGAGTTGGAGTTTGTAGTGAGCCAAGATCGCACCACTGCACTCCAGCCTGGGCGACAGAATGAGAC
+
</<<B<FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFBF<FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFF<FBFFFFF<
@D00723:158:CAJ17ANXX:6:1103:3362:29333 1:N:0:GAGATTCC+CCTATCCT
GGAAGGGGAAAGACAAGAAAGAGGGAGACGGAGAGAGAGAGAGAAGGAAAAGGAAGGATGACAAGACAGGAAGACAGAGAAAAGAAAAAGACTGAAGGAGGGAAGGAAGGAAAGAAGGGAAGG
+
B/<<BFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF<FFFFBFFFFFFFFF/FFFFBFFFFFFFFFF<
@D00723:158:CAJ17ANXX:6:1103:3606:29320 1:N:0:GAGATTCC+CCTATCCT
GGCCCACTCTGTCTGGCTCCCAACTTCCCTTTGAGGCTTCCCGGTCCTGGGGTAACAGCTGGCTTAGGGGAGCCAGTCGGGGCTGCCTTCCAGGCTTCCAGGTCCACTTTAAGGAGGGATGGGG
+
B<BBBFFFFFFFFFFFFFFFFBFFFFFFFFFFBB/<BFFFFFFFBFFFFFFFBFFF//FFFFFFFFFF/BFFFBFFFFBFFFFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFF7/B<<FFFFFB

2.fq

@D00723:158:CAJ17ANXX:6:1103:3343:29327 2:N:0:GAGATTCC+CCTATCCT
GTTGTGCAGCCCTCAGCGCCACCCACCACCAGAACTCTCCCCACCTTTGACGCACGTGTGCAGAACTGAAGGCCACGCTGGAATTGTAGCCCACGTGCCACACTGCTGGCTGTGTCATGGATATC
+
BBBBBFFFFFFFFFFFFFFBFFFFFFFBFFFFFFFFFFFFFF/FFBFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFFF<B//B<FFFFFFFFFFFFFFFFBFFFFF/BB/F/B/B7/<FFFFF<
@D00723:158:CAJ17ANXX:6:1103:3362:29333 2:N:0:GAGATTCC+CCTATCCT

+

@D00723:158:CAJ17ANXX:6:1103:3606:29320 2:N:0:GAGATTCC+CCTATCCT
GAGGACCAGCAGCTGGACATCCAGGTCATGGCAGAGGCCAGAGAGTCCTGGGACCTGGGCCTACAGGAGCAGGAGGGGCGGTACACACCTCTGCCCTTGGGCGGGAACAAGGAGCAAGCCATCT
+
/BBBBFFFFFFFFFFBFFFFFFFFFFFFFFFFFF<FFFFFFFFFFFFF<FFFFFBFFFFBFFFFFFFFFF<FFBFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFBBFFFFFFFF<FFFFFFB/

Output of seqtk mergepe 1.fq 2.fq

@D00723:158:CAJ17ANXX:6:1103:3343:29327 1:N:0:GAGATTCC+CCTATCCT
GTGGGCACCTGTCATCCCAGCTACTTGGGAGGCTGAGTCAGGAGAATCACTGGAACCTAGGAGTTGGAGTTTGTAGTGAGCCAAGATCGCACCACTGCACTCCAGCCTGGGCGACAGAATGAGAC
+
</<<B<FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFBF<FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFF<FBFFFFF<
@D00723:158:CAJ17ANXX:6:1103:3343:29327 2:N:0:GAGATTCC+CCTATCCT
GTTGTGCAGCCCTCAGCGCCACCCACCACCAGAACTCTCCCCACCTTTGACGCACGTGTGCAGAACTGAAGGCCACGCTGGAATTGTAGCCCACGTGCCACACTGCTGGCTGTGTCATGGATATC
+
BBBBBFFFFFFFFFFFFFFBFFFFFFFBFFFFFFFFFFFFFF/FFBFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFFF<B//B<FFFFFFFFFFFFFFFFBFFFFF/BB/F/B/B7/<FFFFF<
@D00723:158:CAJ17ANXX:6:1103:3362:29333 1:N:0:GAGATTCC+CCTATCCT
GGAAGGGGAAAGACAAGAAAGAGGGAGACGGAGAGAGAGAGAGAAGGAAAAGGAAGGATGACAAGACAGGAAGACAGAGAAAAGAAAAAGACTGAAGGAGGGAAGGAAGGAAAGAAGGGAAGG
+
B/<<BFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF<FFFFBFFFFFFFFF/FFFFBFFFFFFFFFF<
>D00723:158:CAJ17ANXX:6:1103:3362:29333 2:N:0:GAGATTCC+CCTATCCT

@D00723:158:CAJ17ANXX:6:1103:3606:29320 1:N:0:GAGATTCC+CCTATCCT
GGCCCACTCTGTCTGGCTCCCAACTTCCCTTTGAGGCTTCCCGGTCCTGGGGTAACAGCTGGCTTAGGGGAGCCAGTCGGGGCTGCCTTCCAGGCTTCCAGGTCCACTTTAAGGAGGGATGGGG
+
B<BBBFFFFFFFFFFFFFFFFBFFFFFFFFFFBB/<BFFFFFFFBFFFFFFFBFFF//FFFFFFFFFF/BFFFBFFFFBFFFFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFF7/B<<FFFFFB
@D00723:158:CAJ17ANXX:6:1103:3606:29320 2:N:0:GAGATTCC+CCTATCCT
GAGGACCAGCAGCTGGACATCCAGGTCATGGCAGAGGCCAGAGAGTCCTGGGACCTGGGCCTACAGGAGCAGGAGGGGCGGTACACACCTCTGCCCTTGGGCGGGAACAAGGAGCAAGCCATCT
+
/BBBBFFFFFFFFFFBFFFFFFFFFFFFFFFFFF<FFFFFFFFFFFFF<FFFFFBFFFFBFFFFFFFFFF<FFBFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFBBFFFFFFFF<FFFFFFB/

Note how @D00723:158:CAJ17ANXX:6:1103:3362:29333 2:N:0:GAGATTCC+CCTATCCT from the second file was transformed into >D00723:158:CAJ17ANXX:6:1103:3362:29333 2:N:0:GAGATTCC+CCTATCCT in the output.

tskir commented 6 years ago

I tested this on the latest release, seqtk-1.2-r94, but the latest master appears to be affected by this as well.

lh3 commented 6 years ago

seqtk doesn't and won't work with zero-length records.

tskir commented 6 years ago

@lh3, OK, this is your right as the author of the software. However I just want to point out two things for the record:

  1. I did not tamper with the reads, they were taken directly out of an Illumina HiSeq (not sure which model, but one of the older ones) output after internal demultiplexing. Which means that seqtk doesn't support some real-world FASTQ files.
  2. When encountering zero-length records, seqtk doesn't crash or warn of what it considers to be invalid input, it just quietly produces invalid output which will break any downstream processing. This might not be ideal.
shenwei356 commented 6 years ago

Since seqtk is open source, we can freely modify the code. If the change helps, a pull request is welcome to contribute it.

I don't write C/C++, but the seqtk source code is easy to read. Adding one line in front of https://github.com/lh3/seqtk/blob/master/seqtk.c#L1412 can easily solve this , by discarding reads with zero-length seq in one pair:

if (seq[0]->seq.l == 0 || seq[1]->seq.l == 0) continue; 
lh3 commented 6 years ago

@tskir zero-length records directly from Illumina pipeline are extremely rare. This is also mostly Illumina's issue. I didn't say zero-length records are "invalid". Seqtk actually parsed the record, though outputted it as FASTA. Such output is still compatible with seqtk/bwa or other tools that use the kseq.h parser, as long as they work with 0-length sequences.

@shenwei356 thanks a lot for the suggestion, but this discards data and does not address the same issue in other part of seqtk. It is in theory possible to implement a more proper fix with some significant changes. However, given that this is a rare issue, I am not convinced that the efforts will be paid off. Generally, I fix issues that may affect a lot of users. I don't have the bandwidth to take care of every corner cases.