lh3 / seqtk

Toolkit for processing sequences in FASTA/Q formats
MIT License

Differences between seqtk and Biopython fastq-solexa conversion #48

Closed gittaylor closed 8 years ago

gittaylor commented 9 years ago

Hi Heng,

I am using this toolkit to convert Solexa (Illumina <1.3) FASTQ to the newer format. I compared the result to the output of the Bio.SeqIO.convert function (Biopython toolkit), and I am seeing consistent differences of 1 to 2 in the quality scores at low quality values. Do you know why this might be happening?

Thanks, Taylor

tseemann commented 9 years ago

@gittaylor This is probably because Solexa originally used a different formula to convert probabilities to Q values, which is described here: http://en.wikipedia.org/wiki/FASTQ_format#Quality Note how the two formulae only diverge at low qualities. I suspect the Python code is doing a "true" conversion, whereas seqtk is just offsetting the ASCII values.
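
To illustrate the point, here is a minimal Python sketch comparing the two approaches. It assumes the Solexa-to-Phred mapping given on the Wikipedia page, Q_phred = 10 * log10(10^(Q_solexa/10) + 1), rounded to the nearest integer, with ASCII offset 64 for Solexa and 33 for standard (Sanger) FASTQ. The function names are made up for this example; this is not seqtk's or Biopython's actual code.

```python
import math

SOLEXA_OFFSET = 64   # ASCII offset of Solexa / Illumina <1.3 quality strings
SANGER_OFFSET = 33   # ASCII offset of standard (Sanger) FASTQ


def solexa_to_phred(q_solexa):
    """Map a Solexa score to a Phred score (assumed formula, rounded)."""
    return int(round(10 * math.log10(10 ** (q_solexa / 10.0) + 1)))


def convert_true(qual):
    """Rescale each Solexa score to Phred, then re-encode with offset 33."""
    return "".join(
        chr(solexa_to_phred(ord(c) - SOLEXA_OFFSET) + SANGER_OFFSET)
        for c in qual
    )


def convert_offset_only(qual):
    """Only shift the ASCII values from offset 64 to offset 33."""
    shift = SOLEXA_OFFSET - SANGER_OFFSET
    return "".join(chr(ord(c) - shift) for c in qual)


if __name__ == "__main__":
    # Show where the two conversions disagree for Solexa scores -5..40.
    for q in range(-5, 41):
        phred = solexa_to_phred(q)
        if phred != q:
            print("Solexa %3d -> Phred %2d (offset-only keeps %3d)" % (q, phred, q))
```

Under these assumptions, a Solexa score of 0 maps to Phred 3, while scores of 10 and above come out unchanged, so the two conversions only disagree at the low end of the scale.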

lh3 commented 8 years ago

As @tseemann said. Seqtk is unable to convert Illumina<1.3 fastq to the standard fastq.