relipmoc / skewer

MIT License

Mis-classifying Sanger+33 FASTQ as Solexa+64 #24

Open tseemann opened 8 years ago

tseemann commented 8 years ago

We have downloaded some Illumina PE reads from SRA and we got the CONTRADICT_FASTQ error.

Both R1 and R2 were in Sanger+33 quality format. However, the first read in R1 contains the quality symbol K, which is Phred 42 (ASCII 75 minus 33). Illumina qualities usually stop at 40, but they can be higher (e.g. in Moleculo sequencing), as described here: https://en.wikipedia.org/wiki/FASTQ_format#Encoding
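
For anyone checking the arithmetic, here is a minimal sketch of how a quality character maps to a score under each offset. The function name `phred_from_char` is made up for illustration and is not part of skewer.

```cpp
// Decode a FASTQ quality character under a given ASCII offset.
// Offset 33 is Sanger / Illumina 1.8+; offset 64 is Illumina 1.3+/1.5+.
// (Hypothetical helper for illustration, not taken from skewer.)
constexpr int phred_from_char(char symbol, int offset) {
    return static_cast<int>(symbol) - offset;
}

// 'K' is ASCII 75: Phred 42 under +33, but only 11 under +64.
static_assert(phred_from_char('K', 33) == 42, "K decodes to 42 with offset 33");
static_assert(phred_from_char('K', 64) == 11, "K decodes to 11 with offset 64");
// 'J' (ASCII 74) is the highest character the quoted threshold still accepts.
static_assert(phred_from_char('J', 33) == 41, "J decodes to 41 with offset 33");
```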

I think the thresholds in the code below need to be more flexible about how high a Q value is still accepted as SANGER_FASTQ. Maybe change 74 to 80?

                if(chr < 59){
                    format_new = SANGER_FASTQ;
                    break;
                }
                if(chr > 74){
                    format_new = SOLEXA_FASTQ;
                    break;
                }
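
To make the effect of the upper threshold concrete, here is a rough sketch of the same two checks wrapped in a standalone function. The enum, the function name `classify_quality`, the `upper` parameter and the `Unknown` fallback are assumptions for illustration; the surrounding loop in skewer is not shown in this issue and may behave differently.

```cpp
#include <string>

// Hypothetical names for illustration; skewer's own constants are
// SANGER_FASTQ and SOLEXA_FASTQ, and its fallback behaviour is not shown here.
enum class QualityFormat { Unknown, Sanger, Solexa };

// Classify a quality string by its first decisive character, mirroring the
// two checks quoted above. `upper` is the highest character code that is
// still treated as undecided (74 in the quoted code).
inline QualityFormat classify_quality(const std::string& quality, int upper = 74) {
    for (unsigned char chr : quality) {
        if (chr < 59) {
            return QualityFormat::Sanger;  // below ';', impossible under +64
        }
        if (chr > upper) {
            return QualityFormat::Solexa;  // considered too high for +33
        }
    }
    return QualityFormat::Unknown;  // no decisive character was seen
}
```

With `upper = 74`, a +33 quality string that reaches K (ASCII 75) before any character below 59 is classified as Solexa. With `upper = 80`, the K is skipped and a later low character can still settle it as Sanger. The trade-off is that a genuine +64 file would then only be recognised by characters above 80.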

maciejmotyka commented 2 years ago

I know that this software is not maintained anymore, but it's still in use in some pipelines, so maybe my comment will help somebody debug.

If the situation described above produces this error message:

Error: the FASTQ quality formats of input files are different

The solution is to determine the encoding yourself by examining the .fastq files and then specify it manually with the -f flag:

-f, --format Format of FASTQ quality value: sanger|solexa|auto; (auto)
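
One way to examine the files is to scan every quality line and record the lowest and highest character codes. The sketch below assumes plain, uncompressed FASTQ with exactly four lines per record; `QualityRange`, `scan_quality_range` and `looks_like_phred33` are hypothetical names, not part of skewer.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Lowest and highest quality character codes seen, plus the record count.
struct QualityRange {
    int min_code = 127;
    int max_code = 0;
    long records = 0;
};

// Scan a FASTQ stream, assuming four lines per record (no wrapped sequences).
inline QualityRange scan_quality_range(std::istream& in) {
    QualityRange range;
    std::string header, sequence, plus, quality;
    while (std::getline(in, header) && std::getline(in, sequence) &&
           std::getline(in, plus) && std::getline(in, quality)) {
        // Drop a trailing carriage return so CRLF files do not skew the minimum.
        if (!quality.empty() && quality.back() == '\r') quality.pop_back();
        for (unsigned char chr : quality) {
            if (chr < range.min_code) range.min_code = chr;
            if (chr > range.max_code) range.max_code = chr;
        }
        ++range.records;
    }
    return range;
}

// Convenience overload for text already in memory, e.g. the first few records.
inline QualityRange scan_quality_range(const std::string& fastq_text) {
    std::istringstream in(fastq_text);
    return scan_quality_range(in);
}

// Any character below ';' (ASCII 59) cannot occur under a +64 offset,
// so seeing one is strong evidence for Sanger / Illumina 1.8+ (+33).
inline bool looks_like_phred33(const QualityRange& range) {
    return range.records > 0 && range.min_code < 59;
}
```

If `looks_like_phred33` is true for both input files, `-f sanger` from the help text quoted above would be the matching choice. A file whose minimum is 59 or higher stays ambiguous with this check alone.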

In my case the first record in the first file was:

@SRR10266853.1 1 length=76
NACACTCCTGCCGGCTGGTCTTGGCCGCTGCCGTCCCTGCAGGCCTGAGCTGGGGGGCTTCGGCCACACTCGGAAC
+
#AAFFKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKFKKFFKKKFKFFKKKK

skewer saw the # symbol (ASCII 35, below the lower threshold of 59) and correctly decided it is Sanger/Illumina 1.8+ encoding.

First record in the second file was:

@SRR10266853.1 1 length=74
CTCAGACAACGACAGCACAGAGAACGAGGCCCCAGAGCCGAGGGAGAGGGTTCCGAGTGTGGCCGAAGCCCCCC
+
AAAFFKKKKKKKKKKKKKAFKKKKKKKKKKKKFFFKFKKKA7AFKKKKFK,AKKKFF7FAFK7FKFAFFKKKKK

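Here skewer proceeded until it saw K (ASCII 75, above the upper threshold of 74) and decided it is Solexa/Illumina 1.3+/Illumina 1.5+ encoding, although the record is clearly Illumina 1.8+: the characters 7 and , later in the same line are below anything a +64 encoding can produce. HiSeq 3000/4000 and the X series can produce scores that include K.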