shenwei356 / seqkit

A cross-platform and ultrafast toolkit for FASTA/Q file manipulation
https://bioinf.shenwei.me/seqkit
MIT License
1.28k stars 157 forks source link

Fastq sequence average quality results are different #328

Closed cfc424 closed 2 years ago

cfc424 commented 2 years ago

Prerequisites

./seqkit version seqkit v2.3.0

Describe your issue

Hi, my question is: Fastq sequence average quality calculation results are different from manually calculated and seqtk results

Fastq test file seqkit.test.fq:

@a6ee3339-e0ab-43c6-a7a3-a63149139af8 runid=25729ae17680edd32e6bf0be49090e5576fb01b7 read=5 ch=262 start_time=2020-09-22T09:35:40Z flow_cell_id=PAF23868 protocol_group_id=basecalled sample_id=20200922-UNL139-P1-PAF23868
CAAATTGCTTTCGTTCAGTTACGTATTGCTTTCTAAAATGGATTAAATATCTGGGGTCCCAGTCTACCAATGGAATTCTGTCTGCTCAAATTCTAGAATAAAAACGCTTCAGATTCATCTCACCTTTACAAATTTGAATTTCTATTTGTTGAGTAATAACTTAACCCTTCCAATAAAAGCTCCTAAAAAAAAAAGTCTTAAAGCCATCGTGGTTTTCGATTA
+
10.%%$%'()+%*.5:(3045+,.+..8>FF6:-CFDH;>::??A@6:0+:%$&%7-.10/''(%+1188<<=@B3:5;8;;=79<8;8:>B=6E('+()((*('&"#%%$'.1/.69715*,--2.-109@HGD800/:D?90CDB&@>2/18331/.###779:::;%*(%%,.'&#,1028@IH>952/..**&,+&6.,--844686;:;:31:999#

seqkit calculate command and results:

./seqkit fx2tab -H  -alignq -b 33  seqkit.test.fq 

#id length  GC  alphabet    avg.qual
a6ee3339-e0ab-43c6-a7a3-a63149139af8    222 32.88   ACGT    10.24

the result show that average quality is 10.24.

I aslo used seqtk:

seqtk fqchk  seqkit.test.fq 
min_len: 222; max_len: 222; avg_len: 222.00; 40 distinct quality values
POS #bases  %A  %C  %G  %T  %N  avgQ    errQ    %low    %high
ALL 222 32.4    19.4    13.5    34.7    0.0 17.7    10.5    55.9    44.1

the result show that average quality is 17.7, which is larger than seqkit results.

and I tested the results with python3 ord function, average quality is 17.7117:

a="""10.%%$%'()+%*.5:(3045+,.+..8>FF6:-CFDH;>::??A@6:0+:%$&%7-.10/''(%+1188<<=@B3:5;8;;=79<8;8:>B=6E('+()((*('&"#%%$'.1/.69715*,--2.-109@HGD800/:D?90CDB&@>2/18331/.###779:::;%*(%%,.'&#,1028@IH>952/..**&,+&6.,--844686;:;:31:999#"""
x=[]
for i in a:
    x.append(ord(i)-33)
print(x)
print(sum(x)/len(x))

So, am I missed somthing in calculating average quality with seqkit ? and I also want to know if I can use seqkit to calculate nanopore quality?

Thanks,

Fengchao

shenwei356 commented 2 years ago

https://github.com/shenwei356/seqkit/issues/297#issuecomment-1085639766

you can’t just do simple arithmetic mean of all the qscores, because it won’t be a representation of the mean error rate then.

https://gigabaseorgigabyte.wordpress.com/2017/06/26/averaging-basecall-quality-scores-the-right-way/

cfc424 commented 2 years ago

I'll read it, thanks a lot.