james-e-barrett commented 4 weeks ago

Issue Report

I am trying to understand how the mean quality scores in the qs tag are calculated. Here is one entry from my base called .bam file (loaded into R).

> dat
 [1] "01d312cc-58fd-4ba0-a796-4d9f86f0baea"                                                                                                                                                                                                                                                    
 [2] "4"                                                                                                                                                                                                                                                                                       
 [3] "*"                                                                                                                                                                                                                                                                                       
 [4] "0"                                                                                                                                                                                                                                                                                       
 [5] "0"                                                                                                                                                                                                                                                                                       
 [6] "*"                                                                                                                                                                                                                                                                                       
 [7] "*"                                                                                                                                                                                                                                                                                       
 [8] "0"                                                                                                                                                                                                                                                                                       
 [9] "0"                                                                                                                                                                                                                                                                                       
[10] "ATGTATGTTGCTTGGTTCAGTTACATATTGTCGTTTTCTGCTGGCTTCGAGCAAAACTGCTGATATTAATTCAAGTCTTTCATTTTCCCAATAGTAAGACTAAAGACAAAGAAAAAAAAAGTGAAATTTATTCCAAACTGCTGCAAGTAGGATCAAAGGAATTATTCAAATGCAAACTTTTGGGCTTGATCATCAGTATATTTTTCAACACCTACATCAAAGAATATGGTGTTATCTGATGCCTCTTCACTGTGTATTTGAGAATACTGAAGCAATGGTT"
[11] "#''+%&'+'&##+&('-+(((*(&##%'(54*$%1A>00133$%',6+'&*80.73-.554(('(+'$'(*)2584000:3/24>C?@C225113/./.,.'''(&$&4::-8:JMPQNLIDCGQNA@C.-.7=;=:6:///&#%$&%')4-5/.0FH:;?9=)))'$$4--215:0/49:+*7-%0*)))'''%%;<<=ABELH<9;B@;@B3224/%&270;6.,'(/>>?22.-(().+-9)'(8BABBKDIFFFC7234:.()'(-6/0'((,,($"
[12] "qs:f:10.5601"                                                                                                                                                                                                                                                                            
[13] "du:f:0.735"                                                                                                                                                                                                                                                                              
[14] "ns:i:2940"                                                                                                                                                                                                                                                                               
[15] "ts:i:10"                                                                                                                                                                                                                                                                                 
[16] "mx:i:2"                                                                                                                                                                                                                                                                                  
[17] "ch:i:1638"                                                                                                                                                                                                                                                                               
[18] "st:Z:1970-01-01T00:01:09Z"                                                                                                                                                                                                                                                               
[19] "rn:i:44"                                                                                                                                                                                                                                                                                 
[20] "fn:Z:PAK71464_pass_df3880a7_1567c6a3_1.fast5"                                                                                                                                                                                                                                            
[21] "sm:f:101.437"                                                                                                                                                                                                                                                                            
[22] "sd:f:30.2985"                                                                                                                                                                                                                                                                            
[23] "sv:Z:quantile"                                                                                                                                                                                                                                                                           
[24] "dx:i:0"

If I try to compute the mean quality score myself I get a different value from the qs tag

> mean(utf8ToInt(dat[11])-33)
[1] 16.83571

I'm using dorado version 0.7.0+71cc744. Thank you.

malton-ont commented 4 weeks ago

Hi @james-e-barrett,

The code for mean qscore starts in this function, where we determine how many bases at the beginning of the read to exclude from the calculation. From there, we go to this function. As you can see, the "mean qscore" isn't just the mean of the shifted qstring - we switch back from log space first, calculate the mean error and then convert that to a qscore between 1 and 50.

I hope that helps clarify things.

james-e-barrett commented 4 weeks ago

Hi @malton-ont,

Okay, I see it's not as straightforward as simply using the qstring. Thanks for clarifying that.

In the documentation it says the qs tag is defined as the "mean basecall qscore". I would be grateful if you could you explain this in more detail as it would be helpful to know exactly how to interpret the qs tag values.

Many thanks.

HalfPhoton commented 4 weeks ago

@james-e-barrett, This is the implementation of mean_qscore_from_qstring

As @malton-ont said, we need to convert the log-space qstring into linear scores before taking the average - We then convert back to log-space qscore.

From your GH profile it looks like your preferred language is R so something like:

# Sample list of floats in 10*log10 space
qscores <- c(10.0, 20.0, 30.0) # Example values in 10*log10 space 

# Step 1: Convert qscores to linear space
scores <- 10^(-qscores/10)

# Step 2: Calculate the average in linear space
scores_avg <- mean(scores)

# Step 3: Convert the linear average back to 10*log10 (qscore) log-space 
mean_qscore <- -10*log10(scores_avg)

# Print the result
mean_qscore

james-e-barrett commented 4 weeks ago

@HalfPhoton thank you this is clear.

However, this gives a value of 9.514022 which still differs from the value of 10.5601 in the qs tag. Based on what @malton-ont said some of the bases are skipped. Why is this? Many thanks

HalfPhoton commented 3 weeks ago

@james-e-barrett, Continuing: reply from @malton-ont

In dorado we define mean_qscore_start_pos which is an index in the qstring where we start the mean qscore calculation from. We trim the qscore very slightly depending on the model here and on the strand type (RNA/DNA) here

If your using DNA this value is probably 10 meaning we trim the first 10 bases which as the signal is noisy at the very start of the read and then quickly settles down. There are a few subtleties (e.g. handling reads shorter than 10) which you can view in the code yourself).

That doesn't appear to be the case here - so I'm assuming you're basecalling RNA?

james-e-barrett commented 3 weeks ago

@HalfPhoton ah okay I see. In the example above it is DNA. This isn't particularly important, I was just curious how the values were calculated.

nanoporetech / dorado

How are qs tag values calculated? #865

Issue Report