Closed by james-e-barrett 3 weeks ago
Hi @james-e-barrett,
The code for mean qscore starts in this function, where we determine how many bases at the beginning of the read to exclude from the calculation. From there, we go to this function. As you can see, the "mean qscore" isn't just the mean of the shifted qstring - we switch back from log space first, calculate the mean error and then convert that to a qscore between 1 and 50.
I hope that helps clarify things.
Hi @malton-ont,
Okay, I see it's not as straightforward as simply using the qstring. Thanks for clarifying that.
In the documentation, the qs tag is defined as the "mean basecall qscore". I would be grateful if you could explain this in more detail, as it would be helpful to know exactly how to interpret the qs tag values.
Many thanks.
@james-e-barrett, this is the implementation of mean_qscore_from_qstring.
As @malton-ont said, we need to convert the log-space qstring into linear-space error probabilities before taking the average. We then convert that average back to a log-space qscore.
From your GH profile it looks like your preferred language is R, so something like:
```r
# Example vector of qscores in 10*log10 space
qscores <- c(10.0, 20.0, 30.0)
# Step 1: Convert qscores to linear space
scores <- 10^(-qscores/10)
# Step 2: Calculate the average in linear space
scores_avg <- mean(scores)
# Step 3: Convert the linear average back to 10*log10 (qscore) log-space
mean_qscore <- -10*log10(scores_avg)
# Print the result
mean_qscore
```
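To apply the same steps to a raw qstring rather than to numeric values, a minimal sketch might look like the following. It assumes the qstring is Phred+33 encoded (as in the BAM QUAL field), and the helper name `mean_qscore_sketch` is hypothetical; the clamp to the range 1 to 50 follows @malton-ont's description above. It is an illustration of the steps, not dorado's actual code.

```r
# Hypothetical helper: illustrates the described steps, not dorado's implementation.
mean_qscore_sketch <- function(qstring) {
  # Assumption: Phred+33 encoding, so subtract 33 from each character code.
  q <- utf8ToInt(qstring) - 33
  # Back to linear space: per-base error probabilities.
  err <- 10^(-q / 10)
  # Mean error, converted back to a qscore.
  mean_q <- -10 * log10(mean(err))
  # Limit the result to the 1..50 range mentioned earlier in the thread.
  min(max(mean_q, 1), 50)
}

qstring <- "+5?"                 # characters encoding Phred scores 10, 20, 30
mean(utf8ToInt(qstring) - 33)    # naive mean of the shifted qstring: 20
mean_qscore_sketch(qstring)      # mean taken in error space: roughly 14.3
```

The gap between the two results shows why averaging the shifted qstring directly overestimates quality: in error space the lowest-quality bases dominate the mean.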
@HalfPhoton thank you, this is clear.
However, this gives a value of 9.514022, which still differs from the value of 10.5601 in the qs tag. Based on what @malton-ont said, some of the bases are skipped. Why is this? Many thanks.
@james-e-barrett, continuing on from @malton-ont's reply:
In dorado we define mean_qscore_start_pos, which is the index in the qstring from which we start the mean qscore calculation. We trim the qstring very slightly depending on the model here and on the strand type (RNA/DNA) here.
If you're using DNA this value is probably 10, meaning we trim the first 10 bases, as the signal is noisy at the very start of the read and then quickly settles down. There are a few subtleties (e.g. handling reads shorter than 10), which you can view in the code yourself.
That doesn't appear to be the case here, so I'm assuming you're basecalling RNA?
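The trimming described above can be sketched in R as follows. The function name `trimmed_mean_qscore_sketch` and the `start_pos` argument are hypothetical, the qstring is assumed to be Phred+33 encoded, and the rule for short reads (no trimming when the read is not longer than `start_pos`) is an assumption for illustration only; the real handling is in the dorado source.

```r
# Hypothetical sketch of the trimming step; not dorado's implementation.
trimmed_mean_qscore_sketch <- function(qstring, start_pos = 10) {
  q <- utf8ToInt(qstring) - 33           # assumes Phred+33 encoding
  # Assumption: reads no longer than start_pos are left untrimmed.
  if (length(q) > start_pos) {
    q <- q[(start_pos + 1):length(q)]    # R vectors are 1-indexed
  }
  # Average in error space, convert back, and clamp to 1..50.
  mean_q <- -10 * log10(mean(10^(-q / 10)))
  min(max(mean_q, 1), 50)
}

# Ten noisy leading bases (Q5, "&") followed by twenty Q30 bases ("?").
qstring <- paste0(strrep("&", 10), strrep("?", 20))
trimmed_mean_qscore_sketch(qstring, start_pos = 0)    # untrimmed: roughly 9.7
trimmed_mean_qscore_sketch(qstring, start_pos = 10)   # trimmed: about 30
```

Because the mean is taken in error space, a handful of low-quality bases at the start of a read can pull the result down sharply, which is why the start position matters when reproducing the qs value by hand.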
@HalfPhoton ah okay I see. In the example above it is DNA. This isn't particularly important, I was just curious how the values were calculated.
Issue Report
I am trying to understand how the mean quality scores in the qs tag are calculated. Here is one entry from my basecalled .bam file (loaded into R).
If I try to compute the mean quality score myself I get a different value from the qs tag.
I'm using dorado version 0.7.0+71cc744. Thank you.