s-andrews / FastQC

A quality control analysis tool for high throughput sequencing data
GNU General Public License v3.0

Bug: FastQC averages phred scores naively #110

Open rhpvorderman opened 1 year ago

rhpvorderman commented 1 year ago

Phred quality scores are log scores: a quality of 30 stands for a 1 in 1000 error rate, and a quality of 10 stands for a 1 in 10 error rate. The correct average of 10 and 30 is therefore -10 * log10((0.1 + 0.001) / 2) ≈ 13, not (10 + 30) / 2 = 20. On actual real-world data the average quality produced by the naive method can differ from the proper method by more than 10 Phred units. That is an overestimation of the quality by a factor of 10!
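A minimal sketch of the two averaging methods in plain Python (the function names are mine; FastQC itself is Java and this is not its code):

```python
import math

def naive_mean_phred(phreds):
    # Arithmetic mean of the Phred scores themselves (mean of logs).
    return sum(phreds) / len(phreds)

def error_rate_mean_phred(phreds):
    # Convert each Phred score back to an error probability, average the
    # probabilities, then convert the mean back to Phred (log of mean).
    mean_p = sum(10 ** (-q / 10) for q in phreds) / len(phreds)
    return -10 * math.log10(mean_p)

print(naive_mean_phred([10, 30]))       # 20.0
print(error_rate_mean_phred([10, 30]))  # ~12.97
```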

s-andrews commented 1 year ago

Thanks for reporting this. This is actually not an oversight but an issue we've gone back and forth on, and the current behaviour is intended. If you hunt through the closed issues I think there are threads where we've gone through this before.

Mathematically I completely agree that if you're using this to calculate overall error rates then the current behaviour (mean of logs rather than log of mean) will give an incorrect estimate of the expected error level. However, I'm not sure that this is how the data is best used in practice, and I'm keen that the visual impression given by the plots in the program is as useful as it can be.

The other issue to consider here is that base call p-values operate on a very skewed scale and are not normally distributed. The reason Phred scores are so widely used is that the scaling of p-values is difficult to interpret and plot. Ultimately a mean is only really a useful summary of values which are at least vaguely normal in their distribution, and this applies much better to Phred scores than to raw p-values. The plots we make which include means (the per-base quality and per-sequence quality and to some extent the per-tile quality) show useful summaries and deviations when plotting mean Phred scores which help diagnose real problems in libraries, and these would not be as clear or useful if the averaging took place on the linear p-value scale.

I'm always open to suggestions for improvement or clarification, but this one isn't a case where there's an obviously correct and best answer I'm afraid.

rhpvorderman commented 1 year ago

> The plots we make which include means (the per-base quality and per-sequence quality and to some extent the per-tile quality) show useful summaries and deviations when plotting mean Phred scores which help diagnose real problems in libraries, ...

I agree with that part. The deviation bars that FastQC draws also give a clear impression of low-quality bases when they do occur.

> and these would not be as clear or useful if the averaging took place on the linear p-value scale.

I am inclined to disagree here. Different averaging leads to different conclusions. This is the FastQC plot for some generic Illumina data:

[Figure: FastQC per-base quality plot, averaged naively on the Phred scale]

This is a plot of the same data generated using the correct way of averaging:

[Figure: per-base quality plot with the averaging done on error probabilities]

These graphs tell two different stories. The FastQC graph tells a story of consistently high-quality data with substantial variation in quality only at the tail end. The correctly averaged graph tells a story of continuously declining quality until the end. The reported average quality also differs hugely between the two graphs: FastQC reports a stellar average of 36 across most of the read length, while the correct graph shows that the data is still good, but not that good.
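For illustration, here is a sketch of how such a per-position plot could be computed (plain Python with a hypothetical helper name; this is not how either plot above was actually generated):

```python
import math

def per_position_mean_phred(reads_quals):
    # reads_quals: one list of Phred scores per read; all reads are
    # assumed to have the same length for simplicity.
    length = len(reads_quals[0])
    means = []
    for pos in range(length):
        # Average the error probabilities at this position across reads,
        # then convert the mean probability back to a Phred score.
        probs = [10 ** (-read[pos] / 10) for read in reads_quals]
        mean_p = sum(probs) / len(probs)
        means.append(-10 * math.log10(mean_p))
    return means

# Two toy reads; the second read degrades at its final base.
print(per_position_mean_phred([[37, 37, 37], [37, 37, 2]]))
# [37.0, 37.0, ~5.0] -- one bad base drags the correct average way down
```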

This is because low Phred scores have very little influence on the result when averaging naively. Let's imagine I have 99 perfect bases with Phred score 93. Now I add one base that I know is wrong: I call it N and give it a Phred score of 0. The naively averaged Phred score is still close to 93, but the actual error probability is about 1 in 100, which corresponds to a Phred score of 20. As a result the FastQC plot is almost a flat line for the first 140 bases, which is not what happens in the actual data: Illumina quality declines across the read length.
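The same worked example in plain Python (my own sketch, not FastQC code):

```python
import math

# 99 near-perfect bases (Phred 93) plus one known-wrong N base (Phred 0).
phreds = [93] * 99 + [0]

naive = sum(phreds) / len(phreds)  # ~92.07: the bad base barely registers

# The correct average: mean of the error probabilities, converted back.
mean_p = sum(10 ** (-q / 10) for q in phreds) / len(phreds)
correct = -10 * math.log10(mean_p)  # ~20.0: one wrong base in 100

print(naive, correct)
```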

Anyway, I do agree that using the correct averaging might not actually lead to better problem finding in libraries, especially since everybody is used to the FastQC graphs as they are. This does not mean, in my opinion, that averaging correctly can be dismissed as "not clear or useful". I think it brings out the actual quality patterns in the reads more clearly.

> I'm always open to suggestions for improvement or clarification, but this one isn't a case where there's an obviously correct and best answer I'm afraid.

Yes indeed. Your answer has been very informative. Thank you.