google / deepconsensus

DeepConsensus uses gap-aware sequence transformers to correct errors in Pacific Biosciences (PacBio) Circular Consensus Sequencing (CCS) data.
BSD 3-Clause "New" or "Revised" License

Lower number of >Q30 average quality reads for v1.1 compared to v0.3 #54

Closed: mason-linscott closed this issue 1 year ago

mason-linscott commented 1 year ago

Hi all,

I am assembling the genome of a land snail with extreme repeat content (~85%) and a large genome size (6.6 Gb). My mean insert size is 8 kb, and I have data from six SMRT cells.

I have run DeepConsensus (CPU only) on all six SMRT cells using v0.3 and on two of the SMRT cells using v1.1. I have noticed that I get more reads with >Q20 average quality using v1.1, but fewer reads that are >Q30 compared to v0.3. Histograms of average read quality are attached. v0.3.qchist.txt v1.1.qchist.txt

Manual inspection of the same reads from each version confirmed that most reads had longer regions of lower quality in v1.1 than in v0.3. The first 100 reads for v0.3 and v1.1 are attached (.txt extension for GitHub upload). v0.3_smrtcell1_100.fastq.txt v1.1_smrtcell1_100.fastq.txt

This was surprising to me, as my expectation was that the >Q20 yield would remain relatively constant between versions while the >Q30 yield would increase.

Perhaps this is the result of lower insert length or high repeat content of this library? I would appreciate hearing the DeepConsensus team's thoughts on this discrepancy. Any help would be appreciated!

Thanks! Mason

AndrewCarroll commented 1 year ago

Hi @mason-linscott

Thank you for the report and for using DeepConsensus. When dealing with questions of base/read quality, one issue to discuss is the difference between empirical (true) quality and estimated quality. Ideally, a program should calibrate its quality estimates exactly to the true underlying quality, but this isn't always easy to do. So one general question is: are you seeing lower Q30 values because the empirical quality is lower (one sort of issue), or because the DeepConsensus v1.1 model is more conservative at the upper ranges of quality in certain genome contexts you are encountering, so that the difference shows up only in the estimated quality (a different sort of issue)?

I assume the current quality distributions you have are from the estimated confidence from DeepConsensus outputs. Do you have any orthogonal measure (e.g. kmer analysis or a reasonable assembly) that might be informative with respect to the empirical quality of the data?
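For reference, a kmer-based approach in the style of Merqury estimates an empirical QV from the fraction of kmers supported by an orthogonal (trusted) kmer set, roughly as sketched below; the counts here are hypothetical and would come from a kmer counter such as meryl.

```python
import math

def kmer_qv(shared_kmers: int, total_kmers: int, k: int = 21) -> float:
    """Rough Merqury-style QV estimate from kmer support.

    shared_kmers: kmers in the data being evaluated that also appear in the
                  orthogonal (trusted) kmer set.
    total_kmers:  total kmers in the data being evaluated.
    """
    p_kmer = shared_kmers / total_kmers  # fraction of supported kmers
    p_base = p_kmer ** (1.0 / k)         # a kmer is correct only if all k bases are
    error_rate = 1.0 - p_base
    return float("inf") if error_rate <= 0 else -10.0 * math.log10(error_rate)

# Hypothetical example: 99.9% of 21-mers supported by the orthogonal set.
print(round(kmer_qv(shared_kmers=999_000, total_kmers=1_000_000), 1))  # ~43.2
```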

One difference between v0.3 and v1.1 is that v1.1 is trained on the T2T genome, which extends farther into difficult parts of the human genome than the HG002-based training used for v0.3. In both cases, we do try to exclude certain regions that are not mappable or where we can't be entirely confident in the assembly. All of the feedback we have received on empirical quality in human and non-human species indicates that the T2T training is somewhat better.

It looks like your read qualities at the top end of the distribution are about 1-2 QV points lower in v1.1, with the Q20-Q25 range rescuing a few more reads in v1.1. My first instinct is that DeepConsensus v1.1 has had more exposure to harder regions and may have learned to be a bit more conservative in estimating qualities in those regions.

To confirm or reject that possibility, I think we'd need some other estimate of empirical quality or of the calibration between empirical and estimated qualities in these regions.
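As a sketch of what such a calibration check could look like (assuming alignments of the reads to a trusted reference or assembly were available; the per-base (predicted QV, is-error) pairs below are hypothetical inputs you would extract from those alignments):

```python
import math
from collections import defaultdict

def calibration_table(bases):
    """Bin aligned bases by predicted QV and report the empirical QV per bin."""
    counts = defaultdict(lambda: [0, 0])  # predicted_qv -> [errors, total]
    for predicted_qv, is_error in bases:
        counts[predicted_qv][0] += int(is_error)
        counts[predicted_qv][1] += 1
    table = {}
    for qv, (errors, total) in sorted(counts.items()):
        empirical_qv = -10.0 * math.log10(errors / total) if errors else float("inf")
        table[qv] = (total, empirical_qv)
    return table

# Toy input: 1000 bases predicted at Q30, one of which is actually a mismatch.
toy = [(30, False)] * 999 + [(30, True)]
for qv, (total, emp) in calibration_table(toy).items():
    print(f"predicted Q{qv}: {total} bases, empirical ~Q{emp:.1f}")  # well calibrated here
```

If the empirical QV in the high bins sits noticeably below the predicted QV, the model is overconfident; if it sits above, the model is being conservative.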

Thank you, Andrew

mason-linscott commented 1 year ago

Hi Andrew,

Thank you for your detailed response and I believe I understand your reasoning.

You are correct that the quality distributions I reported are the estimated confidence from the DeepConsensus outputs. Unfortunately, we do not have a reasonably well-assembled genome for this species, or (if I am understanding your kmer analysis point) whole-genome Illumina short reads to enable a comparison through kmer analysis. The only other data we have available for this species is 1.6 billion paired-end reads of Hi-C data prepared from the same individual. However, it is my understanding that the RE digestion step in our Hi-C prep (Proximo kit from Phase Genomics) results in biased coverage across the genome, which makes the data unsuitable for many forms of kmer analysis.

I had a similar gut feeling that the DeepConsensus v1.1 model, due to its training on the T2T genome, is more conservative in its base quality estimates for repetitive regions, which compose most of my land snail genome.

As I do not have an empirical measure for comparison, I plan to attempt a second genome assembly using the DeepConsensus v1.1 outputs and compare it to our existing assembly built from the DeepConsensus v0.3 outputs. If you have any other advice, I would appreciate it. Otherwise, I consider this issue closed.
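For that comparison I will probably start with basic contiguity statistics; a minimal sketch of the kind of summary I have in mind (the contig lengths would come from whichever assembler I end up using):

```python
def assembly_stats(contig_lengths):
    """Basic contiguity summary: contig count, total size, and N50."""
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)
    running, n50 = 0, 0
    for length in lengths:
        running += length
        if running >= total / 2:  # N50: length at which half the assembly is covered
            n50 = length
            break
    return {"contigs": len(lengths), "total_bp": total, "N50": n50}

# Hypothetical contig lengths (bp) for one assembly; run on both and compare.
print(assembly_stats([5_000_000, 3_000_000, 1_000_000]))
```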

If you or other members of the DeepConsensus team are interested in the v0.3 vs. v1.1 genome assembly comparison, I can post the results here in a couple of weeks' time.

All the best and thank you for your help, Mason

AndrewCarroll commented 1 year ago

Hi @mason-linscott

Thank you for the fast reply. You understand me correctly when asking about either a well-assembled reference or Illumina reads for kmers. Given what you have available, I think a comparison between assemblies does make sense, and we'd be quite interested to see the statistics when you have them. Comparing assembly quality can be complicated, but it's likely the best we can do in this situation.

Again, we appreciate the feedback and your effort in detailing the issue and analyzing the data.

Wenfei-Xian commented 1 year ago

Hi Mason, a naive question: how did you generate v0.3.qchist.txt and v1.1.qchist.txt? And do you know which reads are >Q20 and which are >Q30?

mason-linscott commented 1 year ago

Hi Wenfei,

I calculated the quality histograms with the BBMap reformat.sh script and its 'aqhist' flag. The same tool can be used to filter on average read quality with the 'trimq' flag.
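In case it helps, here is a rough Python equivalent of that average-quality histogram (a sketch assuming Phred+33 encoding and averaging on the error-probability scale; I believe that is roughly what aqhist reports, but the BBMap docs are the authority):

```python
import math
from collections import Counter

def avg_read_qv(quality_string, offset=33):
    """Average read quality on the error-probability scale (Phred+33 encoding)."""
    error_probs = [10 ** (-(ord(c) - offset) / 10) for c in quality_string]
    return -10 * math.log10(sum(error_probs) / len(error_probs))

def aq_histogram(fastq_path):
    """Count reads per integer average-quality bin, roughly like aqhist."""
    hist = Counter()
    with open(fastq_path) as handle:
        for i, line in enumerate(handle):
            if i % 4 == 3:  # the 4th line of each FASTQ record is the quality string
                hist[int(avg_read_qv(line.strip()))] += 1
    return dict(sorted(hist.items()))

# Example (using one of the attached files):
# print(aq_histogram("v1.1_smrtcell1_100.fastq.txt"))
```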

Hope this helps!

I will post my deepC 0.3 vs. 1.1 genome stats soon.

Wenfei-Xian commented 1 year ago

Many thanks !!!

pichuan commented 1 year ago

I'll close this issue. Feel free to reopen it if you have more questions, and please feel free to share any updates later. Thanks!

pxxiao-hz commented 1 year ago

I also had this problem with plants. I counted the quality values in the FASTQ files from pbccs and DeepConsensus, as shown in the figure below. I am puzzled by this result: as stated in the paper, Q30 should increase, but in my results it did not. Is there something wrong with my process? I'm looking forward to your answer, thanks. [figure: ccs_vs_deep]

AndrewCarroll commented 1 year ago

Hi @pxxiao-hz

The base quality outputs of pbccs and DeepConsensus are predicted values from each type of model. The degree to which predicted confidence reflects the real or empirical accuracy is called calibration.

We have looked at the calibration of pbccs and DeepConsensus, with a detailed analysis here. In short, pbccs is overconfident in its predicted values when it predicts with confidence Q20+. pbccs is less conservative and predicts far more of its bases at Q93 (as you also see in your plot). DeepConsensus is well-calibrated up to a quality of Q35.

Note that the most important place to be well-calibrated is at Q20, as that value is the cutoff for filtering a HiFi read (and both pbccs and DeepConsensus are reasonably calibrated at this point).

Our conclusion is that DeepConsensus should still be producing more accurate reads; it is just more realistic about the likely quality of the bases (and as a result less over-optimistic) than pbccs.
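To make that concrete, here is a toy illustration (made-up per-base values, not real data) of how pushing many bases to Q93 versus a more conservative estimate shifts the read-level average quality, and therefore which side of a Q30 threshold a read lands on, even when the underlying bases are identical:

```python
import math

def avg_qv(per_base_qvs):
    """Read-level average quality from per-base Phred values (error-probability scale)."""
    mean_error = sum(10 ** (-q / 10) for q in per_base_qvs) / len(per_base_qvs)
    return -10 * math.log10(mean_error)

# The same hypothetical read scored by two models that differ only at the high end.
overconfident = [93] * 980 + [17] * 20  # many bases pushed to Q93
conservative  = [32] * 980 + [17] * 20  # same bases, more conservative estimates

print(round(avg_qv(overconfident), 1))  # ~34.0 -> counted as a >Q30 read
print(round(avg_qv(conservative), 1))   # ~29.9 -> falls just below Q30
```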

pxxiao-hz commented 1 year ago

Thank you for your prompt answer, @AndrewCarroll. I will use these reads for relevant downstream analysis.