schatzlab / genomescope

Fast genome analysis from unassembled short reads
Apache License 2.0
238 stars 56 forks source link

Genome size estimation with HiFi reads #80

Closed niWdooG closed 2 years ago

niWdooG commented 2 years ago

Hello,

I use GenomeScope2 to determine the genome size of A. thaliana with Illumina and HiFi reads. I got a close estimate with Illumina data; however, I am a bit puzzled with the HiFi outcome.

[1] Illumina, 135.6 Mb that is close to the current estimate of 134.6 Mb http://qb.cshl.edu/genomescope/genomescope2.0/analysis.php?code=1rmFuIYQHobDDqGUIO5K [2] PacBio HiFi, 60.2 Mb (default histo) http://qb.cshl.edu/genomescope/genomescope2.0/analysis.php?code=hVSuBwFdvVja7GmXLxN3 [3] PacBio HiFi, 75.8 Mb (histo -h 1000000) http://qb.cshl.edu/genomescope/genomescope2.0/analysis.php?code=yWnRJ3WmBsKo33sCCjhB [4] PacBio HiFi, 164.6 Mb (histo -h 10000000) http://genomescope.org/genomescope2.0/analysis.php?code=LUv3N2omQ57Pvxf5vba2

Could you please give me advice on how to improve the genome size estimation with HiFi reads?

Sincerely, Evgenii

mschatz commented 2 years ago

The genome size is basically estimated by considering the total number of kmers observed (weighted by their observed coverage) and dividing by the average coverage (as determined by the model fit around the main peaks). In the HiFi runs, the average coverage is correctly estimated but as you discovered the genome size estimate is sensitive to the maximum kmer coverage used in the input. This is the expected behavior - if you truncate the histogram early, then real high frequency kmers are excluded especially from satellites, transposable elements, and other repeats leading to underestimating the genome size (as you see in the default HiFi run). On the other hand, if you allow for too many high frequency kmers then it can overcount "contaminating" sequences (as you see in the histo -h 10000000 run). Notice there are a few peaks at around 20,000x coverage (a bit to the right of 1e4 on the log plot). These are likely to organelle sequences (mt and chloroplast) that were present in many copies in the cells you sequenced that are inflating the genome size. For arabidopsis you could screen these out by aligning the raw reads to the references for the organelles - we discuss this in the GenomeScope 1 paper.

For the Illumina data, it looks like the organelle sequences are there with peaks at 1000x and 10,000x coverage (which would inflate the genome size) but the histogram is truncated at 10,000x (which deflates the genome size by excluding real high copy repeats). In a sense you were just lucky these two factors cancelled themselves out. You should see about the same thing with the HiFi data if you had downsampled to a matching amount of coverage and ran GenomeScope with a cutoff of 10,000x.

My general recommendation is to use a high cutoff on the kmer coverage (100,000x or even 1,000,000x) but then screen for any known organelle sequences. For Illumina, you will also need to screen for phiX which is commonly used as a control but can inflate the genome size. Even after this there can be an indeterminate amount of noise so that the estimated genome size can still be ~10% off from the true value

Hope this helps

Mike

On Thu, Jun 30, 2022 at 5:28 PM niWdooG @.***> wrote:

Hello,

I use GenomeScope2 to determine the genome size of A. thaliana with Illumina and HiFi reads. I got a close estimate with Illumina data; however, I am a bit puzzled with the HiFi outcome.

[1] Illumina, 135.6 Mb that is close to the current estimate of 134.6 Mb https://www.arabidopsis.org/portals/genAnnotation/gene_structural_annotation/agicomplete.jsp#:~:text=The%20Arabidopsis%20thaliana%20genome%20was,size%20of%20approximately%20135%2Dmegabases.

http://qb.cshl.edu/genomescope/genomescope2.0/analysis.php?code=1rmFuIYQHobDDqGUIO5K [2] PacBio HiFi, 60.2 Mb (default histo)

http://qb.cshl.edu/genomescope/genomescope2.0/analysis.php?code=hVSuBwFdvVja7GmXLxN3 [3] PacBio HiFi, 75.8 Mb (histo -h 1000000)

http://qb.cshl.edu/genomescope/genomescope2.0/analysis.php?code=yWnRJ3WmBsKo33sCCjhB [4] PacBio HiFi, 164.6 Mb (histo -h 10000000)

http://genomescope.org/genomescope2.0/analysis.php?code=LUv3N2omQ57Pvxf5vba2

Could you please give me advice on how to improve the genome size estimation with HiFi reads?

Sincerely, Evgenii

— Reply to this email directly, view it on GitHub https://github.com/schatzlab/genomescope/issues/80, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABP347BIWMOJQA6NPTFFKLVRYGPBANCNFSM52KW534A . You are receiving this because you are subscribed to this thread.Message ID: @.***>

niWdooG commented 2 years ago

Hello Mike,

Thank you for such a detailed explanation.

Best, Evgenii