schatzlab / genomescope

Fast genome analysis from unassembled short reads
Apache License 2.0

non-error peak potentially marked as error? #79

Closed pbfrandsen closed 2 years ago

pbfrandsen commented 2 years ago

Dear Genomescope developers, thank you for the great tool. We use it a lot and it's great! I had an interesting result recently, which resulted in a smaller genome size estimate than we would expect (nearly half the size). I noticed that the error curve was fit to one of the smaller peaks. I wondered if this could be the reason for the lower-than-expected size estimate. Any thoughts on whether this is the case/whether there is a parameter that we might change to avoid that?

http://genomescope.org/genomescope2.0/analysis.php?code=FsUZp8wcQEMdj4JJUvol

Many thanks,

Paul Frandsen

mschatz commented 2 years ago

Hi Paul,

I agree this looks fishy to me. I think the automatic fit got confused by the very high coverage available. I got a much better fit by giving the model a hint as to where the peaks should be by setting "Average k-mer coverage for polyploid genome" to 72: http://qb.cshl.edu/genomescope/genomescope2.0/analysis.php?code=KlXwOqo0GWE33TTYKcJ8

This shows a (haploid) genome size of about 94 Mbp with a 0.35% heterozygosity rate. Does this seem reasonable? I noticed your k-mer histogram is cut off at 10,000x coverage, so it may be underestimating the genome size a bit, since the cutoff excludes very high copy repeats in satellites and centromeres. For a more robust estimate I would recommend increasing this to 100,000 or more.
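To illustrate why the histogram cutoff matters, here is a minimal Python sketch. It is not GenomeScope's model, which fits a mixture distribution; it uses the much cruder approximation "total k-mer occurrences divided by k-mer coverage". The function name `naive_genome_size`, the `min_cov` error threshold, and the toy histogram values are all assumptions made up for illustration. The 72 is reused from the thread only as an example coverage.

```python
def naive_genome_size(histogram, kcov, min_cov=1, max_cov=None):
    """Rough size estimate: total k-mer occurrences / k-mer coverage.

    histogram maps coverage -> number of distinct k-mers seen at that coverage.
    Bins below min_cov are treated as sequencing errors and skipped.
    Bins above max_cov are dropped, mimicking a truncated histogram.
    """
    total = 0
    for cov, count in histogram.items():
        if cov < min_cov:
            continue
        if max_cov is not None and cov > max_cov:
            continue
        total += cov * count
    return total / kcov


# Toy histogram (hypothetical numbers, not from the dataset in this thread).
toy_histogram = {
    1: 5_000_000,    # low-coverage k-mers, mostly sequencing errors
    72: 1_000_000,   # single-copy k-mers at the main peak
    144: 100_000,    # two-copy k-mers
    50_000: 200,     # very high copy repeats (satellites, centromeres)
}

truncated = naive_genome_size(toy_histogram, kcov=72, min_cov=5, max_cov=10_000)
full = naive_genome_size(toy_histogram, kcov=72, min_cov=5, max_cov=100_000)

print(f"cutoff 10,000x:  {truncated:,.0f} bp")
print(f"cutoff 100,000x: {full:,.0f} bp")
```

In this toy case the 200 high-copy k-mers contribute 10,000,000 k-mer occurrences that the lower cutoff throws away, so the truncated estimate comes out smaller.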

Also, if you are planning to assemble these reads, I'd recommend you randomly downsample so that the main peak (mode of the distribution) is around 50x to 100x. Beyond this range, assemblers tend to get confused and can give a poorer assembly.
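As a rough sketch of the downsampling arithmetic, assuming coverage scales linearly with the fraction of reads kept: the fraction to keep is the target peak divided by the observed peak. The helper names and the 400x observed peak below are hypothetical, since the thread does not state the actual peak position.

```python
import random


def downsample_fraction(observed_peak, target_peak):
    """Fraction of reads to keep so the main peak moves to target_peak."""
    if observed_peak <= 0 or target_peak <= 0:
        raise ValueError("peak coverages must be positive")
    # Never "upsample": if coverage is already below target, keep everything.
    return min(1.0, target_peak / observed_peak)


def keep_read_mask(n_reads, fraction, seed=0):
    """Seeded random keep/drop decision for each read (or read pair)."""
    rng = random.Random(seed)
    return [rng.random() < fraction for _ in range(n_reads)]


# Hypothetical example: main peak observed at 400x, aiming for 75x.
fraction = downsample_fraction(observed_peak=400, target_peak=75)
mask = keep_read_mask(n_reads=1000, fraction=fraction, seed=1)

print(f"keep fraction: {fraction}")
print(f"reads kept out of 1000: {sum(mask)}")
```

For paired-end data, the same mask (or the same seed in whatever subsampling tool is used) should be applied to both files so that mates stay together.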

Good luck

Mike



pbfrandsen commented 2 years ago

Thanks, Mike! That is really close to the published estimates. Much more in line with what we expect. Many thanks for the help.