Genome size underestimated comparing assembled data

paul-bio commented 1 year ago

Hi.

When I performed a de novo assembly using HIFI, I obtained a genome size of 1.7 Gb. Since there was no prior information available about my sample, I decided to estimate the genome size. When I look up the Animal Genome Size Database (http://www.genomesize.com), I found the C-value to be 1.93. Based on this information, I calculated the genome size to be 1.93 pg * 978 Mb, which is 1.87 Gb.

I recently performed short-read sequencing to estimate the genome size using a k-mer based approach. However, when I used GenomeScope2 with a 21-mer analysis, the haploid genome size was identified as 939 Mb, which is half the size of the estimated genome size.

Is there any suggestion?

mschatz commented 1 year ago

Thanks for your interest. There are two well defined peaks so it assumes those are the heterozygous and homozygous kmers at 90x and 180x coverage, respectively, with a high rate of heterozygosity (2.25%). But if your sample is inbred or generally has low rates of heterozygosity (~0.1% or less) the peak at 90x coverage could really be the homozygous peak with an insignificant peak at 45x for heterozygous kmers and the peak at 180x is really from repeats. If so, this would double the overall genome size to match the reported c-value. Id recommend you run BUSCO on your HiFi assembly and see if you have extensive amounts of duplicated genes - if you do, then you probably have a highly heterozygous sample and the GenomeScope estimate is reliable. If you dont have extensive duplicated genes, we will need to optimize GenomeScope for your data (there are some parameters that can be tuned for this)

Hope this helps

Mike

On Mon, Jun 19, 2023 at 1:35 AM paul-bio @.***> wrote:

Hi.

When I performed a de novo assembly using HIFI, I obtained a genome size of 1.7 Gb. Since there was no prior information available about my sample, I decided to estimate the genome size. When I look up the Animal Genome Size Database (http://www.genomesize.com), I found the C-value to be 1.93. Based on this information, I calculated the genome size to be 1.93 pg

978 Mb, which is 1.87 Gb.

I recently performed short-read sequencing to estimate the genome size using a k-mer based approach. However, when I used GenomeScope2 with a 21-mer analysis, the haploid genome size was identified as 939 Mb, which is half the size of the estimated genome size. [image: image] https://user-images.githubusercontent.com/73869977/246734088-c45cd01f-d0f9-4508-8023-4cdceb41138a.png

Is there any suggestion?

— Reply to this email directly, view it on GitHub https://github.com/schatzlab/genomescope/issues/100, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABP3475UNIONBZLAQGNUNDXL7QLXANCNFSM6AAAAAAZLNFMOM . You are receiving this because you are subscribed to this thread.Message ID: @.***>

paul-bio commented 1 year ago

Hi @mschatz , thanks for the fast feedback. I appreciate that.

Given that my target sample is small, we gathered individuals (5 for short-read sequencing, 20 for HIFI sequencing), none of which are inbred lines.

Regarding the assembled results, the BUSCO analysis yielded 95.8%, 82.1%, and 13.7% values for the C, S, and M, respectively. My sample appears to contain some duplicated genes, but the duplication does not appear to be that extensive.

My colleague also obtained a similar genome assembly size of 1.9G from their data. Mine is 1.69G.

If my assembly appears satisfactory, could you provide some suggestions for optimization?

mschatz commented 1 year ago

Can you send the hist file or link to genomescope results?

Mike

On Tue, Jun 20, 2023 at 12:52 AM paul-bio @.***> wrote:

Hi @mschatz https://github.com/mschatz , thanks for the fast feedback. I appreciate that.

Given that my target sample is small, we gathered individuals (5 for short-read sequencing, 20 for HIFI sequencing), none of which are inbred lines.

Regarding the assembled results, the BUSCO analysis yielded 95.8%, 82.1%, and 13.7% values for the C, S, and M, respectively. My sample appears to contain some duplicated genes, but the duplication does not appear to be that extensive.

My colleague also obtained a similar genome assembly size of 1.9G from their data. Mine is 1.69G.

If my assembly appears satisfactory, could you provide some suggestions for optimization?

— Reply to this email directly, view it on GitHub https://github.com/schatzlab/genomescope/issues/100#issuecomment-1598112813, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABP347B3SXRQVAKVISM2ZLXMEUARANCNFSM6AAAAAAZLNFMOM . You are receiving this because you were mentioned.Message ID: @.***>

paul-bio commented 1 year ago

Here is the genomescope result http://genomescope.org/genomescope2.0/analysis.php?code=XWpIeAgeg4tXJEIO5QSA

And for the result generated by using -h 1,000,000 options http://genomescope.org/genomescope2.0/analysis.php?code=TsqhKbj9tLFu3fd7k4Jm

When consider the results above, I think cutoff limit of 1,000,000 histo version seems better for me.

And it would be better for me to get genome size of 1.6-1.9 Gb. Thank you @mschatz

mschatz commented 1 year ago

For low heterozygous samples, you will need to use the "Average k-mer coverage for polyploid genome" parameter as a hint for which peak is really the heterozygous kmers vs homozygous kmers. Here I set it to 45x and get an estimate of 1.87Gbp (with essentially zero heterozygosity): http://qb.cshl.edu/genomescope/genomescope2.0/analysis.php?code=GhVo8JXrNd5mSkqPJxsl

Using the full distribution, the estimated genome size rises to 2.3Gbp but there is a suspicious rise in the kmer coverage for very high frequency kmers. These are almost always technical issues (contamination, poor quality low complexity reads, etc), so would expect the truth to be closer to 1.9Gb than 2.3Gb.

http://qb.cshl.edu/genomescope/genomescope2.0/analysis.php?code=aJT8KN5XeaBzWsw47ZxN

Hope this helps! Mike

On Tue, Jun 20, 2023 at 9:40 AM paul-bio @.***> wrote:

Here is the genomescope result

http://genomescope.org/genomescope2.0/analysis.php?code=XWpIeAgeg4tXJEIO5QSA

And for the result generated by using -h 1,000,000 options

http://genomescope.org/genomescope2.0/analysis.php?code=TsqhKbj9tLFu3fd7k4Jm

When consider the results above, I think cutoff limit of 1,000,000 histo version seems better for me.

And it would be better for me to get genome size of 1.6-1.9 Gb. Thank you @mschatz https://github.com/mschatz

— Reply to this email directly, view it on GitHub https://github.com/schatzlab/genomescope/issues/100#issuecomment-1598814046, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABP342FY3KW4SMD3R3KU4LXMGR47ANCNFSM6AAAAAAZLNFMOM . You are receiving this because you were mentioned.Message ID: @.***>

paul-bio commented 1 year ago

Wow, this seems fantastic to me. Thanks again for your help :) I really appreciate it.

In some cases, when the heterozygosity is low, there can be an insignificant peak. And in my case, you mentioned that it can be 45. How did you determine that the heterozygous peak is at 45x and the homozygous peak is at 90x?

mschatz commented 1 year ago

By eye, the tallest peak is at about 90x coverage. If we assume these are the homozygous kmers then the heterozygous ones will be half this coverage. Note you dont need to input the exact value, GenomeScope will take it as a "hint" to find the best values around the hint. Here is actually finds the heterozygous kmers have a coverage of 44x and the main peak thus has 88x

Good luck

Mike

On Tue, Jun 20, 2023 at 10:14 AM paul-bio @.***> wrote:

Wow, this seems fantastic to me. Thanks again for your help :) I really appreciate it.

In some cases, when the heterozygosity is low, there can be an insignificant peak. And in my case, you mentioned that it can be 45. How did you determine that the heterozygous peak is at 45x and the homozygous peak is at 90x?

— Reply to this email directly, view it on GitHub https://github.com/schatzlab/genomescope/issues/100#issuecomment-1598879668, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABP342O7SWF5Y5L54SNNN3XMGV4TANCNFSM6AAAAAAZLNFMOM . You are receiving this because you were mentioned.Message ID: @.***>

schatzlab / genomescope

Genome size underestimated comparing assembled data #100