Open paul-bio opened 1 year ago
Thanks for your interest. There are two well defined peaks so it assumes those are the heterozygous and homozygous kmers at 90x and 180x coverage, respectively, with a high rate of heterozygosity (2.25%). But if your sample is inbred or generally has low rates of heterozygosity (~0.1% or less) the peak at 90x coverage could really be the homozygous peak with an insignificant peak at 45x for heterozygous kmers and the peak at 180x is really from repeats. If so, this would double the overall genome size to match the reported c-value. Id recommend you run BUSCO on your HiFi assembly and see if you have extensive amounts of duplicated genes - if you do, then you probably have a highly heterozygous sample and the GenomeScope estimate is reliable. If you dont have extensive duplicated genes, we will need to optimize GenomeScope for your data (there are some parameters that can be tuned for this)
Hope this helps
Mike
On Mon, Jun 19, 2023 at 1:35 AM paul-bio @.***> wrote:
Hi.
When I performed a de novo assembly using HIFI, I obtained a genome size of 1.7 Gb. Since there was no prior information available about my sample, I decided to estimate the genome size. When I look up the Animal Genome Size Database (http://www.genomesize.com), I found the C-value to be 1.93. Based on this information, I calculated the genome size to be 1.93 pg
- 978 Mb, which is 1.87 Gb.
I recently performed short-read sequencing to estimate the genome size using a k-mer based approach. However, when I used GenomeScope2 with a 21-mer analysis, the haploid genome size was identified as 939 Mb, which is half the size of the estimated genome size. [image: image] https://user-images.githubusercontent.com/73869977/246734088-c45cd01f-d0f9-4508-8023-4cdceb41138a.png
Is there any suggestion?
— Reply to this email directly, view it on GitHub https://github.com/schatzlab/genomescope/issues/100, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABP3475UNIONBZLAQGNUNDXL7QLXANCNFSM6AAAAAAZLNFMOM . You are receiving this because you are subscribed to this thread.Message ID: @.***>
Hi @mschatz , thanks for the fast feedback. I appreciate that.
Given that my target sample is small, we gathered individuals (5 for short-read sequencing, 20 for HIFI sequencing), none of which are inbred lines.
Regarding the assembled results, the BUSCO analysis yielded 95.8%, 82.1%, and 13.7% values for the C, S, and M, respectively. My sample appears to contain some duplicated genes, but the duplication does not appear to be that extensive.
My colleague also obtained a similar genome assembly size of 1.9G from their data. Mine is 1.69G.
If my assembly appears satisfactory, could you provide some suggestions for optimization?
Can you send the hist file or link to genomescope results?
Mike
On Tue, Jun 20, 2023 at 12:52 AM paul-bio @.***> wrote:
Hi @mschatz https://github.com/mschatz , thanks for the fast feedback. I appreciate that.
Given that my target sample is small, we gathered individuals (5 for short-read sequencing, 20 for HIFI sequencing), none of which are inbred lines.
Regarding the assembled results, the BUSCO analysis yielded 95.8%, 82.1%, and 13.7% values for the C, S, and M, respectively. My sample appears to contain some duplicated genes, but the duplication does not appear to be that extensive.
My colleague also obtained a similar genome assembly size of 1.9G from their data. Mine is 1.69G.
If my assembly appears satisfactory, could you provide some suggestions for optimization?
— Reply to this email directly, view it on GitHub https://github.com/schatzlab/genomescope/issues/100#issuecomment-1598112813, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABP347B3SXRQVAKVISM2ZLXMEUARANCNFSM6AAAAAAZLNFMOM . You are receiving this because you were mentioned.Message ID: @.***>
Here is the genomescope result http://genomescope.org/genomescope2.0/analysis.php?code=XWpIeAgeg4tXJEIO5QSA
And for the result generated by using -h 1,000,000 options http://genomescope.org/genomescope2.0/analysis.php?code=TsqhKbj9tLFu3fd7k4Jm
When consider the results above, I think cutoff limit of 1,000,000 histo version seems better for me.
And it would be better for me to get genome size of 1.6-1.9 Gb. Thank you @mschatz
For low heterozygous samples, you will need to use the "Average k-mer coverage for polyploid genome" parameter as a hint for which peak is really the heterozygous kmers vs homozygous kmers. Here I set it to 45x and get an estimate of 1.87Gbp (with essentially zero heterozygosity): http://qb.cshl.edu/genomescope/genomescope2.0/analysis.php?code=GhVo8JXrNd5mSkqPJxsl
Using the full distribution, the estimated genome size rises to 2.3Gbp but there is a suspicious rise in the kmer coverage for very high frequency kmers. These are almost always technical issues (contamination, poor quality low complexity reads, etc), so would expect the truth to be closer to 1.9Gb than 2.3Gb.
http://qb.cshl.edu/genomescope/genomescope2.0/analysis.php?code=aJT8KN5XeaBzWsw47ZxN
Hope this helps! Mike
On Tue, Jun 20, 2023 at 9:40 AM paul-bio @.***> wrote:
Here is the genomescope result
http://genomescope.org/genomescope2.0/analysis.php?code=XWpIeAgeg4tXJEIO5QSA
And for the result generated by using -h 1,000,000 options
http://genomescope.org/genomescope2.0/analysis.php?code=TsqhKbj9tLFu3fd7k4Jm
When consider the results above, I think cutoff limit of 1,000,000 histo version seems better for me.
And it would be better for me to get genome size of 1.6-1.9 Gb. Thank you @mschatz https://github.com/mschatz
— Reply to this email directly, view it on GitHub https://github.com/schatzlab/genomescope/issues/100#issuecomment-1598814046, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABP342FY3KW4SMD3R3KU4LXMGR47ANCNFSM6AAAAAAZLNFMOM . You are receiving this because you were mentioned.Message ID: @.***>
Wow, this seems fantastic to me. Thanks again for your help :) I really appreciate it.
In some cases, when the heterozygosity is low, there can be an insignificant peak. And in my case, you mentioned that it can be 45. How did you determine that the heterozygous peak is at 45x and the homozygous peak is at 90x?
By eye, the tallest peak is at about 90x coverage. If we assume these are the homozygous kmers then the heterozygous ones will be half this coverage. Note you dont need to input the exact value, GenomeScope will take it as a "hint" to find the best values around the hint. Here is actually finds the heterozygous kmers have a coverage of 44x and the main peak thus has 88x
Good luck
Mike
On Tue, Jun 20, 2023 at 10:14 AM paul-bio @.***> wrote:
Wow, this seems fantastic to me. Thanks again for your help :) I really appreciate it.
In some cases, when the heterozygosity is low, there can be an insignificant peak. And in my case, you mentioned that it can be 45. How did you determine that the heterozygous peak is at 45x and the homozygous peak is at 90x?
— Reply to this email directly, view it on GitHub https://github.com/schatzlab/genomescope/issues/100#issuecomment-1598879668, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABP342O7SWF5Y5L54SNNN3XMGV4TANCNFSM6AAAAAAZLNFMOM . You are receiving this because you were mentioned.Message ID: @.***>
Hi.
When I performed a de novo assembly using HIFI, I obtained a genome size of 1.7 Gb. Since there was no prior information available about my sample, I decided to estimate the genome size. When I look up the Animal Genome Size Database (http://www.genomesize.com), I found the C-value to be 1.93. Based on this information, I calculated the genome size to be 1.93 pg * 978 Mb, which is 1.87 Gb.
I recently performed short-read sequencing to estimate the genome size using a k-mer based approach. However, when I used GenomeScope2 with a 21-mer analysis, the haploid genome size was identified as 939 Mb, which is half the size of the estimated genome size.
Is there any suggestion?