schatzlab / genomescope

Fast genome analysis from unassembled short reads
Apache License 2.0
258 stars 56 forks source link

alloploid genome size estimate #107

Open s-kyungyong opened 1 year ago

s-kyungyong commented 1 year ago

Hello!

In the paper, it is mentioned that "The current best assembly spans 15.34 Gbp, while the GenomeScope estimate is 14.1 Gbp (see Supplementary Figs. 21 and 22)". Fig. S21, GenomeScope estimated a haplotype size as 2.3545G, and this gives a genome size of 14.1 Gb (2.3545 * 6) for the hexaploid genome (AABBDD). However, I thought that the reference wheat genomes only have three representative haplotypes as each subgenome (e.g., AA) is homozygous and collapsed during the assembly, and the estimated genome size for 15.3 Gb is for ABD and not for AABBDD.

I also have a allotetraploid genome with a final genome size of 10.5 Gb (AB collapsed from AABB due to very low heterozygosity). GenomeScope estimated the haplotype size as 2.64 Gb, and this gives 2.64 x 4 = 10.56 Gb for AABB. The estimate looks right, but the calculation wasn't so clear to me. Perhaps, I am misunderstanding something! Could you clarify this for me please?

Thank you so much! Kyungyong

mschatz commented 1 year ago

Thanks for writing. Genomes (or subgenomes) with very low heterozygosity are always tricky because it can be hard to interpret what the different peaks represent. Looking at the wheat data again (especially supp fig 21) I think we may have misinterpreted the peaks. Now I think the peak at ~50x comes from kmers that are homozygous in each subgenome and not heterozygous kmers. In retrospect we should have imposed a "peak" at 25x to capture heterozygous kmers (of which there are very few). This would have the effect of expanding the haploid genome size (1C) to ~4.6Gb rather than 2.34G, and imply we should multiply 4.6Gb x 3 to get the 1C genome of ~13.8GB which is close to the published assembly and other previous reports from flow cytometry.

For your data, your analysis sounds right - if the haplotype size is 2.64, then you should multiply by 4 to get 10.56Gb for AABB. If you send the link to the genomescope output I can take a look to confirm

Good luck

Mike

On Thu, Aug 24, 2023 at 2:01 PM skyungyong @.***> wrote:

Hello!

In the paper, it is mentioned that "The current best assembly spans 15.34 Gbp, while the GenomeScope estimate is 14.1 Gbp (see Supplementary Figs. 21 and 22)". Fig. S21, GenomeScope estimated a haplotype size as 2.3545G, and this gives a genome size of 14.1 Gb (2.3545 * 6) for the hexaploid genome (AABBDD). However, I thought that the reference wheat genomes only have three representative haplotypes as each subgenome (e.g., AA) is homozygous and collapsed during the assembly, and the estimated genome size for 15.3 Gb is for ABD and not for AABBDD.

I also have a allotetraploid genome with a final genome size of 10.5 Gb (AB collapsed from AABB due to very low heterozygosity). GenomeScope estimated the haplotype size as 2.64 Gb, and this gives 2.64 x 4 = 10.56 Gb for AABB. The estimate looks right, but the calculation wasn't so clear to me. Perhaps, I am misunderstanding something! Could you clarify this for me please?

Thank you so much! Kyungyong

— Reply to this email directly, view it on GitHub https://github.com/schatzlab/genomescope/issues/107, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABP343YAJCDUWLBTWKN4TLXW6JHVANCNFSM6AAAAAA35MQCJA . You are receiving this because you are subscribed to this thread.Message ID: @.***>

s-kyungyong commented 1 year ago

Hi Mike,

Thank you for checking that! The situation I have for my genome is actually the same as the case of wheat genome. My genome is highly homozygous and thus collapsed to produce AB rather than AABB. This means that each haplotype is not 2.64Gb estimated by GenomeScope but ~5.0Gb. I will attatch the genomescope outputs here.

GenomeScope_output.zip

Thank you! Kyungyong

mschatz commented 1 year ago

Thanks! Your analysis seems reasonable. If you share the histogram (or link to genomescope) I can try adjusting the parameters to get this value automatically

Good luck! Mike

On Fri, Sep 8, 2023 at 8:28 PM skyungyong @.***> wrote:

Hi Mike,

Thank you for checking that! The situation I have for my genome is actually the same as the case of wheat genome. My genome is highly homozygous and thus collapsed to produce AB rather than AABB. This means that each haplotype is not 2.64Gb estimated by GenomeScope but ~5.0Gb. I will attatch the genomescope outputs here.

GenomeScope_output.zip https://github.com/schatzlab/genomescope/files/12564270/GenomeScope_output.zip

Thank you! Kyungyong

— Reply to this email directly, view it on GitHub https://github.com/schatzlab/genomescope/issues/107#issuecomment-1712358570, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABP345P5MTVEZK55V6PZT3XZOZZRANCNFSM6AAAAAA35MQCJA . You are receiving this because you commented.Message ID: @.***>

s-kyungyong commented 1 year ago

Hi Mike,

Here is the histrogram! These were the commandlines I used: reads.histo.zip

jellyfish count -C -m 21 -s 50000000000 -t 20 Kronos.HiFi.fastq -o kmer_counts.jf
jellyfish histo -h 5000000 -t 20 kmer_counts.jf > reads.histo
genomescope2 -p 4 -i reads.histo -o genomescope --verbose  

Thank you! Kyungyong