schatzlab / genomescope

Fast genome analysis from unassembled short reads
Apache License 2.0

Estimated genome size is half #132

Open gunjanpandey opened 1 week ago

gunjanpandey commented 1 week ago

I have assembled a "suspected" highly heterozygous genome using HiFi reads. The assembled genome size is 8.2G, which gives the following BUSCO results.

Could you please help me understand how to interpret these results and how to perform this analysis properly? I believe the genome size estimate I am getting is somehow half of the real value.

image

I ran the following for genome size estimation in GenomeScope2, using paired-end Illumina files.

meryl count k=19 output k19.meryl ${R1} ${R2}
meryl histogram k19.meryl/ > k19_meryl.hist
Rscript genomescope2.0/genomescope.R -i k19_meryl.hist -k 19 -o k19_genomescpe

and I get the following results summary: image

And the graph: image

@rahulvrane, thoughts?

mschatz commented 1 week ago

Can you send the link to the genomescope webpage with your results? Sometimes the automatic modeling process gets confused and needs a hint on how to fit the model.

And you report the assembled genome size was 8.2G - is this the total amount of sequence that was assembled? If so, the difference is explained by genomescope reporting the haploid genome size while the assembly size will be about twice this amount for highly heterozygous samples. This is because the two haplotypes will separate out, and cause the duplicate genes that you see in the BUSCO report. For example, for humans it reports the (haploid) genome size as 3Gbp while a phased assembly will be about 6Gbp.

Good luck!

Mike
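
As a rough check on the explanation above, the arithmetic can be sketched in shell. The 8.2 Gbp figure is the assembly size reported in this thread; the ploidy of 2 and the premise that both haplotypes were assembled separately are assumptions for illustration, not something the thread confirms.

```shell
#!/usr/bin/env bash
# Assumption: diploid sample whose two haplotypes were assembled separately.
assembly_gbp=8.2   # total assembled sequence reported in this thread
ploidy=2           # assumed, not confirmed in the thread

# Haploid size implied by the assembly if both haplotypes are fully present.
expected_haploid_gbp=$(awk -v a="$assembly_gbp" -v p="$ploidy" 'BEGIN { printf "%.1f", a / p }')
echo "Implied haploid genome size: ${expected_haploid_gbp} Gbp"
```

If the GenomeScope estimate lands near this value, the "half size" result would be consistent with a haploid estimate rather than a modeling error.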

gunjanpandey commented 1 week ago

Thanks for a quick reply @mschatz

The website is giving me an error, so I am uploading the file here: k19_meryl.zip. It is for a k-mer length of 19 from a 150 bp paired-end Illumina library, the same data as in the screenshots above.
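
For readers following along, a histogram like this can be given a quick sanity check before any model fitting. The sketch below is not GenomeScope's model. It is a naive estimate (total k-mers divided by a chosen coverage depth) run on a made-up toy histogram, and it assumes two whitespace-separated columns: k-mer frequency and number of distinct k-mers. The function name `naive_size`, the cutoff of 5 for error k-mers, and all numbers are hypothetical. It shows how picking a peak at half the depth doubles the size estimate.

```shell
#!/usr/bin/env bash
# Toy histogram with made-up numbers: column 1 = k-mer frequency,
# column 2 = number of distinct k-mers seen at that frequency.
hist=$(mktemp)
cat > "$hist" <<'EOF'
1 5000
2 300
20 400
40 100
EOF

# naive_size FILE DEPTH MIN_FREQ
# Sums frequency * count over rows with frequency >= MIN_FREQ (skipping
# likely error k-mers), then divides by the assumed coverage depth.
naive_size() {
  awk -v depth="$2" -v min="$3" \
    '$1 >= min { total += $1 * $2 } END { printf "%d\n", total / depth }' "$1"
}

size_at_40=$(naive_size "$hist" 40 5)   # depth taken from the 40x peak
size_at_20=$(naive_size "$hist" 20 5)   # depth taken from the 20x peak
echo "Estimate using depth 40: ${size_at_40}"
echo "Estimate using depth 20: ${size_at_20}"

rm -f "$hist"
```

To try it on real data, replace the here-document with the path to an actual histogram file and pick depths from the peaks visible in the plot.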