schatzlab / genomescope

Fast genome analysis from unassembled short reads
Apache License 2.0

Tetraploid genome size #95

Open Seajull opened 1 year ago

Seajull commented 1 year ago

Hi,

I have HiFi data from a tetraploid and I have a small problem with the genome size estimation. I expected a haploid size of around 500 Mb (a very rough estimate based on the Plant DNA C-values Database and on the genome size of a closely related diploid cultivar), but the k-mer analysis estimated the haploid genome size at ~911 Mb (GenomeScope results: http://qb.cshl.edu/genomescope/genomescope2.0/analysis.php?code=BC3X145cYoNg4lCW4ykq).

If I set the max k-mer coverage to 1,000x (results: http://qb.cshl.edu/genomescope/genomescope2.0/analysis.php?code=r0Ti7SV9NKCRCiiVmvaZ), I get an estimate of ~634 Mb, which is more in line with what I expected, but it seems to remove a lot of legitimate repeats.
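To illustrate why a coverage cap lowers the estimate, here is a minimal sketch of a naive k-mer based size calculation. It is not GenomeScope's model (GenomeScope fits a mixture model to the spectrum); the function name, the parameters, and the toy histogram are all hypothetical and only show how discarding high-coverage k-mers also discards the repeats they represent.

```python
def naive_haploid_size(histogram, one_copy_cov, ploidy, max_cov=None):
    """Naive estimate: total k-mer occurrences / (ploidy * 1-copy coverage).

    histogram    : dict mapping k-mer coverage -> number of distinct k-mers
    one_copy_cov : coverage of a k-mer present in a single haplotype copy
    max_cov      : k-mers above this coverage are ignored (None = keep all)
    """
    total = 0
    for cov, count in histogram.items():
        if max_cov is not None and cov > max_cov:
            continue  # high-copy k-mers (repeats, organelles) are dropped here
        total += cov * count
    return total / (ploidy * one_copy_cov)


# Toy spectrum (made-up numbers): two genomic peaks plus a high-copy tail.
toy = {10: 2_000_000, 40: 1_000_000, 2000: 10_000}

uncapped = naive_haploid_size(toy, one_copy_cov=10, ploidy=4)
capped = naive_haploid_size(toy, one_copy_cov=10, ploidy=4, max_cov=1000)
print(f"uncapped: {uncapped:,.0f} bp, capped at 1000x: {capped:,.0f} bp")
```

In this toy case the few k-mers at 2000x account for a quarter of the estimate, so whether they are real repeats or contaminants changes the answer substantially.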

I tried removing mitochondrial and chloroplast sequences from the reads, but the k-mer spectrum (http://qb.cshl.edu/genomescope/genomescope2.0/analysis.php?code=vBQ8cXi544fpf7nBdUFx) and the estimate stay the same (~911 Mb).

Can I trust the estimate of ~911 Mb, or is there something I'm missing? Thanks

mschatz commented 1 year ago

Thanks for your interest. It looks like a clean fit without any obvious major issues. There are a few minor peaks around 1000x coverage that are probably the mt and cp genomes, but as you say, it doesn't look like those are major contributors. It is possible there are other contaminants (especially bacterial or fungal) hidden in the data, but it seems unlikely they would make much of a contribution. So overall I would expect the haploid genome size to indeed be around 900 Mbp. I would kick off hifiasm to see how it looks. Happy to discuss more once you have the initial assembly.

Good luck! Mike

Seajull commented 1 year ago

Thanks for the reply!

I've done an assembly with hifiasm and got these results:

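| Assembly | # Contigs | Largest contig (bp) | Total length (bp) | N50 (bp) |
|----------|-----------|---------------------|-------------------|----------|
| Primary | 708 | 132 991 633 | 1 416 097 408 | 58 830 830 |
| Hap 1 | 915 | 89 106 950 | 1 729 854 892 | 42 005 481 |
| Hap 2 | 358 | 70 055 398 | 1 694 400 085 | 39 557 181 |
| Unitig | 16 528 | 5 136 471 | 2 194 833 753 | 701 463 |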

At first I thought that hap 1 and hap 2 each contained two pseudo-haplotypes, which adds up to ~800-900 Mb per haplotype (in line with the k-mer analysis), but the unitig graph is only ~2.2 Gb, which means ~500-600 Mb per haplotype, as expected.
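The arithmetic behind those two readings can be sketched as follows. The total lengths come from the hifiasm results above; the divisors are assumptions taken from the reasoning in this comment (two pseudo-haplotypes per hap assembly, four haplotypes in the unitig graph), and the helper name is made up for illustration.

```python
# Total lengths in bp, copied from the hifiasm results.
total_length = {
    "Hap 1": 1_729_854_892,
    "Hap 2": 1_694_400_085,
    "Unitig": 2_194_833_753,
}


def per_haplotype_mb(total_bp, n_haplotypes):
    """Average size per haplotype in Mb, assuming n_haplotypes equal copies."""
    return total_bp / n_haplotypes / 1e6


# Assumption 1: each hap assembly holds two pseudo-haplotypes.
hap1_mb = per_haplotype_mb(total_length["Hap 1"], 2)
hap2_mb = per_haplotype_mb(total_length["Hap 2"], 2)

# Assumption 2: the unitig graph spans all four haplotypes of the tetraploid.
unitig_mb = per_haplotype_mb(total_length["Unitig"], 4)

print(f"Hap 1 / 2:  {hap1_mb:.0f} Mb per haplotype")
print(f"Hap 2 / 2:  {hap2_mb:.0f} Mb per haplotype")
print(f"Unitig / 4: {unitig_mb:.0f} Mb per haplotype")
```

Dividing the unitig total by four is only a rough guide: regions that are identical between haplotypes may be represented once in the graph, so the true per-haplotype size could be larger than this figure.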