schatzlab / genomescope

Fast genome analysis from unassembled short reads
Apache License 2.0
250 stars 56 forks source link

only 1 peak in a heterozygous diploid #98

Closed sallycylau closed 1 year ago

sallycylau commented 1 year ago

Hi there

I am having trouble fitting genomescope2 to my HiFi data and I am hoping if you could point me to the right direction on how to proceed.

The genome size estimation is what I expected for my species. The linear plot is showing the full model overlapping with errors, but the transformed_linear_plot is much clearer in presenting the data. So I can see that this is a diploid with high heterozygosity (which makes sense as this is a wild marine invertebrate specimen). But for some reason genomescope isn't indicating the data is characterised by high heterozygosity, and that the full model is only showing 1 peak. I wonder what might be causing this issue? Thank you!

linear_plot transformed_linear_plot

property min max
Homozygous (aa) 98.5315% 100%
Heterozygous (ab) 0% 1.46848%
Genome Haploid Length 1,671,603,854 bp 3,366,896,208 bp
Genome Repeat Length 878,309,778 bp 1,769,066,190 bp
Genome Unique Length 793,294,076 bp 1,597,830,018 bp
Model Fit 46.958% 92.6345%
Read Error Rate 0.0826566% 0.0826566%

mschatz commented 1 year ago

Hi, Thanks for your interest. This is a fairly poor fit - can you send the histogram file or link to the genomescope webpage with your results?

Thanks! Mike

On Fri, Jun 9, 2023 at 3:26 AM Sally Lau @.***> wrote:

Hi there

I am having trouble fitting genomescope2 to my HiFi data and I am hoping if you could point me to the right direction on how to proceed.

The genome size estimation is what I expected for my species. The linear plot is showing the full model overlapping with errors, but the transformed_linear_plot is much clearer in presenting the data. So I can see that this is a diploid with high heterozygosity (which makes sense as this is a wild marine invertebrate specimen). But for some reason genomescope isn't indicating the data is characterised by high heterozygosity, and that the full model is only showing 1 peak. I wonder what might be causing this issue? Thank you!

[image: linear_plot] https://user-images.githubusercontent.com/60337082/244620824-d95aaae1-5b1a-4bc8-9199-20e9a14c4458.png [image: transformed_linear_plot] https://user-images.githubusercontent.com/60337082/244620856-8369274c-70b7-4c71-80fc-40efd9aeeed5.png

property min max Homozygous (aa) 98.5315% 100% Heterozygous (ab) 0% 1.46848% Genome Haploid Length 1,671,603,854 bp 3,366,896,208 bp Genome Repeat Length 878,309,778 bp 1,769,066,190 bp Genome Unique Length 793,294,076 bp 1,597,830,018 bp Model Fit 46.958% 92.6345% Read Error Rate 0.0826566% 0.0826566%

— Reply to this email directly, view it on GitHub https://github.com/schatzlab/genomescope/issues/98, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABP343GYG7SU74SCSK65HLXKLFZJANCNFSM6AAAAAAZAIISSM . You are receiving this because you are subscribed to this thread.Message ID: @.***>

sallycylau commented 1 year ago

Hi Mike

Thanks so much for the quick reply. Here is the genomescope link (http://genomescope.org/genomescope2.0/analysis.php?code=tC8i3MBprfe3k57Irbbz). .hist file was generated with meryl. Any tips for how to fit the model would be much appreciated. Thank you!

mschatz commented 1 year ago

Hi Sally,

I was able to get a slightly better fit here: http://qb.cshl.edu/genomescope/genomescope2.0/analysis.php?code=T9COTpC3AJlsDduOF6rw

The only change was to provide a hint on the "Average k-mer coverage for polyploid genome" which I set to 40x. This gives much higher heterozygosity (3.04%) although the genomesize is cut in half (this is expected because I effectively changed which peaks are the homozygous and heterozygous values). Do you know if 3% heterozygosity and this genome size is realistic? The fit is still relatively poor and doesnt do a great job capturing all of the kmers with coverage around 20x. Is this from a single individual or where multiple samples sequenced together? Else you might have a bit of contamination present - have you run an assembly yet? Have you tried screening it with kraken?

Good luck!

Mike

On Mon, Jun 12, 2023 at 3:31 AM Sally Lau @.***> wrote:

Hi Mike

Thanks so much for the quick reply. Here is the genomescope link ( http://genomescope.org/genomescope2.0/analysis.php?code=tC8i3MBprfe3k57Irbbz). .hist file was generated with meryl. Any tips for how to fit the model would be much appreciated. Thank you!

— Reply to this email directly, view it on GitHub https://github.com/schatzlab/genomescope/issues/98#issuecomment-1586747024, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABP344ZU2MJRZA5FBPQ5A3XK3ATPANCNFSM6AAAAAAZAIISSM . You are receiving this because you commented.Message ID: @.***>

sallycylau commented 1 year ago

Hi Mike

Thanks for your tips and sorry for the late follow up. I had to drop this project for a while and am now catching back up!

In your supplied link... I think setting "Average k-mer coverage for polyploid genome" to 40x which resulted in 3% het and 1.3 GB genome actually makes some sense....

I went ahead and assembled the genome with pretty aggressive purging (hifiasm -l3 + purged_dup) as this was found to give me the best completed assemblies (judging by N50 and contig #, and Complete BUSCO didn't vary much by changing hifiasm -l. Purged_dups definitely needed for duplications). I ended with a primary assembly of 1.8GB with BUSCO eukaryota C:96.90 [S: 91.80, D:5.10], F:1.2, M:1.9; and an alternate assembly of 1.3GB.

I ran Kraken on the purged assembles and 98.89% came back unclassified for primary and 99.97% for alternate, so I think contamination isn't the main issue here? We definitely only sequenced a single individual. This is a wild Antarctic brittle star from a place that receives a lot of current and gene flow from other places.... not ideal I am aware!

I ran smudgeplot as well and it does show this species as a diploid (but very heterozygous)

smudgeplot_smudgeplot

I tried to run Genomescope again with "Average k-mer coverage for polyploid genome" set as 30x which the estimated genome size came back as 1.9GB which matches the value in my purged primary assembly. However the fit is still quite poor. http://genomescope.org/genomescope2.0/analysis.php?code=PAA93V8MApgPGmKlPZrh

I wonder if there is still any way I could improve my Genomescope results? Thank you!

Best regards Sally