Follow up RE: Confounding GenomeScope Output

foreignsand commented 3 months ago

I submitted this issue previously here and am following up.

I have now performed kmer counts using Meryl on ccs output from my PacBio HiFi run for Q20, Q30, Q40, and Q50 reads. Ultimately, the initial count and histogram output from evaluating the Q20 ccs reads was the 'closest' to what would be expected, but at half the expected genome size, this is still not at all remotely close to what I would expect.

Could this be because this genome is expected to contain a large number of repeats? Anyone have any thoughts as to why this might be happening?

Here are the GenomeScope results:

GenomeScope.pdf

mschatz commented 3 months ago

This is estimating an extremely high level of heterozygosity (13%), which is >10 times larger than a typical Arabidopsis F1 and >100 times larger than the rate in humans (0.1%). Im guessing the model fit was confused here

If we give it a hint that the haploid coverage is closer to 4 or 5 (by setting Average k-mer coverage for polyploid genome to 4), it estimates the genome size to be about 2.1Gbp with very little heterozygosity. Is this more realistic?

http://genomescope.org/genomescope2.0/analysis.php?code=DX26czpevZxuwufPUMGo

The fit is still borderline, but that is largely because of the low coverage overall.

Hope this helps

Mike

On Tue, Apr 30, 2024 at 12:17 PM foreignsand @.***> wrote:

I submitted this issue previously here https://github.com/schatzlab/genomescope/issues/126 and am following up.

I have now performed kmer counts using Meryl on ccs output from my PacBio HiFi run for Q20, Q30, Q40, and Q50 reads. Ultimately, the initial count and histogram output from evaluating the Q20 ccs reads was the 'closest' to what would be expected, but at half the expected genome size, this is still not at all remotely close to what I would expect.

Could this be because this genome is expected to contain a large number of repeats? Anyone have any thoughts as to why this might be happening?

Here are the GenomeScope results:

GenomeScope.pdf https://github.com/schatzlab/genomescope/files/15167828/GenomeScope.pdf

— Reply to this email directly, view it on GitHub https://github.com/schatzlab/genomescope/issues/129, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABP344KW6JJOD737IHC74TY767ZDAVCNFSM6AAAAABHASTVE2VHI2DSMVQWIX3LMV43ASLTON2WKOZSGI3TCOJUHE4DSMY . You are receiving this because you are subscribed to this thread.Message ID: @.***>

foreignsand commented 3 months ago

Thank you for your response! That is helpful, although I'm a little confused about how the coverage would be low. You mean low k-mer coverage? Although I still find this confusing. We estimated the PacBio sequencing coverage to be around 80X, so I would think this would translate to k-mer coverage.

Would genome repeat content play a role? A recently released genome for a closely-related species consisted of 39.74% repeat sequences. I wonder what effect that would have here.

Cheers, Emily

mschatz commented 3 months ago

If the pacbio coverage was 80x, then yes, we would expect a peak around 80x. I wonder if something went wrong with the kmer counting. Can you try kmc or jellyfish on the full read set?

Hope this helps

Mike

On Thu, May 2, 2024 at 3:13 PM foreignsand @.***> wrote:

Thank you for your response! That is helpful, although I'm a little confused about how the coverage would be low. You mean low k-mer coverage? Although I still find this confusing. We estimated the PacBio sequencing coverage to be around 80X, so I would think this would translate to k-mer coverage.

Would genome repeat content play a role? A recently released genome for a closely-related species consisted of 39.74% repeat sequences. I wonder what effect that would have here.

Cheers, Emily

— Reply to this email directly, view it on GitHub https://github.com/schatzlab/genomescope/issues/129#issuecomment-2091338626, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABP344DD52547UHVXDAH7LZAKF4PAVCNFSM6AAAAABHASTVE2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAOJRGMZTQNRSGY . You are receiving this because you commented.Message ID: @.***>

foreignsand commented 3 months ago

I'll try that and see how that goes. Thanks!

Cheers, Emily

foreignsand commented 3 months ago

Sorry for the delay. There's a battle raging for cores and memory on my cluster.

I ran meryl on all subreads and the k21 output provided the most reasonable GenomeScope results I've been able to get:

genomescope_k21_allsubs.pdf

The genome size is in the ballpark—based on similar species we estimate it will be ~2.4 Gb—and even though the heterozygosity is high relative to what one would normally expect, this individual is from an incredibly isolated and inbred subpopulation of martens that lives in a small strip of coastal habitat. Initial population genetics analyses of a number of individuals from this subpopulation suggest that the heterozygosity may, in reality, be extremely high, surprisingly so, to the extent that we decided to perform whole genome assembly to verify if this is actually the case.

Anyway, I hope this is helpful... now I'm wondering if I should try whole genome assembly with all the subreads? I don't think this is what I'm supposed to do with hifiasm, but I can always give it a try and see what happens.

Thanks so much!

Cheers, Emily

schatzlab / genomescope

Follow up RE: Confounding GenomeScope Output #129