schatzlab / genomescope

Fast genome analysis from unassembled short reads
Apache License 2.0

Help interpreting output #75

Open lpereira89 opened 2 years ago

lpereira89 commented 2 years ago

Hello,

I ran GenomeScope 2.0 on two HiFi datasets for two plant genomes. In both cases I get unexpected results and I am unsure how to interpret them.

Genome 1: http://genomescope.org/genomescope2.0/analysis.php?code=EUqq9F7CPpVQzDObKuPJ Genome size is about what I estimated by flow cytometry, but there is a tall error peak. The coverage is lower, though. Are these sequencing errors (shouldn't be that high in HiFi reads, I think)? Could it be due to sample contamination?

Genome 2: http://genomescope.org/genomescope2.0/analysis.php?code=1q4UixxtZ4acEF5YCQmQ Genome size is lower than what I estimated by flow cytometry. Also, there are only 20% unique sequences. Since I am not 100% sure of the ploidy, I ran the model as if it were polyploid as well: http://genomescope.org/genomescope2.0/analysis.php?code=MrIaUcE0wQuyNdqlEcGA Can we infer ploidy from this analysis? Do the four peaks suggest tetraploidy?

Thank you for your help, Lara

mschatz commented 2 years ago

Hi Lara,

On Tue, May 10, 2022 at 3:31 PM Lara Pereira @.***> wrote:

Genome 1:

Yes, this looks weird to me. I would guess that the first peak (around 10x coverage) is likely to be contamination of some kind. You could try assembling with hifiasm and then screening the contigs for those at around this coverage level. Then I'd BLAST those contigs to see what they might be.
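As a rough illustration of the screening step, here is a minimal Python sketch that picks out contigs whose depth is close to the suspicious peak. It assumes the assembly graph is a GFA file whose segment (`S`) lines carry a read-depth tag named `rd:i:` and a length tag `LN:i:`; check your own hifiasm output before relying on those tag names. The function name, the `target` of 10 and the `tolerance` of 0.5 are placeholders chosen for the example, and read depth is only an approximation of the k-mer coverage shown in the GenomeScope plot.

```python
def contigs_near_depth(gfa_lines, target=10.0, tolerance=0.5):
    """Return (name, length, depth) for GFA segments whose depth lies
    within target * (1 +/- tolerance).

    Assumption: segment lines look like
        S <name> <sequence or *> LN:i:<length> rd:i:<depth>
    """
    low, high = target * (1 - tolerance), target * (1 + tolerance)
    hits = []
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3 or fields[0] != "S":
            continue  # skip links, headers, and malformed lines
        # Collect optional TAG:TYPE:VALUE fields into a dict.
        tags = {}
        for field in fields[3:]:
            parts = field.split(":", 2)
            if len(parts) == 3:
                tags[parts[0]] = parts[2]
        if "rd" not in tags:
            continue  # no depth recorded for this segment
        depth = float(tags["rd"])
        # Fall back to the sequence length when LN is absent.
        length = int(tags.get("LN", len(fields[2])))
        if low <= depth <= high:
            hits.append((fields[1], length, depth))
    return hits


# Toy input; real use would pass an open file handle instead.
example = [
    "S\tctg1\tACGT\tLN:i:4\trd:i:42",
    "S\tctg2\tACGTACGT\tLN:i:8\trd:i:9",
    "S\tctg3\tAC\tLN:i:2\trd:i:12",
    "L\tctg1\t+\tctg2\t-\t0M",
]
for name, length, depth in contigs_near_depth(example):
    print(name, length, depth)
```

The names returned this way could then be used to pull the matching sequences out of the assembly FASTA for a BLAST search.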

Genome 2:

Yes, the four peaks strongly suggest tetraploidy. The haplotypes are quite distinct, so I would expect hifiasm to do a good job separating out the four haplotypes. Then I would try running BUSCO to look for core genes, and I would expect most core genes to be present in multiple copies (even though they are supposed to be present just once).
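To make the "four peaks" reasoning concrete: in a tetraploid k-mer spectrum the peaks are expected to sit at roughly 1x, 2x, 3x and 4x the coverage of the first genomic peak. The Python sketch below is only a quick sanity check on a toy histogram, not what GenomeScope does (GenomeScope fits a mixture model to the whole spectrum). The function names, the `min_cov` cutoff used to skip the low-coverage error k-mers, and the toy data are all assumptions made for illustration.

```python
def local_maxima(counts, min_cov=5):
    """Coverage values whose count is strictly higher than both neighbours.

    counts[c] is the number of distinct k-mers seen c times; coverages
    below min_cov are ignored to skip the error k-mer region.
    """
    return [
        c
        for c in range(max(1, min_cov), len(counts) - 1)
        if counts[c] > counts[c - 1] and counts[c] > counts[c + 1]
    ]


def peak_multiples(peaks, base):
    """Express each peak position as the nearest integer multiple of base."""
    return [round(p / base) for p in peaks]


def toy_histogram(centers, heights, length=100, slope=10):
    """Build a synthetic spectrum made of triangular bumps (illustration only)."""
    counts = [0] * length
    for center, height in zip(centers, heights):
        for c in range(length):
            counts[c] += max(0, height - slope * abs(c - center))
    return counts


counts = toy_histogram(centers=[20, 40, 60, 80], heights=[50, 80, 40, 60])
peaks = local_maxima(counts)
print(peaks, peak_multiples(peaks, peaks[0]))
```

Real histograms are noisy, so smoothing the counts before looking for maxima would usually be needed; the fitted model in the GenomeScope output remains the more reliable guide.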

Good luck!

Mike

mrmrwinter commented 2 years ago

If I could tag onto this before it is closed, could you also explain the percentages of ploidy at the top of the plot, please?

Am I right to think that the "aaaa" at >90% indicates tetraploidy at a level of >90% of the length of the genome?

Also, is the distinction between "aaaa", "aabb", "aabc", etc. based on the ratios of the observed levels of heterozygosity?

Many thanks,

Mike