schatzlab / genomescope

Fast genome analysis from unassembled short reads
Apache License 2.0

tetraploid genome result interpretation #141

Open Liyong-Zhang opened 2 months ago

Liyong-Zhang commented 2 months ago

Hi there,

I am using GenomeScope2 to check the heterozygosity rate of a plant genome (2n=28) with HiFi reads.

I ran mummerplot on the initial Hifiasm assembly, using the A. thaliana genome as the reference, and the plot suggests that this plant is a tetraploid (mummerplot_v6.PNG: https://github.com/user-attachments/assets/143621b8-ebce-476a-9a8b-f93099255bbc).

The command I used to estimate the heterozygosity rate is: genomescope.R -i reads_fasta.histo -o ./ -p 4 -k 21 -n "p4"

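The results are in p4_summary.txt (https://github.com/user-attachments/files/17134674/p4_summary.txt), together with four plots:

p4_linear_plot.png: https://github.com/user-attachments/assets/275fafcc-cef7-4f64-99ae-4aded40b0c3a
p4_log_plot.png: https://github.com/user-attachments/assets/ff307c13-7044-4792-bc4a-a7bde6a39527
p4_transformed_linear_plot.png: https://github.com/user-attachments/assets/98f5eb6a-0e01-4398-a67b-838ea7f6932b
p4_transformed_log_plot.png: https://github.com/user-attachments/assets/6065c534-002a-4e5b-884e-c286c3ce7a9e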

I have trouble understanding the results. What is the overall heterozygosity rate? Is it 7.85371%?

Also, the rates for two of the heterozygous forms are very low (aabc 0.001%, abcd 0.0121%). Given that, could I treat this plant as a diploid when assembling with Hifiasm, since Hifiasm doesn't fully support polyploid genomes yet (https://hifiasm.readthedocs.io/en/latest/faq.html#are-polyploid-genomes-supported)?

Thank you so much for your help!

mschatz commented 1 month ago

From the shape of the plot, this does look like a tetraploid sample with 4 well-defined peaks. Most of the heterozygosity is in the aabb form (8.32%), so I would guess this is an allotetraploid formed by hybridization of two diploid genomes that each had relatively low rates of heterozygosity (~0.001% to 0.454%). If you run BUSCO on your assembly, do you see that most core genes are represented 4 times? When you align those copies, do you see that they have high similarity?

For the assembly, yes, I would try running hifiasm in diploid mode. I would expect this to work reasonably well, although it might get confused by the residual heterozygosity across the subgenomes. If you can generate Hi-C data, that can potentially be used to improve the assembly, e.g. https://www.nature.com/articles/s41477-019-0487-8

Good luck!

Mike
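
For readers following the question about the overall rate: as far as I understand the GenomeScope2 tetraploid model, the overall heterozygosity is the sum of the four heterozygous forms (aaab, aabb, aabc, abcd), i.e. everything that is not aaaa. Below is a minimal Python sketch of that arithmetic. The function name is made up for illustration. The aabb, aabc and abcd values are the ones quoted in this thread, which may come from different columns of the summary; the aaab value is not stated here, so it is a placeholder that should be replaced with the number from p4_summary.txt.

```python
def total_heterozygosity(form_rates_percent):
    """Sum per-form heterozygosity rates (in percent) into one overall rate."""
    if any(rate < 0 for rate in form_rates_percent.values()):
        raise ValueError("rates must be non-negative percentages")
    return sum(form_rates_percent.values())


# aabb, aabc and abcd are the values quoted in this thread.
# aaab is a PLACEHOLDER (not given in the thread); take it from p4_summary.txt.
rates = {"aaab": 0.0, "aabb": 8.32, "aabc": 0.001, "abcd": 0.0121}

total = total_heterozygosity(rates)
print(f"overall heterozygosity: {total:.4f}%")
print(f"implied homozygous (aaaa): {100.0 - total:.4f}%")
```

With the real aaab value filled in, the total should fall within the range that the summary file reports for the heterozygous row, if the assumption above holds.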

Liyong-Zhang commented 1 month ago

Dear Mike,

Thank you for the detailed interpretation of the results. That's very good to know. Looks like it's a tetraploid, just not sure whether it's auto or allo.

I have already run BUSCO on my initial assembly, which was built with Hifiasm together with my Hi-C data. However, many of the core genes are present in only 2 or 3 copies (full_table.txt: https://github.com/user-attachments/files/17219023/full_table.txt). It's very confusing, and I am trying to troubleshoot it right now.

Thank you! Liyong

mschatz commented 1 month ago

I think that makes sense when the genome is tetraploid but has pairs of chromosomes that are quite similar. For example, let's label the 4 chromosomes as A1, A2, B1, B2, where A1 & A2 are similar and B1 & B2 are similar. Then the assembler is likely to collapse A1+A2 into a single contig and B1+B2 into a single contig, so the core genes will often be represented in 2 copies. But across the whole genome, some genes will have more variation in A1 vs A2 or B1 vs B2, so you can end up with 3 or even 4 copies. On the other hand, highly conserved genes will be more similar across all 4 chromosomes, so you will just end up with a single copy. It is definitely a tricky situation!

Mike
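
One way to check how often each situation above occurs is to tally the number of copies per core gene from the BUSCO full table. The Python sketch below is only an illustration. It assumes the usual tab-separated layout, with the BUSCO id in the first column, the status (Complete, Duplicated, Fragmented or Missing) in the second, and one row per copy for Duplicated genes; the function name and the sample rows are made up.

```python
from collections import Counter


def busco_copy_histogram(lines):
    """Map copy number -> number of BUSCO genes found with that many copies."""
    copies = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.split("\t")
        busco_id, status = fields[0], fields[1]
        if status in ("Complete", "Duplicated"):
            copies[busco_id] += 1  # one row per complete copy
        else:
            copies.setdefault(busco_id, 0)  # Fragmented or Missing: no complete copy
    return Counter(copies.values())


# Hypothetical rows, for illustration only (not taken from the attached file).
sample = [
    "# Busco id\tStatus\tSequence\tGene Start\tGene End\tStrand\tScore\tLength",
    "g1\tComplete\tctg1\t10\t900\t+\t500.0\t300",
    "g2\tDuplicated\tctg1\t1000\t1900\t+\t480.0\t290",
    "g2\tDuplicated\tctg2\t50\t950\t-\t479.0\t290",
    "g3\tMissing",
]

for n_copies, n_genes in sorted(busco_copy_histogram(sample).items()):
    print(f"{n_copies} copies: {n_genes} genes")
```

To run it on a real table, pass an open file handle instead of the sample list, for example `busco_copy_histogram(open("full_table.txt"))`. A peak at 2 copies would be consistent with the collapsed A1+A2 / B1+B2 picture.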

Liyong-Zhang commented 1 month ago

Thanks, Mike. That does make sense. Unfortunately, it is indeed a really tricky situation. I am trying to tweak Hifiasm's parameters to see whether that improves the assembly. The previous scaffolding was done with YaHS. I am also considering newer tools such as the one described at https://pubmed.ncbi.nlm.nih.gov/39287126/.

mschatz commented 1 month ago

It is a hard problem. You might need to generate ultralong nanopore reads to fully resolve it. I know the Verkko team has been working on extending the algorithm to higher ploidy.

Good luck! Mike

Liyong-Zhang commented 1 month ago

Thanks, Mike, for the informative feedback! I will certainly keep this option in mind.

Best regards, Liyong