Closed jacopoM28 closed 2 months ago
The report is a bit confusing here: the pre-consensus lengths are in homopolymer-compressed space, while the post-consensus lengths are not. It's normal to see about a 1.4x inflation going from compressed to uncompressed space, so the size change looks normal.
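To illustrate why lengths differ between the two spaces, here is a minimal Python sketch of homopolymer compression. It is not Canu's code; the function name `homopolymer_compress` and the toy sequences are made up for the example. On a uniformly random sequence the ratio comes out near 4/3, because each base differs from the previous one with probability 3/4, which is in the same range as the 1.4x mentioned above.

```python
import random
from itertools import groupby


def homopolymer_compress(seq: str) -> str:
    """Collapse every run of identical bases into a single base."""
    return "".join(base for base, _run in groupby(seq))


# Toy sequence with a few homopolymer runs (hypothetical, not real data).
toy = "AAACCGTTTTAG"
toy_compressed = homopolymer_compress(toy)
print(toy_compressed, len(toy) / len(toy_compressed))

# Uniformly random sequence: uncompressed/compressed length should be close to 4/3.
rng = random.Random(0)
random_seq = "".join(rng.choice("ACGT") for _ in range(100_000))
ratio = len(random_seq) / len(homopolymer_compress(random_seq))
print(f"inflation factor on random sequence: {ratio:.3f}")
```

Real genomes, and repeat-rich ones in particular, can deviate from the random-sequence figure, so the exact factor for a given assembly is something to measure rather than assume.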
It's quite possible for both HiCanu and hifiasm to leave haplotype duplication in the primary assembly when the haplotypes are too diverged or structurally different. I don't think the genome is very homozygous when evaluated with HiFi data; for comparison, HiFi data normally produces a 6 Gb assembly for human genomes. I suggest running purge_dups (https://canu.readthedocs.io/en/latest/faq.html#my-genome-size-and-assembly-size-are-different-help) and seeing whether the genome size is more in line with expectation after that. You've already got 300 Mb of alt, so if purge_dups removes another 100-150 Mb you'd end up with two very similar-sized haplotype assemblies.
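The arithmetic behind that last point can be sketched in a few lines of Python. This assumes the 566 Mb reported in the question is the primary (non-bubble) set, and that whatever purge_dups removes from the primary set is counted as alternate haplotype sequence; both are assumptions for illustration, and `rebalance` is a hypothetical helper, not part of any tool.

```python
def rebalance(primary_mb: int, alt_mb: int, purged_mb: int) -> tuple[int, int]:
    """Move purged_mb from the primary set to the alternate set.

    Assumption: all purged sequence is haplotype duplication and is
    counted as alternate sequence afterwards.
    """
    return primary_mb - purged_mb, alt_mb + purged_mb


PRIMARY_MB = 566  # assembly size without bubbles, from the question (assumed primary)
ALT_MB = 300      # alternate sequence already separated, from the reply

for purged in (100, 150):  # the guessed range, not a measurement
    new_primary, new_alt = rebalance(PRIMARY_MB, ALT_MB, purged)
    print(f"purged {purged} Mb -> primary {new_primary} Mb, alt {new_alt} Mb")
```

Under those assumptions the two sets land within a few tens of Mb of each other, and both end up closer to the 380 Mb estimate than the original 566 Mb.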
Dear Canu developers,
I am working on a diploid insect genome sequenced with HiFi reads on a Sequel IIe platform. GenomeScope estimated a genome size of 380 Mb, with highly abundant k-mers included (maximum k-mer count of 5,000,000), and low levels of heterozygosity.
The genome appears to be quite repetitive, and preliminary analyses of the reads revealed that 24% of the sequence could be composed of a single tandem repeat family.
Canu version 2.2 was installed via Conda and run with default settings:
canu -p Fpar_Canu_asm -d . genomeSize=380000000 -pacbio-hifi
The final assembly size, excluding bubbles (566 Mb), was much greater than the GenomeScope estimate, and the same tandem repeat family previously identified in the reads made up 40% of the assembled genome. Similar results were also obtained with hifiasm.
Upon closer inspection of the Canu report, it seems that the consensus step greatly increased the assembly size compared to the UNITIGGING/CONTIGS step. Is this normal?
Considering the low genome-wide heterozygosity but the apparently huge coverage of a single tandem repeat family, is it possible that the tandem repeat arrays are being artificially extended due to extreme haplotypic variation within these regions?
Here is the complete Canu report:
Thank you in advance for your assistance!
Jacopo