Closed ptranvan closed 3 years ago
Given the genome is triploid and each haplotype is 142 Mb, 400 Mb is right about the expected assembly size if the haplotypes aren't collapsed. Given the high diversity, I expect most contigs are already a single haplotype (as long as the spacing between variants is less than the HiFi read length, the contig will be phased). The contigs aren't duplicated; they represent the full genome in your sample rather than picking a single arbitrary haplotype. So the assembly isn't a problem and there's no reason to use unitigs instead of contigs.
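To make the size argument concrete, here is a minimal sketch of the arithmetic in Python. The 142 Mb haplotype size, the ploidy of 3, and the 400 Mb assembly size come from this thread; the function and variable names are made up for illustration.

```python
def expected_uncollapsed_size_mb(haplotype_size_mb: float, ploidy: int) -> float:
    """Expected assembly size if every haplotype is assembled separately."""
    return haplotype_size_mb * ploidy


# Values quoted in this thread.
haplotype_mb = 142.0   # size of one haplotype
ploidy = 3             # triploid
observed_mb = 400.0    # size of the assembly

expected_mb = expected_uncollapsed_size_mb(haplotype_mb, ploidy)
ratio = observed_mb / expected_mb

print(f"expected uncollapsed size: {expected_mb:.0f} Mb")
print(f"observed / expected: {ratio:.2f}")
```

A ratio close to 1 is consistent with all three haplotypes being assembled separately rather than with an assembly error.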
We typically use purge_dups, not purge_haplotigs, so I'm not sure how to adjust parameters for purge_haplotigs. For purge_dups you often need to adjust the thresholds for purging. See discussions #1814 and dfguan/purge_dups#38 for how to choose these thresholds. You could also try running BUSCO and purge_dups on the HiCanu contigs after removing any bubbles (the def line will say suggestBubble=true), as that may create clearer peaks and make it easier for purge_dups to pick thresholds.
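A minimal sketch of the bubble-removal step, in plain Python with no external libraries. It drops every FASTA record whose def line flags it as a bubble. The comment above writes the flag as suggestBubble=true; the sketch also accepts suggestBubble=yes in case the contig headers spell it that way, which is an assumption to check against your own file. The function names and the example records are hypothetical.

```python
import re
from typing import Iterable, Iterator, List, Tuple

# Accept either spelling of the flag (assumption: check your own headers).
BUBBLE_FLAG = re.compile(r"\bsuggestBubble=(yes|true)\b", re.IGNORECASE)


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    seq_parts: List[str] = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq_parts)
            header = line[1:]
            seq_parts = []
        elif header is not None and line:
            seq_parts.append(line)
    if header is not None:
        yield header, "".join(seq_parts)


def drop_bubbles(records: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep only records whose header is not flagged as a bubble."""
    return [(h, s) for h, s in records if not BUBBLE_FLAG.search(h)]


# Tiny made-up example; real def lines carry more fields.
example = """>tig00000001 len=12 class=contig suggestBubble=no
ACGTACGTACGT
>tig00000002 len=8 class=contig suggestBubble=yes
ACGTACGT
>tig00000003 len=4 class=contig suggestBubble=true
ACGT
"""

kept = drop_bubbles(read_fasta(example.splitlines()))
for header, seq in kept:
    print(f">{header}\n{seq}")
```

For a real assembly, pass an open file handle to read_fasta instead of the example string and write the kept records to a new FASTA before running BUSCO and purge_dups.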
> most contigs are already a single haplotype
Hi, how can I get unitigs instead of contigs if I want them?
You can't. The unitigs that Canu used to make were not true unitigs and we instead focused on detecting and fixing misassembled contigs.
Thanks for your advice @skoren. CANU (with bubble filtered) + purge_dups worked quite well.
I am wondering what is specific to HiCanu compared to the other HiFi assemblers (such as IPA or hifiasm, which didn't work well even after another round of purge_dups or purge_haplotigs)?
Do you have any idea?
I don't have experience running IPA, but I expect hifiasm and HiCanu to be similar in terms of how much phasing they can do with HiFi data alone.
So I am trying to find an explanation. HiCanu mostly tends to retain haplotypes (rather than collapsing them), right?
All HiFi assemblers end up retaining haplotypes; hifiasm and IPA do too. The difference is that hifiasm has a built-in purge_dups procedure that Canu does not, so it tries to remove the retained haplotypes after the assembly.
Hi, I am working on a highly heterozygous triploid species; see the GenomeScope plot:
http://qb.cshl.edu/genomescope/genomescope2.0/analysis.php?code=nnC4CPmgLE3605rbyM7y
My data is HiFi and I used the -pacbio-hifi option. The assembly has been deduplicated using purge_haplotigs, but the final size (400 Mb) is way bigger than the GenomeScope estimated size (~150 Mbp, which I think is correct) and the BUSCO duplicate score is high. From what I understand, this could be due to haplotype switching, with some contigs that could be partially duplicated. Considering this issue, I have tried to work with unitigs instead. After purge_haplotigs, the assembly size is a bit better (212 Mb) but there are still a lot of duplicates. I would like to know if you have already faced this situation and if you have other recommendations for parameters for my species. Thanks.