marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

Parameters for triploid HIFI #2021

Closed ptranvan closed 3 years ago

ptranvan commented 3 years ago

Hi, I am working on a highly heterozygous triploid species; see the GenomeScope plot:

http://qb.cshl.edu/genomescope/genomescope2.0/analysis.php?code=nnC4CPmgLE3605rbyM7y

My data is HiFi and I used -pacbio-hifi. The assembly has been deduplicated using purge_haplotigs, but the final size (400 Mb) is much bigger than the GenomeScope estimated size (~150 Mbp, which I think is correct) and the BUSCO duplicate score is high.

From what I understand, this could be due to haplotype switching, with some contigs that could be partially duplicated. Considering this issue, I have tried to work with unitigs instead. After purge_haplotigs, the assembly size is a bit better (212 Mb) but there are still a lot of duplicates.

I would like to know if you have already faced this situation and if you have other recommendations for parameters for my species. Thanks.

skoren commented 3 years ago

Given the genome is triploid and each haplotype is 142 Mb, 400 Mb is right about the expected assembly size if the haplotypes aren't collapsed. Given the high diversity, I expect most contigs are already a single haplotype (as long as the spacing between variants is less than the HiFi read length, the contig will be phased). The contigs aren't duplicated; they represent the full genome in your sample rather than picking a single arbitrary haplotype. So the assembly isn't a problem and there's no reason to use unitigs instead of contigs.
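To make the arithmetic behind that estimate explicit, here is a minimal sketch. The function name is hypothetical, and it assumes all three haplotypes are fully separated in the assembly, which is an idealization.

```python
def expected_uncollapsed_size(haploid_size_mb: float, ploidy: int) -> float:
    """Total assembly size (Mb) if every haplotype is assembled separately.

    Assumption: no haplotype is collapsed and all copies are the same size.
    """
    return haploid_size_mb * ploidy


haploid_mb = 142.0   # per-haplotype size quoted in the comment above
ploidy = 3           # triploid sample
observed_mb = 400.0  # assembly size reported in the original question

expected_mb = expected_uncollapsed_size(haploid_mb, ploidy)
ratio = observed_mb / expected_mb

print(f"expected {expected_mb:.0f} Mb, observed {observed_mb:.0f} Mb "
      f"({ratio:.0%} of expected)")
```

An observed size close to ploidy times the haploid size points to retained haplotypes rather than an assembly error.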

We typically use purge_dups, not purge_haplotigs, so I'm not sure how to adjust parameters for purge_haplotigs. For purge_dups you often need to adjust the thresholds for purging. See discussions #1814 and dfguan/purge_dups#38 for more on how to choose these thresholds. You could also try running BUSCO and purge_dups on the HiCanu contigs after removing any bubbles (the def line will say suggestBubble=true), as that may create clear peaks and make it easier for purge_dups to pick thresholds.
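The bubble-removal step could be sketched as a small FASTA filter like the one below. This is an illustration only, not a Canu tool: the function names are hypothetical, and it assumes the flag appears in the def line as `suggestBubble=<value>`. Because the exact value may differ between Canu versions, the sketch treats both `true` and `yes` as marking a bubble; check your own contig headers before relying on it.

```python
import re
from typing import Iterable, Iterator, Tuple

# Assumption: the flag is written in the def line as suggestBubble=<value>.
BUBBLE_FLAG = re.compile(r"\bsuggestBubble=(\w+)")


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, seq = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        elif header is not None:
            seq.append(line.strip())
    if header is not None:
        yield header, "".join(seq)


def is_bubble(header: str) -> bool:
    """True if the def line flags the contig as a suggested bubble."""
    match = BUBBLE_FLAG.search(header)
    return bool(match) and match.group(1).lower() in {"true", "yes"}


def drop_bubbles(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield only the contigs that are not flagged as bubbles."""
    for header, seq in read_fasta(lines):
        if not is_bubble(header):
            yield header, seq


# Tiny in-memory example; real input would be the assembler's contig FASTA.
example = [
    ">tig1 len=4 suggestBubble=no\n", "ACGT\n",
    ">tig2 len=4 suggestBubble=yes\n", "TTTT\n",
]
for header, seq in drop_bubbles(example):
    print(f">{header}\n{seq}")
```

The filtered contigs can then be written to a new FASTA file and passed to BUSCO and purge_dups as suggested above.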

btrainee commented 3 years ago

most contigs are already a single haplotype

Hi, how can I get unitigs instead of contigs if I want them?

brianwalenz commented 3 years ago

You can't. The unitigs that Canu used to make were not true unitigs and we instead focused on detecting and fixing misassembled contigs.

ptranvan commented 3 years ago

Thanks for your advice @skoren. Canu (with bubbles filtered) + purge_dups worked quite well.

I am wondering what is specific to HiCanu compared to the other HiFi assemblers (such as IPA or hifiasm, which didn't work well even after another round of purge_dups or purge_haplotigs)?

Do you have any idea about it?

skoren commented 3 years ago

I don't have experience running IPA but I expect hifiasm and hicanu to be similar in terms of how much phasing they can do with hifi data alone.

ptranvan commented 3 years ago

So I am trying to find an explanation. HiCanu mostly tends to retain haplotypes (rather than collapsing them), right?

skoren commented 3 years ago

All HiFi assemblers end up retaining haplotypes; hifiasm and IPA do too. The difference is that hifiasm has a built-in purge_dups procedure that Canu does not, so it tries to remove the retained haplotypes after the assembly.