guangtugao closed this issue 2 years ago
Your bubbles aren't that large relative to the assembly, about 15%. Trio binning of the reads isn't perfect; there are likely to be some mis-binned reads, which can cause bubbles. There may also be recurrent errors in the HiFi reads, which would lead to bubbles as well.
HiFi assembly is very stringent when considering overlaps: they have to be perfect to be used. That means any differences will show up as bubbles. In general, the assembly with HiFi reads is more stringent than trio binning. For very diverse genomes, I actually prefer to assemble the full dataset and then use the trio information afterwards to split a contig if needed. Most contigs will be fully phased with HiFi data, provided variants are frequent enough (e.g. spaced more closely than a HiFi read length).
Thank you, Sergey! Actually, this is a very diverse genome. We made a hybrid F1 individual from parents of two subspecies. With the approach you suggested, do you mean I first run canu using the sequences from both haplotypes, and then run canu -haplotype to separate the contigs? I think this is a smart approach. Thanks a lot! Guangtu Gao
I mean run Canu with all reads and then run merqury (https://github.com/marbl/merqury) to identify haplotype blocks and split contigs if needed.
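A rough sketch of that workflow, in case it helps. The read file names, the output directory, and k=21 are placeholders I made up, not values from this thread, and MERQURY is assumed to point at a Merqury checkout. By default the script only prints the commands; set RUN=1 to actually execute them.

```shell
#!/bin/sh
# Sketch only: all file names below are hypothetical placeholders.
# Commands are printed by default; set RUN=1 to execute them.
run() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$@"; fi
}

MERQURY=${MERQURY:-/path/to/merqury}   # assumed location of the Merqury checkout
ALL_READS=all-hifi.fasta.gz            # hypothetical: unbinned HiFi reads from the F1
K=21                                   # assumed k-mer size; choose one suited to the genome

# 1. Assemble the full, unbinned read set.
run canu -p bct -d canu_all genomeSize=2.4g -pacbio-hifi "$ALL_READS"

# 2. Build k-mer databases for the child and both parents
#    (parental read files are hypothetical names).
run meryl count k="$K" "$ALL_READS" output child.meryl
run meryl count k="$K" maternal.fastq.gz output mat.meryl
run meryl count k="$K" paternal.fastq.gz output pat.meryl

# 3. Derive haplotype-specific k-mers (hap-mers) from the trio.
run "$MERQURY/trio/hapmers.sh" mat.meryl pat.meryl child.meryl

# 4. Evaluate the assembly; with hap-mers supplied, Merqury also reports
#    haplotype blocks, which show where a contig switches haplotype.
run "$MERQURY/merqury.sh" child.meryl mat.hapmer.meryl pat.hapmer.meryl \
    canu_all/bct.contigs.fasta bct_merqury
```

Splitting a contig at a haplotype switch is not shown here; that step depends on what the block report looks like for your assembly.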
Thanks, Sergey!
Hello,
I ran HiCanu (Canu 2.2) on a cluster (USDA ARS Ceres) with 67x HiFi reads for a 2.4 Gb genome assembly. The command is:
    canu -p bct -d canu_default genomeSize=2.4g \
      batMemory=128 corMemory=64 cnsMemory=128 \
      -pacbio-hifi ../haps/haplotype/haplotype-BCT.fasta.gz
You can see that the input reads are the output from trio-binning for one haplotype.
At the end I got the following results:
The number of contigs and their sizes make sense to me, but I don't understand why I got so many bubbles.
Thanks, Guangtu Gao