There are changes in the more recent versions that build longer pseudo-haplotypes (that is, a long contig that switches between maternal and paternal segments) rather than splitting them up.
It's possible this change is confusing 3d-dna. We've also seen that 3d-dna is very unstable, with minor input changes leading to huge differences, especially in assembly splitting. It is quite possible it is introducing false-positive splits where the assembly is accurate. You could check the HiFi or CLR coverage in the regions where the assembly was split by 3d-dna; if there is good support from the read data but 3d-dna is splitting anyway, it's definitely not an issue with the assembly.
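For reference, a minimal sketch of that coverage check; the file names, contig name, and coordinates below are placeholders, and the `map-hifi` preset assumes a recent minimap2 (use `map-pb` for CLR reads):

```
# Map the raw reads back to the Canu contigs (placeholder file names).
minimap2 -ax map-hifi canu.contigs.fasta hifi_reads.fastq.gz \
  | samtools sort -o reads_to_contigs.bam
samtools index reads_to_contigs.bam

# Inspect per-base depth around a position where 3d-dna introduced a break;
# steady coverage across the junction suggests the contig is read-supported.
samtools depth -a -r contig_12:4990000-5010000 reads_to_contigs.bam \
  | awk '{sum += $3; n++} END {print "mean depth:", sum / n}'
```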
In short, I wouldn't assume there is a problem with the Canu assembly just because 3d-dna has trouble scaffolding it. If you're not already, I'd use purge_dups to make a primary assembly and only try to scaffold that. I'd also suggest trying other Hi-C scaffolding tools like SALSA besides 3d-dna.
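A sketch of the standard purge_dups workflow from its README, assuming HiFi reads and placeholder file names:

```
# Coverage cutoffs from read-to-contig alignments.
minimap2 -x map-hifi canu.contigs.fasta hifi_reads.fastq.gz | gzip -c > reads.paf.gz
pbcstat reads.paf.gz                       # writes PB.base.cov and PB.stat
calcuts PB.stat > cutoffs 2> calcuts.log

# Contig self-alignment to find duplicated/alternate regions.
split_fa canu.contigs.fasta > contigs.split.fa
minimap2 -x asm5 -DP contigs.split.fa contigs.split.fa | gzip -c > self.paf.gz

# Flag duplicates and extract the primary set (purged.fa) plus haplotigs (hap.fa).
purge_dups -2 -T cutoffs -c PB.base.cov self.paf.gz > dups.bed 2> purge_dups.log
get_seqs -e dups.bed canu.contigs.fasta
```

The resulting purged.fa would then be the input to SALSA or 3d-dna.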
Any updates? Did you try another scaffolding tool?
Thanks for your reply. I also tried the ALLHiC and LACHESIS pipelines for scaffolding, but neither of them gave a proper result.
I solved this problem by using Flye + 3d-DNA. The contig N50 is only 3 Mb, but it could clearly anchor the contigs to the expected 12 chromosomes. I'm afraid the new Canu algorithm that builds longer pseudo-haplotypes may introduce some misconnections between reads. Since my genome is nearly homozygous (estimated heterozygosity is 0.05%), the switching may misconnect some reads. I guess that for chromosome-level assembly, contig accuracy may be more important than length. I don't know if we can use stricter parameters in Canu to prevent the generation of very long contigs; I'll try `-haplotype` when I have time.
I had tried SALSA for scaffolding before, using the contigs from the old Canu version, but that result was also not very good, with a lower anchor rate, so I didn't try SALSA this time.
Anyway, I think I finally have some clues about my problem. Thanks a lot.
Glad you found a way to scaffold. However, I don't think that explanation can be correct. Both Flye and Canu 1.7 will make some switches between haplotypes, so haplotype switches will be present in all three assemblies. A haplotype switch is also not a mis-connection: switches happen across large homozygous stretches, so those reads do in fact sit next to each other in one version of the genome. A 0.05% heterozygosity is high relative to the HiFi error rate, so I'd expect your 2.0 assembly to be almost double the genome size. Flye and Canu 1.7, in contrast, will collapse 0.05% divergence down into a single sequence (still a pseudo-haplotype, but representing half the genome). That is likely the source of the issue, and it's why we recommend purge_dups before scaffolding with SALSA or 3D-DNA.
Thanks for your concern, but the total assembly size from 2.0 is not double the genome size. Actually, the total assembly sizes from Canu 2.0 and Flye are similar, both a little smaller than the estimated genome size. I used `--keep-haplotypes` to prevent collapsing between haplotypes during the Flye assembly. Does `-haplotype` in Canu have a similar function?
No, the `-haplotype` option is for binning reads using trio information. It is very strange that you're not ending up with double the genome size, as that is what we get for human genomes, which are far more homozygous than yours. What is the exact canu command you are using?
Just the default parameters:

```
canu maxMemory=500G maxThreads=40 useGrid=false -assemble \
  -p sp -d sp genomeSize=1200m \
  -pacbio-hifi /MY_PATH/sp.ccs.fa.gz
```
That looks correct. Can you share the data so I can try a local run? Otherwise, could you send the full report file and the `unitigging/4-unitigger/*001.filterOverlaps.thr000.num000.log` files?
Sorry, I deleted the log files, but I uploaded the HiFi reads to your FTP (ftp.cbcb.umd.edu) today; the file name is `sp1102.ccs.fa.gz`. Hope it can help.
I downloaded the dataset and ran it through both Canu 2.0 and hifiasm, which each produced assemblies of about 1.1 Gbp with NG50 > 50 Mbp. Looking at the heterozygosity, it's lower than 0.05% (http://qb.cshl.edu/genomescope/analysis.php?code=k4Rzvb8shb1QkCaWw3Dz), and even the GenomeScope estimate is likely over-estimating it. Overall, this genome seems much more homozygous than the typical human genomes I've looked at. Based on this, I don't think anything to do with pseudo-haplotypes or heterozygosity could be the cause of the scaffolding issues. I do think running purge_dups on the assembly is still warranted since, based on the k-mers, the genome size is about 1-1.1 Gbp, so some alternate loci are present.
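For reference, a k-mer histogram like the one behind that GenomeScope link can be produced with Jellyfish; the k, hash-size, and thread settings here are illustrative:

```
# Count canonical 21-mers from the HiFi reads (Jellyfish needs uncompressed input,
# hence the process substitution around zcat).
jellyfish count -C -m 21 -s 1G -t 16 -o reads.jf <(zcat sp1102.ccs.fa.gz)
jellyfish histo -t 16 reads.jf > reads.histo
# Upload reads.histo to GenomeScope (qb.cshl.edu/genomescope) for the
# heterozygosity and genome-size fit.
```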
I also checked the two assemblies against each other, and they are in very good agreement, with the exception of one large structural difference. Looking at mapped HiFi read coverage outside of very short (<50 kb) contigs, coverage is even, with only 0.35% of bases at <5x or >50x.
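A sketch of both checks, with placeholder file names; the 50 kb and 5x/50x thresholds are the ones quoted above:

```
# 1. Whole-genome alignment of the two assemblies against each other.
minimap2 -x asm5 canu.contigs.fasta hifiasm.contigs.fasta > canu_vs_hifiasm.paf

# 2. Fraction of bases outside 5-50x coverage, restricted to contigs >= 50 kb.
samtools faidx canu.contigs.fasta
awk '$2 >= 50000 {print $1 "\t0\t" $2}' canu.contigs.fasta.fai > long_contigs.bed
samtools depth -a -b long_contigs.bed hifi_to_canu.bam \
  | awk '{n++; if ($3 < 5 || $3 > 50) out++}
         END {printf "%.2f%% of bases <5x or >50x\n", 100 * out / n}'
```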
In conclusion, I don't see anything concerning in the HiFi assembly from 2.0. The reason 2.0 is so much more continuous than either 1.7 or Flye is that it was tailored to HiFi data, and other assemblers tuned to HiFi data show similar continuity, supporting that the 2.0 assembly is reasonable. Given the NG50, I expect there isn't much left to scaffold (95% of the assembly is in 30 contigs). For a reason I don't know, 3D-DNA doesn't work properly on this assembly; the cause seems to be 3D-DNA over-fragmenting a correct assembly, not Canu making assembly errors. You may want to explore whether 3D-DNA parameters can be adjusted to fix this behavior, or ask its developers for advice.
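One parameter worth trying is 3D-DNA's misjoin-correction round count; setting it to zero should stop the pipeline from splitting contigs at all. A sketch, assuming the usual Juicer merged_nodups.txt input:

```
# -r 0 disables 3D-DNA's iterative misjoin correction (default is 2 rounds),
# so the input contigs are ordered and oriented but never broken.
bash run-asm-pipeline.sh -r 0 canu.contigs.fasta merged_nodups.txt
```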
Hi, I observed two phenomena when using Canu and hope you could help me work out the problems. They involve different Canu versions and different PacBio sequencing methods.

The first is that when I used PacBio CLR reads for assembly, the contig N50 from the newest version (Canu 2.0) was almost twice that of the older version (1.7), which is impressive. But when I incorporated the result with Hi-C data, the final result from the 2.0 version was very disordered and produced a lot of fragments, while the 1.7 version seemed OK and gave me a nearly chromosome-level assembly.

The second is about CCS reads. I used Canu 2.0 for assembly and got a very contiguous result (contig N50 of 50 Mb), but when I combined it with Hi-C data it again resulted in a mess (contig N50 of 1.0 Mb).

BTW, default parameters were used for the assemblies, and 3d-DNA was used when combining the Hi-C data. Could these results be because the 2.0 version introduced some new features to increase contig length? Could I change the parameters back? Or do you have any suggestions for improving the CCS read assembly? Thanks in advance!

Best,
WTA