Closed YPGG1234 closed 2 years ago
First, the N50 you quoted is based on the compressed pre-consensus contigs; the post-consensus N50 is 1.6 Mb. You've got pretty low coverage and relatively short reads, so I think your result isn't too surprising. At 20x coverage you have <10x coverage per haplotype. HiFi data has uneven coverage in some contexts, so it's likely less in many places. As for the diversity, I'm not sure your genome is as high diversity as you estimated. The k-mer plots don't show any secondary peak for coverage as you'd expect for a heterozygous genome, though this might also be caused by low coverage. That likely explains why the asm isn't the full 4.6 Gb in size. I'd run purge_dups and BUSCO to estimate how complete the genome is and how much of each haplotype is assembled.
The biggest improvement you can make to your assembly would be to increase coverage. You could do this by sequencing more or by running DeepConsensus, which can increase the Q20 yield of existing cells.
Hello, skoren
Thanks for your prompt reply! BUSCO on the contigs shows that the assembly does have a lot of duplication (C:96.8%[S:30.6%,D:66.2%]). I then used purge_haplotigs with the HiFi reads to purge the contigs and got ~2.3 Gb of primary contigs and ~1.6 Gb of haplotigs.
I estimated the heterozygosity with GenomeScope (I am sorry, the heterozygosity is 1.3% instead of 1.7%).
I see that 20x coverage is enough for current hybrid methods from here, but the Hi-C heatmap of the sex chromosome, which came from Hi-C-based scaffolding of the contigs, is messed up, maybe due to the relatively low sequencing coverage or an incomplete purge.
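For anyone reading along, here is a minimal sketch of how the BUSCO summary string above can be turned into numbers. The helper name `parse_busco` is made up for illustration and is not part of BUSCO; the only input is the string quoted in this post.

```python
import re

def parse_busco(summary):
    """Extract the percentages from a BUSCO one-line summary string.

    Hypothetical helper for illustration; not a BUSCO API.
    """
    # Each field looks like "C:96.8%", so capture the letter and the number.
    return {key: float(val) for key, val in re.findall(r"([A-Z]):([\d.]+)%", summary)}

scores = parse_busco("C:96.8%[S:30.6%,D:66.2%]")

# Fraction of the complete BUSCOs that are present in more than one copy.
dup_fraction = scores["D"] / scores["C"]
print(f"complete: {scores['C']}%  duplicated: {scores['D']}%")
print(f"share of complete BUSCOs that are duplicated: {dup_fraction:.0%}")
```

A duplicated share this high suggests that both haplotypes were assembled separately across much of the genome, which fits the ~2.3 Gb primary plus ~1.6 Gb haplotig split after purging.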
Do you have any suggestions? Thanks.
That 20x you're referencing is just the minimum. Less than this wouldn't get you a complete genome. However, continuity tends to increase until you get to 35-40x so typical projects target at least this much (https://github.com/human-pangenomics/hpgp-data).
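As a rough illustration of the coverage arithmetic, here is a small sketch. It assumes coverage is measured against the ~2.3 Gb haploid genome size from the original post and that reads split evenly between the two haplotypes; the function name `coverage_plan` is made up for this example.

```python
def coverage_plan(genome_size_bp, total_coverage, target_coverage, ploidy=2):
    """Return (per-haplotype coverage, extra bases needed to reach the target).

    Assumes reads are spread evenly over haplotypes, which real HiFi data
    only approximates.
    """
    per_haplotype = total_coverage / ploidy
    # No extra sequencing is needed if the target is already met.
    extra_coverage = max(0.0, target_coverage - total_coverage)
    extra_bases = extra_coverage * genome_size_bp
    return per_haplotype, extra_bases

# Values from this thread: ~2.3 Gb genome, 20x now, 35x as the lower target.
per_hap, extra = coverage_plan(2_300_000_000, 20, 35)
print(f"per-haplotype coverage: {per_hap:.1f}x")
print(f"extra HiFi yield needed for 35x: {extra / 1e9:.1f} Gb")
```

Swapping the target to 40 gives the upper end of the range mentioned above.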
As for the GenomeScope plot, I think it is over-estimating the heterozygosity. Compare its model fit line (black) to the actual k-mer counts (blue). The true het peak is much lower and smoother than the modeled one. It's also estimating the genome size as only 1.7 Gb, not 2.3 Gb. So I wouldn't trust those estimates too much in this case.
I'm not sure what to make of your Hi-C plot; it could be consistent with a centromere in the middle across which interaction is less frequent, or with another biological structure (e.g. see the human X here: https://www.nature.com/articles/s41586-020-2547-7/figures/12). You'd want to validate the assembly using read alignments and other information as I suggested in #2084. As for what to do, your best option is increasing coverage. The one option that doesn't require more sequencing is DeepConsensus (https://github.com/google/deepconsensus), which can give a higher Q20 read yield from the same input, so I'd probably start with that.
The Hi-C (coverage >50x) plot was made using the Juicer + 3D-DNA + Juicebox pipeline.
I will follow your suggestions, thank you!
Hello,
Recently I used HiCanu (v2.2) to assemble a mammalian genome (genome size: ~2.3 Gb, heterozygosity: ~1.7%). I assembled this genome with the default recommended HiFi parameters, but the resulting asm.contigs.fasta is 4 Gb (far from the expected 4.6 Gb) and its continuity is very low (N50: 747140). Here is my assembly report:
Is this normal? Could you help me see where I'm going wrong? Thanks.
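For reference, the N50 figure discussed in this thread is the contig length at which the contigs of that length or longer cover at least half of the total assembly. A minimal sketch, using made-up contig lengths rather than the real assembly:

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0

# Hypothetical contig lengths, for illustration only.
example_lengths = [10, 8, 6, 4, 2]
print(n50(example_lengths))
```

Because the value depends on the contig lengths being measured, computing it on compressed pre-consensus contigs and on the final contigs gives different numbers for the same assembly.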