marbl / verkko

265 stars 27 forks source link

Verkko 2.0 run in HiFi + HiC mode for Philedonides lunana #248

Closed ksenia-krasheninnikova closed 1 month ago

ksenia-krasheninnikova commented 2 months ago

Hello,

I was trying to run the verkko 2.0 release on one of the Tree of Life insect genomes. The species is Philedonides lunana with estimated heterozygosity 2.0% and estimated genome size 282Mb. HiFi and HiC data were obtained from the same individual with HiFi coverage ~81X per haplotype and HiC coverage ~210X per haplotype. Apart from the LSF configuration the assembly was run with default settings:

verkko -d ilPhiLuna1.verkko.20240408 --lsf --snakeopts \"--cores 1000 --profile verkko_asm/verkko/lib/verkko/profiles/\" --hifi m84093_240308_124538_s1.ccs.bc2087.rmdup.filtered.fasta.gz --hic1 ilPhiLuna1.hic.1.fastq --hic2 ilPhiLuna1.hic.2.fastq

This run produced a total assembly.fasta of 570 Mb

SCAFFOLD    sum = 570,612,339, n = 2820, largest = 1,732,248, smallest = 3513
SCAFFOLD    N50 = 351,242, L50 = 515
CONTIG          sum = 570,512,235, n = 2822,, largest = 1,732,248, smallest = 3513
CONTIG          N50 = 351,242, L50 = 515
GAP         sum = 100,104, n = 2, mean = 50052, largest = 99,104, smallest = 1000

BUSCO score for this assembly is C:76.6%[S:24.1%,D:52.5%],F:1.5%,M:21.9%,n:1367

Also, the results contained haplotype1 assembly of 22 Mb:

sum = 22,479,309, n = 83, largest = 947,707, smallest = 16177
N50 = 459,182, L50 = 18

and haplotype2 assembly of 35 Mb

SCAFFOLD    sum = 35,728,586, n = 134, largest = 1,294,928, smallest = 6210
SCAFFOLD    N50 = 539,389, L50 = 23
CONTIG          sum = 35,628,482, n = 136, largest = 1,294,928, smallest = 6210
CONTIG          N50 = 539,389, L50 = 23
GAP         sum = 100,104, n = 2, mean = 50052, largest = 99,104, smallest = 1000

BUSCO completeness for both assemblies are 4.8% and 6.1% correspondingly.

Would you recommend to tweak any particular settings to achieve a better assembly contiguity and completeness and to improve its phasing?

Many thanks!

Ksenia

ksenia-krasheninnikova commented 2 months ago

Running it with --haplo-divergence 0.2 now to check if it makes a better phasing.

Dmitry-Antipov commented 2 months ago

Hi, Increasing --haplo-divergence is clearly the first thing to try. If it will not help - will need 8-hicPipeline/hicverkko.log and 8-hicPipeline/unitigs.hpc.noseq.gfa to start investigation.

Although verkko development was not focused on HiC+ HiFi combination (without ONT), I do not know reasons why it should fail on this combo, possibly tweaking some constant may improve your assemblies significantly.

ksenia-krasheninnikova commented 2 months ago

Hi @Dmitry-Antipov

Thanks!

This is the asm stats with --haplo-divergence 0.2:

assembly.fasta
SCAFFOLD    sum = 570,612,339, n = 2820, largest = 1,732,248, smallest = 3513
SCAFFOLD    N50 = 351,242, L50 = 515
CONTIG  sum = 570,512,235, n = 2822, largest = 1,732,248, smallest = 3513
CONTIG  N50 = 351,242, L50 = 515
GAP sum = 100,104, n = 2, mean = 50052, largest = 99,104, smallest = 1000

assembly.haplotype1.fasta
SCAFFOLD    sum = 34,410,505, n = 125, largest = 1,294,928, smallest = 6210
SCAFFOLD    N50 = 543,317, L50 = 22
CONTIG  sum = 34,310,401, n = 127, largest = 1,294,928, smallest = 6210
CONTIG  N50 = 543,317, L50 = 22
GAP sum = 100,104, n = 2, mean = 50052, largest = 99104, smallest = 1000

assembly.haplotype2.fasta
sum = 29,528,636, n = 98, largest = 1,183,681, smallest = 11749
N50 = 522,045, L50 = 20

Is unitigs.hpc.noseq.gfa the graph before or after HiC-based resolution? Can it be so that the graph (screenshot is attached) in this case doesn't contain enough haplotype information to make HiC phasing applicable at all?

The log and graph files for this run are here: unitigs.hpc.noseq.gfa hicverkko.log Screenshot 2024-04-22 at 09 43 12

Dmitry-Antipov commented 2 months ago

Graph is clearly the major problem- it is way too fragmented. We do not have graph after Hi-C, unitigs.hpc.noseq.gfa is just HiFi. And coverage is pretty high.. Did you try hifiasm for this sample? Is their graph similarly fragmented?

Hi-C phasing still can work here, but high graph fragmentation is likely the may reason for phasing to fail, we do not have enough long nodes we rely on. If we tweak some constants about node lengths, we'll definitely be able to phase more, but I'd really worry about quality of such phasing (and the assembly itself) because of such a huge amount of gaps.

ksenia-krasheninnikova commented 2 months ago

Yes, we've got hifiasm in HiC mode assembly for this genome:

ilPhiLuna1.hic.hap1.p_ctg.fa.gz
sum = 358,301,589, n = 2458, largest = 2,441,148, smallest = 7940
N50 = 349,182, L50 = 297

ilPhiLuna1.hic.hap2.p_ctg.fa.gz
sum = 330,337,029, n = 1806,  largest = 1,949,415, smallest = 7586
N50 = 379,502, L50 = 261

N50 is quite low here too and it's also visible from the assembly graph that the contigs are similarly fragmented. Eventually hifiasm ends up with more balanced total sizes for both haplotypes. However there are phasing issues in this hifiasm assembly - which is possible to say by looking at the joint HiC map. The idea was to try and get verkko running on the genomes with phasing issues like this one. We currently have hifiasm in production for all our genomes but would be nice to get verrko working as a possible alternative.

Dmitry-Antipov commented 2 months ago

We can try to phase better, there are some clear length-related constants to tweak. To do so I'll also need unitigs.hpc.fasta and hic.byread.compressed files from 8-hicPipeline folder. If the assembly is too big to be shared through github, we can use google drive - dmitrij.antipov@gmail.com there.

But anyway, such fragmentation does not look natural and can lead to lots of misassemblies/haplotype switches.. Are the hifi reads publicly available? Although hifiasm graph is fragmented, it seems to be less fragmented than verkko's and I have no ideas why.

ksenia-krasheninnikova commented 2 months ago

I have put the assembly files here:

53M https://darwin.cog.sanger.ac.uk/ilPhiLuna1.haplo_divergence.0.2/hic.byread.compressed 106M https://darwin.cog.sanger.ac.uk/ilPhiLuna1.haplo_divergence.0.2/unitigs.hpc.fasta.gz

the HiFi reads: 6.7Gb https://darwin.cog.sanger.ac.uk/ilPhiLuna1.m84093_240308_124538_s1.ccs.bc2087.rmdup.filtered.fasta.gz

Regarding the fragmentation of the graph - in case verkko inherited some of the hicanu error correction and graph built approaches then it may be related to what we've seen in hicanu and hifiasm runs for many of the genomes. The hicanu assemblies always tend to be more fragmented. In fact I can try runnig verkko without error correction or using the hifiasm error corrected reads as input just for an experiment.

Thanks very much for having a look into it! If there is a tweaked version of the assembly I can try and have a look if it does better phasing using the HiC data (can share the HiC reads too). Also, if there is a tweaked version of the package I'd be very happy to try it on some other genomes.

ksenia-krasheninnikova commented 2 months ago

I can confirm that running verkko without error correction on reads or using error-correction performed by hifiasm doesn't fundamentally change the assembly, including sizes of the phased haplotypes.

Also, I have to mention that the HiFi data for this dataset was obtained using ultra-low input protocol with resulting reads N50 9,789bp. ULI datasets usually tend to have relatively low contiguity - like we see here in both, verkko and hifiasm results. Although, of course, this doesn't explain differences between the two assemblers.

There is an ENA entry for ilPhiLuna1 - if perhaps you want to have a look at the HiC reads, also there is HiFi reads there (must be same as I shared above): https://www.ebi.ac.uk/ena/browser/view/PRJEB73965

Dmitry-Antipov commented 2 months ago

I've changed a bit the phasing algo, but still half of the assembly is not phased - grey nodes on figure attached. And this is clearly because of the fragmentation. So, this still do not look like a decent assembly to work with. Checking possible reasons of the fragmentation on our side.

P lunara

Dmitry-Antipov commented 2 months ago

I've decreased starting k value for the mbg from default 1K (--base-k 101 --window 70 was the most extreme, corresponding to the figure attached), and the graph is still pretty bad but waaay more continuous. Some of those connections have really low coverage, but I have no idea whether they are correct

So, I'd say that for this protocol the problem is not that resulting reads are short(N50 10K is pretty OK), but coverage and read breaks seems to be really uneven. I do not know what can we do to improve graph with such data, and without decent graph we are not able to do decent phasing.

mbg100

ksenia-krasheninnikova commented 2 months ago

When you say read breaks do you mean breaks in coverage along the genome?

Thank you for giving it a try!

Dmitry-Antipov commented 2 months ago

Yes. Possibly some regions were not amplified well?

I'm attaching kmer counting stats (hifi reads, meryl counter, k=63, homopolymer compressed) Although there are clear peaks around 80x ("haploid" coverage) and 160x coverage, there are lots of kmers with frequency more than 3 (so very likely not erroneous), but way less than haploid peak value.

63.stats.txt

ksenia-krasheninnikova commented 1 month ago

Yes, ULI data can have some sequencing drop-outs which have been observed in genomic regions that can fold in hairpin structures. I believe it can cause graph fragmentation in this case.

I don't think that the abundance of kmers around 3X and 4X is necessarily related to ULI protocol. It can also correspond to genomic repeats. The attached genomescope profile has pronounced peaks at higher coverages and it was build on a data produced with standard PacBio protocol for Dalopius marginatus

Dmitry-Antipov commented 1 month ago

3x(haploid coverage) and 4X are not a problem, I was speaking about coverage just 3 and higher. A bit strange that meryl's diagram (for k=63 but still) looked so different, with way more kmers not in the peaks but in the dip before haploid coverage peak. But anyway, if you know a biological reason for the coverage drop then it's likely the real problem, and we can not do anything on verkko's side.

Possibly a "diploid" Hi-C scaffolding (I mean doing both scaffolding and phasing simultaneously) can provide better scaffolds for such samples, but I do not know any existing tool to recommend.

ksenia-krasheninnikova commented 1 month ago

Thank you. We will probably try verkko on some other datasets in the future. Will keep in mind that it may perform better on the sequencing from standard PacBio libraries.