marbl / verkko


Questions related to interpreting the results of a phased assembly. #245

Open rfinkers opened 3 months ago

rfinkers commented 3 months ago

I'm attempting an assembly (relatively heterozygous diploid genome; 5Gb haploid size; with HiFi, ONT & Hi-C).

In the directory 7-consensus, the unitig-popped fasta is 7.3Gb; however, the unitig-popped haplotype1, haplotype2 and unassigned files are 0 bytes in size.

Does this imply that there is no contribution of the ONT data to the phasing, even though there are reads in ont_subset.fasta.gz?

The final assembly.fasta is still 7.3Gb, with the haplotype1 and haplotype2 fasta files both being 1.3G in size. Is this phasing solely the contribution of the Hi-C data?

What would be good parameters to re-attempt this assembly, to deal with the higher diversity in this species (compared to human)?

Tnx!

skoren commented 3 months ago

The initial assembly file (the 7.3 Gbp one) is still phased; it's just not split into haplotypes. Each contig is from a single haplotype, based on the HiFi + ONT data. In the second case, assembly.fasta will still be the full assembly (haplotype1 + haplotype2 + unassigned), so it makes sense that it has not changed. Hi-C may have linked some sequences together within haplotypes that couldn't be linked without it, making them longer, which would be reflected in the sequence stats.

I suspect the issue is the assembly is almost completely phased already and there isn't anything for HiC to do except assign to a haplotype. This requires homology detection which may not be tolerating the diversity of this species vs human. You can try increasing the --haplo-divergence parameter from the default of 0.05 to 0.10 or 0.2 and see if more of the assembly is assigned. Perhaps @Dmitry-Antipov has other suggestions. If you can share the colors file and the noseq.gfa file here from the 8-hicPipeline folder that should give more info on what is happening with the assembly.

Dmitry-Antipov commented 2 months ago

Without the graph and colors it is hard to say anything, but in any case, such a large unassigned file is not normal. Increasing --haplo-divergence is the first idea for me too.

Which species is it? Do you expect to have large rDNA arrays, and what is the estimated level of heterozygosity?

rfinkers commented 2 months ago

share.zip Thanks @skoren, and yes, I was mixing up terminology. See the attached files for feedback; I'll rerun the assembly with different --haplo-divergence settings and assess the impact on the final results.

Dmitry-Antipov commented 2 months ago

Aha, this is actually another problem. Graph nodes (which are constructed from ONT & HiFi but not Hi-C) are quite fragmented. This is not normal; those regions should be resolved with ONT reads. Without that, there is just not enough Hi-C signal to assign labels to those short haplotype-specific regions: for some chromosome pairs there are thousands of nodes. We expect to have larger haploblocks with HiFi & ONT, and then phase/assign labels for them with Hi-C.

The reason seems to be a verkko problem, but in the graph simplification/ONT alignment stages, and not in Hi-C phasing. I see a huge number of fake bulges with really different coverage between "haplotypes". @skoren can you have a look?

[attached image: fake_bulges]

--haplo-divergence should not actually change anything here.

Dmitry-Antipov commented 2 months ago

Yet another assumption - is it a cell line or a tissue sample? Possibly those numerous 5x vs 50x bubbles can represent problems with the cell line?

skoren commented 2 months ago

The genome seems to be a combination of quite diverged regions and some extremely homozygous regions (like the one @Dmitry-Antipov posted above). In normal homozygous genomes those low-coverage nodes would be simplified and you'd end up with a single large node, but here, because the region has double the coverage of most of the genome, they aren't. I do think that --haplo-divergence will assign more of the assembly, but these very homozygous regions are likely to remain in the unassigned bin and probably belong in both haplotypes.

Dmitry-Antipov commented 2 months ago

Yep, Sergey is right. I did not notice that, in addition to that under-resolved problem, there are some large unlabeled nodes; those labels can be fixed with --haplo-divergence. But it definitely will not help with the fragmented+unassigned problem I was writing about.

rfinkers commented 2 months ago

Gentlemen, thanks so far. There is definitely some food for thought here before I provide some feedback (next week). In the meantime, I'll rerun the final steps with a larger haplotype divergence setting and include that in the feedback as well.

skoren commented 2 months ago

When you re-run with the --haplo-divergence change, I'd suggest also adding the --haploid option. You don't need to re-run the full pipeline from scratch; in fact, it's better if you don't, so you can keep the read/node correspondences to compare the results. This would mean duplicating your current asm folder and removing all the `assembly.*` files along with the `5-, 6-, 7-, and 8-*` steps. The --haploid option should better handle these very homozygous regions and pop those bubbles. It will likely still not assign them if the entire component is homozygous, but at least the assembly will be less fragmented.
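The folder preparation described in this comment could be sketched in Python roughly as follows. This is only an illustration, not part of verkko: the function name `prepare_rerun` and the folder names in the usage note are hypothetical, and the sketch assumes that the files to drop are the top-level entries whose names start with `assembly.` plus the step directories starting with `5-`, `6-`, `7-` or `8-`.

```python
import shutil
from pathlib import Path

# Prefixes of the step directories named in the comment (steps 5 through 8).
STEP_PREFIXES = ("5-", "6-", "7-", "8-")


def prepare_rerun(src: str, dst: str) -> list[str]:
    """Copy the assembly folder `src` to `dst`, then delete from the copy
    the assembly.* outputs and the 5-, 6-, 7-, 8-* step directories.

    Returns the sorted names of the removed top-level entries.
    The original folder is never modified.
    """
    src_path, dst_path = Path(src), Path(dst)
    # copytree raises FileExistsError if dst already exists,
    # so an earlier copy is never silently overwritten.
    shutil.copytree(src_path, dst_path, symlinks=True)

    removed = []
    for entry in list(dst_path.iterdir()):
        is_step = entry.name.startswith(STEP_PREFIXES)
        is_final_output = entry.name.startswith("assembly.")
        if not (is_step or is_final_output):
            continue  # keep earlier steps so read/node correspondences survive
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)
        else:
            entry.unlink()
        removed.append(entry.name)
    return sorted(removed)
```

A call such as `prepare_rerun("asm", "asm_haploid")` (both names are placeholders) would leave `asm` untouched and return the list of entries removed from the copy, which can be checked before launching verkko on the new folder.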

rfinkers commented 4 weeks ago

Still working on understanding this issue. Besides the example you picked out, where that part of the genome is relatively homozygous (on one chromosome), we might have much higher diversity on other chromosomes. Could the diversity be too high, so that verkko fails to detect that two homologues belong to the same chromosome and therefore does not export these sequences to the phased fasta file? If so, this would probably also affect your previous suggestion to run the assembly with '--haploid'?

skoren commented 4 weeks ago

Right, there are two issues. One is that there is high diversity in some regions, and the other is that there are some very homozygous regions. The --haplo-divergence option would help with the first one: without proper homology detection, the Hi-C signal won't be there to push the two haplotypes apart. The change I suggested (--haploid) would address the second one by more aggressively removing bubbles in this assembly, hopefully extending the sequences for Hi-C phasing.

rfinkers commented 3 weeks ago

No improvement with --haploid and --haplo-divergence 0.2. What divergence does the value of 0.2 imply?

Dmitry-Antipov commented 3 weeks ago

Could you share graph and colors for the most recent run once again?

> What divergence does the value of 0.2 imply?

It should help to detect homologous regions and thus better assign colors (by Hi-C phasing) for them. But the corresponding regions (aka nodes in the graph) should be long enough to make it work.

rfinkers commented 3 weeks ago

share.tar.gz Tnx, please find the info attached. The species is a tricky diploid one with a haploid genome estimated to be 5Gb. At least one chromosome is estimated to be (mostly) homozygous; the heterozygosity between the two haplotypes in the other chromosomes is estimated to be above 3-5%, but likely varies from chromosome to chromosome. The prediction workflows out there have difficulties with these estimates, though. This relates to my divergence question: is 0.2 high enough (though it is the max) to detect homologous regions in diverse species? Looking forward to your insights.

rfinkers commented 2 weeks ago

As an addition to the above: HiFi heterozygous/homozygous depth = 40x/80x and ONT heterozygous/homozygous depth = 11x/21x.

Dmitry-Antipov commented 2 weeks ago

I've checked, and I still see both problems we discussed above. Can you also share the node sequences (8-hicPipeline/unitigs.fasta & 8-hicPipeline/unitigs.hpc.fasta)?

As an alternative to the fastas, I can look at mashmap's mappings computed on your side: 8-hicPipeline/run_mashmap.sh & 8-hicPipeline/mashmap.out. The last file can be big; you can safely filter out short alignments with `awk '$11 >= 50000' mashmap.out > mashmap50.out` and send only the smaller mashmap50.out.
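For readers without awk at hand, the same filter could be sketched in Python. The sketch makes no assumption about what the 11th column means; it only reproduces the comparison on the field that awk calls `$11`. The function name `filter_by_field` is made up for this example.

```python
def filter_by_field(lines, field=11, minimum=50000):
    """Yield the lines whose `field`-th whitespace-separated column
    (1-based, like awk's $11) is numerically >= `minimum`.

    Lines with too few columns are skipped, as awk would skip them here.
    Lines with a non-numeric value in that column are also skipped
    (awk would fall back to a string comparison instead).
    """
    for line in lines:
        columns = line.split()
        if len(columns) < field:
            continue
        try:
            value = float(columns[field - 1])
        except ValueError:
            continue
        if value >= minimum:
            yield line


# Example use on real files (paths as in the comment above):
# with open("mashmap.out") as src, open("mashmap50.out", "w") as dst:
#     dst.writelines(filter_by_field(src))
```

Because the function is a generator, the input is streamed line by line, which matters for an output file that is too big to load into memory.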

rfinkers commented 2 weeks ago

With respect to sharing the fastas, we'll have to see how to do this in a secure manner. I'll generate the mashmap50.out file as an alternative. But as the mashmap.out file was cleaned up, I'll have to rerun 8-hicPipeline/run_mashmap.sh, which will take some days. Not sure if this helps, but the raw sizes are unitigs.fasta (7.4G) and unitigs.hpc.fasta (5.2G). The haploid genome is approx. 4.5-5Gb.

Dmitry-Antipov commented 2 weeks ago

Yeah, sharing fastas can be a sensitive issue. I need them only to look at mashmap's results, so running it on your side and sending us the resulting mappings is perfectly fine.

rfinkers commented 2 weeks ago

mashmap60.out.gz

mashmap50.out would be too large to upload (GitHub limits), so I used `awk '$11 >= 60000' mashmap.out > mashmap60.out` instead. Hope this is fine. Looking forward to hearing your insights.

Dmitry-Antipov commented 2 weeks ago

OK, so the mashmap "alignments" are really fragmented; we just do not see enough similarity to phase the corresponding nodes. For example, utig4-14078 is likely homologous to utig4-18672, but the total length of the homologous stretches reported by mashmap is just 2.5M out of >20M. So, returning to your question, it looks like 0.2 is really not enough here. You can try something like 0.25 as --haplo-divergence, but mashmap is extremely slow at low sequence identity (verkko's haplo_divergence equals (100 - mashmap_percent_identity)/100, so effectively it would be running mashmap with --pi 75), and we were never patient enough to check the results.
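The arithmetic in this comment can be written out as a short Python sketch. The function names are made up for illustration; the only relation taken from the thread is haplo_divergence = (100 - mashmap_percent_identity)/100.

```python
def haplo_divergence_to_pi(haplo_divergence: float) -> float:
    """Mashmap percent identity implied by a verkko --haplo-divergence
    value, using haplo_divergence = (100 - pi) / 100 from the thread."""
    return 100.0 - 100.0 * haplo_divergence


def pi_to_haplo_divergence(percent_identity: float) -> float:
    """Inverse of the relation above."""
    return (100.0 - percent_identity) / 100.0


def homologous_fraction(aligned_bp: float, node_bp: float) -> float:
    """Share of a node covered by the homologous stretches that were found."""
    return aligned_bp / node_bp


if __name__ == "__main__":
    for value in (0.05, 0.10, 0.20, 0.25):
        print(f"--haplo-divergence {value:.2f} -> --pi {haplo_divergence_to_pi(value):.0f}")
    # 2.5M aligned out of a node of at least 20M, as quoted for utig4-14078
    print(f"homologous fraction <= {homologous_fraction(2.5e6, 20e6):.1%}")
```

So the default of 0.05 corresponds to 95% identity, 0.2 to 80%, and 0.25 to 75%. With 2.5M of more than 20M aligned, at most about 12.5% of the node is recognised as homologous, which is why the pair cannot be phased.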

We can make some improvements on our side, to test that they are somehow reasonable we'll also need hic.byread.compressed (just pairwise hi-c mapping counts) from 8-hicPipeline

@skoren, do you have any suggestions about improving bulge removal in the haploid part of this genome (see attached figure)?

[attached image: erroneous_bulges]

rfinkers commented 2 weeks ago

xab.gz xaa.gz Ok, needed to split the file in two, in order to be able to circumvent GitHub limits.

@Dmitry-Antipov if I read your suggestion correctly, it would be OK to re-execute the run_mashmap.sh script with the modified --pi setting. Correct? Or would a rerun of the complete pipeline with the modified --haplo-divergence 0.25 setting be necessary?

Dmitry-Antipov commented 2 weeks ago

you can edit that script and then rerun hic_phasing.sh to see whether the number of non-phased nodes (phased ones are listed in hicverkko.colors.tsv) will be significantly reduced or not.
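One way to make that comparison concrete is to compare the sets of node names listed in the two hicverkko.colors.tsv files. The sketch below is an illustration only. It assumes the file is tab-separated with the node name in the first column and, possibly, a header row whose first field is `node`; neither assumption is confirmed in this thread. The function names are hypothetical.

```python
def read_phased_nodes(lines, header_first_field="node"):
    """Collect node names from the first tab-separated column.

    Assumption (not confirmed by the thread): the first column holds the
    node name, and a header row, if present, starts with `node`.
    """
    nodes = set()
    for line in lines:
        first = line.rstrip("\n").split("\t")[0]
        if not first or first == header_first_field:
            continue  # skip blank lines and the header row
        nodes.add(first)
    return nodes


def compare_runs(before, after):
    """Summarise how the set of phased nodes changed between two runs."""
    return {
        "before": len(before),
        "after": len(after),
        "shared": len(before & after),
        "only_before": len(before - after),
        "only_after": len(after - before),
    }


# Example use (paths are placeholders for the two run folders):
# with open("run1/8-hicPipeline/hicverkko.colors.tsv") as a, \
#      open("run2/8-hicPipeline/hicverkko.colors.tsv") as b:
#     print(compare_runs(read_phased_nodes(a), read_phased_nodes(b)))
```

A large `only_after` count relative to `before` would indicate that the lower identity threshold let more nodes be phased; nearly identical sets would suggest it made little difference.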

skoren commented 5 days ago

On the bubble popping issue, verkko doesn't expect to have such a wide range of heterozygosity in the same genome and we rely on a global coverage estimate from the graph. This is why the haploid option didn't help. Some chromosomes/components are diploid and their average coverage is per haplotype while those with the unpopped bubbles are almost homozygous so the coverage there is per haplotype * 2. These are considered repeats and so aren't processed. I've made a branch to address this (using local component coverage and not global coverage) and the results look much more reasonable on the graph you shared. Once it's tested more and integrated into master you should be able to re-run from the 5-untip step.
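The effect described here can be illustrated with a toy Python example. This is not the code from the branch: the rule "coverage above 1.5 times the reference counts as a repeat" and the use of a plain median are assumptions chosen only to show why a global estimate misjudges a homozygous component. The 40x and 80x figures are the HiFi depths reported earlier in the thread.

```python
from statistics import median


def looks_like_repeat(node_cov: float, reference_cov: float, factor: float = 1.5) -> bool:
    """Hypothetical rule: a node counts as a repeat when its coverage
    exceeds `factor` times the reference (single-copy) coverage."""
    return node_cov > factor * reference_cov


def component_reference(node_coverages) -> float:
    """Toy local estimate: the median coverage of the nodes in one component."""
    return median(node_coverages)


if __name__ == "__main__":
    global_cov = 40.0                         # per-haplotype depth, dominant in the graph
    homozygous_component = [80, 78, 82, 5, 80]  # long 80x nodes plus a 5x bubble arm

    node = 80.0
    # Judged against the global estimate, the 80x node looks like a 2-copy repeat.
    print("global:", looks_like_repeat(node, global_cov))
    # Judged against its own component, the same node is ordinary single-copy
    # sequence, so the low-coverage bubble next to it could be simplified.
    print("local: ", looks_like_repeat(node, component_reference(homozygous_component)))
```

With these numbers the node is flagged under the global estimate and not flagged under the local one, which matches the behaviour change described for the branch.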

rfinkers commented 3 days ago

The size of hicverkko.colors.tsv from the previous run and after rerunning hic_phasing.sh is similar. There are a few changes, but the majority of contigs are common to both files. Food for thought. Also, I'm still puzzled by the 5x coverage bubbles in the graph above. Haploid HiFi depth is 40x, homozygous 80x. Could these bubbles come from the incorporation of the ONT or Hi-C data?

skoren commented 2 days ago

The coverage on those nodes is based solely on HiFi kmers so they couldn't have come from ONT data (which would have 0 HiFi k-mers) and Hi-C data doesn't add any sequence to the graph. I suspect there is either some low-level somatic variation (if this is a cell line or similar) or some kind of systematic error.

rfinkers commented 2 days ago

It's a diploid plant species, haploid genome ~5Gb. I'm slowly getting some additional information from orthogonal datasets. Some chromosomes are similar and some are diverged. With ~7.5Gb, the total size of the assembly is not that bad (Verkko / p_utg Hifiasm ~8Gb). I see signals that contigs from some chromosomes are nicely phase-separated, while in others the hamming rate is extremely high. I can share some more insights, but not via this medium. The update you commented on three days ago, "using local component coverage and not global coverage", sounds like one strategy that would push things forward. I'm happy to give it a try, even if it is still in a test branch, but I will wait until you give the go-ahead @skoren. @Dmitry-Antipov does it make sense to look further at the mashmap output / hicverkko.colors.tsv? Or is it clear enough that this was not the way forward?

Dmitry-Antipov commented 2 days ago

I have all the data needed for now. I believe that it's possible to see at least some improvement on the unphased contigs, but I just haven't implemented the corresponding fix yet. I hope to post an update on this issue at the beginning of next week.

rfinkers commented 1 day ago

I have some other things to do, but can have more focus on this project again in August