marbl / verkko

Telomere-to-telomere assembly of accurate long reads (PacBio HiFi, Oxford Nanopore Duplex, HERRO corrected Oxford Nanopore Simplex) and Oxford Nanopore ultra-long reads.

Questions related to interpretation of phased assembly results. #245

Open rfinkers opened 7 months ago

rfinkers commented 7 months ago

I'm attempting an assembly (relatively heterozygous diploid genome; 5Gb haploid size; with Hifi, ONT & Hi-C).

In the directory 7-consensus, unitig-popped.fasta is 7.3 Gb; however, the unitig-popped.haplotype1, haplotype2, and unassigned files are 0 bytes in size.

Does this imply that there is no contribution of the ONT data to the phasing, even though there are reads in ont_subset.fasta.gz?

The final assembly.fasta is still 7.3 Gb, with the haplotype1 and haplotype2 fastas both being 1.3 Gb in size. Is this phasing the sole contribution of the Hi-C data?

What would be good parameters to re-attempt this assembly, to deal with the higher diversity in this species (compared to human)?

Tnx!

skoren commented 7 months ago

The initial assembly (the 7.3 Gbp file) is still phased, it's just not split into haplotypes. Each contig is from a single haplotype based on the HiFi + ONT data. In the second case, the assembly.fasta will still be the full assembly (haplotype1 + haplotype2 + unassigned), so it makes sense that it has not changed. It may have linked some sequences together within haplotypes that couldn't be linked without Hi-C, making them longer, which would be reflected in the sequence stats.

I suspect the issue is the assembly is almost completely phased already and there isn't anything for Hi-C to do except assign to a haplotype. This requires homology detection, which may not be tolerating the diversity of this species vs. human. You can try increasing the --haplo-divergence parameter from the default of 0.05 to 0.10 or 0.2 and see if more of the assembly is assigned. Perhaps @Dmitry-Antipov has other suggestions. If you can share the colors file and the noseq.gfa file from the 8-hicPipeline folder here, that should give more info on what is happening with the assembly.
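For concreteness, a hedged sketch of what such a re-run could look like; the output directory and read paths are placeholders, and everything except --haplo-divergence is just the usual invocation:

```bash
# Sketch only: same inputs as the original run, with a more permissive
# homology-detection threshold. Paths are placeholders.
verkko -d asm_hd010 \
  --hifi hifi/*.fastq.gz \
  --nano ont/*.fastq.gz \
  --hic1 hic/*_R1.fastq.gz --hic2 hic/*_R2.fastq.gz \
  --haplo-divergence 0.10
```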

Dmitry-Antipov commented 7 months ago

Without the graph and colors it is hard to say anything, but in any case such a large unassigned file is not normal. Increasing --haplo-divergence is the first idea for me too.

Which species is it? Do you expect large rDNA arrays, and what is the estimated level of heterozygosity?

rfinkers commented 7 months ago

share.zip Thanks @skoren, and yes, I was mixing up terminology. See the attached files for feedback; I'll rerun the assembly with different --haplo-divergence settings and assess the impact on the final results.

Dmitry-Antipov commented 7 months ago

Aha, this is actually another problem. The graph nodes (which are constructed from ONT & HiFi, but not Hi-C) are quite fragmented. This is not normal; those regions should be resolved with ONT reads. Without that, there is just not enough Hi-C signal to assign labels to those short haplotype-specific regions; for some chromosome pairs there are thousands of nodes. We expect to have larger haploblocks with HiFi & ONT, and then phase/assign labels to them with Hi-C.

The reason seems to be a verkko problem, but at the graph simplification/ONT alignment stages, not Hi-C phasing. I see a huge number of fake bulges with really different coverage between "haplotypes". @skoren, can you have a look? (attached figure: fake_bulges)

--haplo-divergence should not actually change anything here.

Dmitry-Antipov commented 7 months ago

Yet another guess: is it a cell line or a tissue sample? Could those numerous 5x vs 50x bubbles represent problems with the cell line?

skoren commented 7 months ago

The genome seems to be a combination of quite diverged regions and some extremely homozygous regions (like the one @Dmitry-Antipov posted above). In a normal homozygous genome those low-coverage nodes would be simplified and you'd end up with a single large node, but here, because it's double the coverage of most of the genome, they aren't. I do think that --haplo-divergence will assign more of the assembly, but these very homozygous regions are likely to remain in the unassigned bin and probably belong in both haplotypes.

Dmitry-Antipov commented 7 months ago

Yep, Sergey is right; I did not notice that in addition to that under-resolved problem there are some unlabeled large nodes. Those labels can be fixed with --haplo-divergence, but it definitely will not help with the fragmented+unassigned problem I was writing about.

rfinkers commented 7 months ago

Gentlemen, thanks so far. There is definitely some food for thought here before I provide feedback (next week). In the meanwhile, I'll rerun the final steps with a larger haplotype divergence setting and take that along in the feedback as well.

skoren commented 7 months ago

When you re-run with the haplo-divergence change, I'd suggest also adding the --haploid option. You don't need to re-run the full pipeline from scratch; in fact it's better if you don't, so you can keep the read/node correspondences to compare the results. This would mean duplicating your current asm folder and removing all the assembly.* files along with the `5-, 6-, 7-, and 8-*` steps. That should better handle these very homozygous regions and pop those bubbles. It will likely still not assign them if the entire component is homozygous, but at least the assembly will be less fragmented.
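A rough shell sketch of that restart, assuming the existing run lives in a directory called asm/ and the same read inputs as before; directory and file names are illustrative only:

```bash
# Duplicate the run, drop the final-stage outputs, and re-run with --haploid.
cp -r asm asm_haploid
rm -rf asm_haploid/assembly.* asm_haploid/5-* asm_haploid/6-* asm_haploid/7-* asm_haploid/8-*
verkko -d asm_haploid \
  --hifi hifi/*.fastq.gz --nano ont/*.fastq.gz \
  --hic1 hic/*_R1.fastq.gz --hic2 hic/*_R2.fastq.gz \
  --haploid --haplo-divergence 0.2
```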

rfinkers commented 5 months ago

Still working on understanding this issue. Besides the example you picked out, where that part of the genome is relatively homozygous (on one chromosome), on other chromosomes we might have much higher diversity. Could the diversity be too high, resulting in a failure to detect that two homologues belong to the same chromosome and therefore in those sequences not being exported to the phased fasta file? If so, this would presumably also affect your previous suggestion of assembling with '--haploid'?

skoren commented 5 months ago

Right, there are two issues: one, there is high diversity in some regions, and two, there are some very homozygous regions. The --haplo-divergence would help with the first one; without proper haplotype phasing the Hi-C signal won't be there to push the two haplotypes apart. The change I suggested (--haploid) would address the second one by more aggressively removing bubbles in this assembly, hopefully extending the sequences for Hi-C phasing.

rfinkers commented 5 months ago

No improvement with --haploid and --haplo-divergence 0.2. What divergence does the value of 0.2 imply?

Dmitry-Antipov commented 5 months ago

Could you share graph and colors for the most recent run once again?

> What divergence does the value of 0.2 imply?

It should help to detect homologous regions and thus better assign colors (by Hi-C phasing) for them. But the corresponding regions (aka nodes in the graph) need to be long enough for it to work.

rfinkers commented 5 months ago

share.tar.gz Tnx, please find the info attached. The species is a tricky diploid, with a haploid genome estimated at 5 Gb. At least one chromosome is estimated to be (mostly) homozygous; on the other chromosomes, heterozygosity between the two haplotypes is estimated to be above 3-5%, but likely varies from chromosome to chromosome. The available prediction workflows have difficulties with these estimates, though, which relates to my divergence question: is 0.2 high enough (even though it is the maximum) to detect homologous regions in diverse species? Looking forward to your insights.

rfinkers commented 5 months ago

As an addition to the above: HiFi heterozygous/homozygous depth = 40x/80x and ONT heterozygous/homozygous depth = 11x/21x.

Dmitry-Antipov commented 5 months ago

I've checked; I still see both problems we discussed above. Can you also share the node sequences (8-hicPipeline/unitigs.fasta & 8-hicPipeline/unitigs.hpc.fasta)?

As an alternative to the fastas, I can look at mashmap's mappings computed on your side: 8-hicPipeline/run_mashmap.sh & 8-hicPipeline/mashmap.out. The last file can be big; you can safely filter out short alignments with `awk '$11 >= 50000' mashmap.out > mashmap50.out` and send only the smaller mashmap50.out.

rfinkers commented 5 months ago

With respect to sharing the fastas, we'll have to see how to do this in a secure manner. I'll generate the mashmap50.out file as an alternative. But as the mashmap.out file was cleaned up, I'll rerun 8-hicPipeline/run_mashmap.sh, which will take some days. Not sure if this helps, but unitigs.fasta (7.4G) and unitigs.hpc.fasta (5.2G) are the raw sizes. The haploid genome is approx. 4.5-5 Gb.

Dmitry-Antipov commented 5 months ago

Yeah, sharing fastas can be a sensitive issue. I need them only to look at mashmap's results, so running it on your side and sending us the resulting mappings is perfectly fine.

rfinkers commented 5 months ago

mashmap60.out.gz attached. mashmap50.out would be too large to upload (GitHub limits), so I used `awk '$11 >= 60000' mashmap.out > mashmap60.out` instead. Hope this is fine. Looking forward to hearing your insights.

Dmitry-Antipov commented 5 months ago

OK, so the mashmap "alignments" are really fragmented; we just do not see enough similarity to phase the corresponding nodes. E.g. utig4-14078 is likely homologous to utig4-18672, but the total length of the homologous stretches reported by mashmap is just 2.5M out of >20M of node length. So, returning to your question, it looks like 0.2 is really not enough here. You can actually try something like 0.25 for --haplo-divergence, but mashmap is extremely slow at low sequence identity (verkko's haplo-divergence equals (100 - mashmap_percent_identity)/100, so effectively it is running mashmap with --pi 75) and we were never patient enough to check the results.
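As a small illustration of that conversion (not verkko code, just the arithmetic stated above):

```bash
# --pi passed to mashmap = 100 - 100 * haplo-divergence
for hd in 0.05 0.10 0.20 0.25; do
  awk -v hd="$hd" 'BEGIN { printf "haplo-divergence %.2f -> mashmap --pi %g\n", hd, 100 - 100*hd }'
done
```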

We can make some improvements on our side; to check that they are reasonable, we'll also need hic.byread.compressed (just the pairwise Hi-C mapping counts) from 8-hicPipeline.

@skoren, do you have any suggestions for improving bulge removal in the haploid part of this genome? (attached figure: erroneous_bulges)

rfinkers commented 5 months ago

xab.gz xaa.gz Ok, I needed to split the file in two to stay within the GitHub limits.

@Dmitry-Antipov, if I read your suggestion correctly, it would be ok to re-execute the run_mashmap.sh script with the modified --pi setting. Correct? Or would a rerun of the complete pipeline with the modified --haplo-divergence 0.25 setting be necessary?

Dmitry-Antipov commented 5 months ago

You can edit that script and then rerun hic_phasing.sh to see whether the number of non-phased nodes (the phased ones are listed in hicverkko.colors.tsv) is significantly reduced or not.
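A hedged sketch of that test, run inside 8-hicPipeline; it assumes hicverkko.colors.tsv lists one phased node per line and that you keep a copy of the old file to compare against:

```bash
cp hicverkko.colors.tsv hicverkko.colors.pi80.tsv   # keep the previous result
# edit the --pi value in run_mashmap.sh first (e.g. 80 -> 75), then:
bash run_mashmap.sh                                 # slow at the lower identity
bash hic_phasing.sh
wc -l hicverkko.colors.tsv hicverkko.colors.pi80.tsv   # did the phased-node count grow?
```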

skoren commented 4 months ago

On the bubble popping issue: verkko doesn't expect such a wide range of heterozygosity within the same genome, and we rely on a global coverage estimate from the graph. This is why the --haploid option didn't help. Some chromosomes/components are diploid and their average coverage is per haplotype, while those with the unpopped bubbles are almost homozygous, so the coverage there is per haplotype * 2. These are considered repeats and so aren't processed. I've made a branch to address this (using local component coverage rather than global coverage) and the results look much more reasonable on the graph you shared. Once it's tested more and integrated into master you should be able to re-run from the 5-untip step.

rfinkers commented 4 months ago

The size of hicverkko.colors.tsv from the previous run and after rerunning hic_phasing.sh is similar. A few changes, but the majority of contigs are common to both files. Food for thought. Also, I'm still puzzled by the 5x coverage bubbles in the graph above. Haploid HiFi depth is 40x / homozygous 80x. Could these bubbles have come from the incorporation of the ONT or Hi-C data?

skoren commented 4 months ago

The coverage on those nodes is based solely on HiFi k-mers, so they couldn't have come from the ONT data (which would have 0 HiFi k-mers), and Hi-C data doesn't add any sequence to the graph. I suspect there is either some low-level somatic variation (if this is a cell line or similar) or some kind of systematic error.

rfinkers commented 4 months ago

It's a diploid plant species, haploid genome ~5 Gb. We are slowly getting some additional information from orthogonal datasets. Some chromosomes are similar, some chromosomes are diverged. At ~7.5 Gb, the total size of the assembly is not that bad (Verkko / p_utg hifiasm ~8 Gb). I see signals that contigs from some chromosomes are nicely phase separated, while in others the hamming rate is extremely high. I can share some more insights, but not via this medium. The update in terms of "using local component coverage and not global coverage" that you commented on three days ago sounds like one strategy that would push things forward. I'm happy to give it a try, even if it is still in a test branch, but will wait until you give the go-ahead @skoren. @Dmitry-Antipov, does it make sense to look further at the mashmap output / hicverkko.colors.tsv? Or is it clear enough that this was not the way forward?

Dmitry-Antipov commented 4 months ago

I have all the data needed for now. I believe it's possible to see at least some improvement on the unphased contigs, but I just haven't implemented the corresponding fix yet. I hope to update this issue at the beginning of next week.

rfinkers commented 4 months ago

I have some other things to do, but can focus more on this project again in August.

rfinkers commented 4 months ago

@Dmitry-Antipov is there an updated branch for me to test? Tnx!

Dmitry-Antipov commented 3 months ago

Hi, sorry for the delay; I didn't finish the experiments before going on vacation.

You can try the het_fixes branch. The constants used there are very aggressive on phasing (attaching the phasing results on the last shared graph), but still some long nodes are not phased; e.g. for utig4-18391 it is really unclear what its homologous node should be, based on sequence similarity. hicverkko.colors10.csv

Anyway, I'd be very careful with the phasing results here, and definitely check them with gene-based verification like compleasm.
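For example, a per-haplotype gene-completeness check might look like the sketch below; the lineage name and fasta paths are placeholders, and the flags follow compleasm's documented `run` mode, so double-check against `compleasm run --help`:

```bash
# Compare completeness/duplication of the two haplotypes (placeholders throughout).
compleasm run -a assembly.haplotype1.fasta -o compleasm_hap1 -l eudicots_odb10 -t 16
compleasm run -a assembly.haplotype2.fasta -o compleasm_hap2 -l eudicots_odb10 -t 16
```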

rfinkers commented 2 months ago

I reran things from scratch, but am getting stuck processing files in 8-hicPipeline. Both mashmap jobs complete (2 results from the first set; 1 result from the second step, paths2ref.mashmap), but then the flow crashes. A restart of slurm results in restarting a lot of the completed jobs (including the long (approx 1..5 weeks) mashmap jobs).

See the DAG of jobs below (snakemake job counts):


job                    count
HiC_rdnascaff              1
alignBWA                  37
hicPhasing                 1
indexBWA                   1
mergeBWA                   1
prepareScaffolding         1
runRukkiHIC                1
scaffoldMergeBWA           1
scaffoldPrefilterBAM      37
transformBWA               1
total                     82

The error likely has to do with a file that could not be found, but I have not yet figured out which one.

Is there a specific batch script left over by Snakemake (batch-script directory) to manually restart only the last part? That would give more fine-grained control over where to start debugging.

Dmitry-Antipov commented 2 months ago

Which job crashed? The tail of its log file should help to figure out what happened.

If you have all the mashmap jobs finished, there are quite a few scripts left to run: a bunch of scaffold_prefilter*.sh scripts, scaffold_mergeBWA.sh, and hic_scaff.sh. Possibly they are already created in your 8-hicPipeline folder; do you see them?

My first guess is that rukki.paths.tsv & rukki.paths.gaf are the files that trigger the rerun, because of suboptimal snakemake pipeline design. Do you see them in the 8-hicPipeline folder?

rfinkers commented 2 months ago

rukki.paths.tsv & rukki.paths.gaf (they are identical?) are there.

The scaffold_prefilter*.sh, scaffold_mergeBWA.sh and hic_scaff.sh scripts are missing.

Error executing rule prepareScaffolding on cluster (jobid: 110, external: 109538, jobscript: /media/ont/ASM3.91/.snakemake/tmp.enu3oklq/verkko.prepareScaffolding.110.sh). For error details see the cluster log and the log files of the involved rule(s).

but cannot find more.

Dmitry-Antipov commented 2 months ago

Then it likely means that the third mashmap job crashed and paths2ref.mashmap is not valid. You can check what the error is in the log: 8-hicPipeline/prepare_ref_mashmap.err

rfinkers commented 2 months ago

---Preparing rukki paths fasta
---Running MashMap
[mashmap] MashMap v3.1.3
[mashmap] Reference = [../8-hicPipeline/paths.hpc.fasta]
[mashmap] Query = [../8-hicPipeline/paths.hpc.fasta]
[mashmap] Kmer size = 19
[mashmap] Sketch size = 690
[mashmap] Segment length = 10000 (read split allowed)
[mashmap] No block length filtering
[mashmap] Chaining gap max = 10000
[mashmap] Mappings per segment = 1
[mashmap] Percentage identity threshold = 80%
[mashmap] Do not skip self mappings
[mashmap] No hypergeometric filter
[mashmap] Mapping output file = /dev/fd/63
[mashmap] Filter mode = 3 (1 = map, 2 = one-to-one, 3 = none)
[mashmap] Execution threads = 8
[mashmap::skch::Sketch::build] minmer windows picked from reference = 737188003
[mashmap::skch::Sketch::index] unique minmers = 40515556
[mashmap::skch::Sketch::computeFreqHist] Frequency histogram of minmer interval points = (2, 10102799) ... (704532, 1)
[mashmap::skch::Sketch::computeFreqHist] With threshold 0.001%, ignore minmers occurring >= 47924 times during lookup.
[mashmap::map] time spent computing the reference index: 417.324 sec
[mashmap::skch::Map::mapQuery] WARNING, no .fai index found for ../8-hicPipeline/paths.hpc.fasta, reading the file to sum sequence length (slow)
[mashmap::skch::Map::mapQuery] mapped 100.00% @ 1.26e+04 bp/s elapsed: 05:01:41:40 remain: 00:00:00:006:04
[mashmap::skch::Map::mapQuery] count of mapped reads = 8976, reads qualified for mapping = 8978, total input reads = 8978, total input bp = 5522795566
[mashmap::map] time spent mapping the query: 4.38e+05 sec
[mashmap::map] mapping results saved in: /dev/fd/63

Dmitry-Antipov commented 2 months ago

This output looks normal, strange.

Can you rerun verkko with the snakemake dry-run option (verkko <> --snakeopts '--dry-run' > dry_rerun.log) and upload the resulting log?
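Spelled out with placeholder read paths (the real arguments should match whatever the original run used; ASM3.91 stands in for your assembly directory):

```bash
verkko -d ASM3.91 \
  --hifi hifi/*.fastq.gz --nano ont/*.fastq.gz \
  --hic1 hic/*_R1.fastq.gz --hic2 hic/*_R2.fastq.gz \
  --snakeopts '--dry-run' > dry_rerun.log 2>&1
```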

rfinkers commented 2 months ago

dry_rerun.log sure, see attachment.

Dmitry-Antipov commented 2 months ago

Ok, it seems that there's a bug in our pipeline logic, which is present in the v2.2 release too and affects runs with the --cleanup option you likely used. Thank you for the help with locating it.

The simplest way to use the current run is to skip the scaffolding step completely. Hi-C phasing is already done, since you have the rukki.paths.gaf file. We added scaffolding after the v2.1 release, so the results should be similar to what you had before, with the exception of more contigs being phased because of my changes for higher heterozygosity in the het_fixes branch. To do so, you should run verkko --paths <asm-dir>/8-hicPipeline/rukki.paths.gaf --assembly <asm-dir> -d <new-dir> --hifi <> --ont <>
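A hedged expansion of that command with placeholder paths (ASM3.91 stands in for the existing assembly directory, and --nano is used here as verkko's usual flag for the ONT reads):

```bash
verkko -d asm_rukki \
  --paths ASM3.91/8-hicPipeline/rukki.paths.gaf \
  --assembly ASM3.91 \
  --hifi hifi/*.fastq.gz --nano ont/*.fastq.gz
```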

If you actually want to run scaffolding and not only phasing, I'll create instructions, but since we do not want to rerun the mashmap jobs it would be a bit tricky.

However, we have had no chance to test scaffolding on such heterozygous data (and we use the diploid structure of the genome in the algorithm), so possibly not running scaffolding is just right.

rfinkers commented 2 months ago

Ok, thanks. I'll try that over the weekend. I'm interested in the scaffolding as well, but that can be a later experiment. I'll first check the phasing step and validate with orthogonal datasets. Will keep you informed. tnx.

Dmitry-Antipov commented 2 months ago

OK, so for scaffolding:

1) Switch off cleanup in verkko.yml (keep_intermediate should be set to True)
2) Run the snakemake pipeline until the mergeBWA rule; the easiest way is to add --until mergeBWA into a copy of snakemake.sh
3) Run snakemake --touch to update timestamps
4) Run the rest of the pipeline with the regular parameters.

For all runs I'd do a dry run first to check that snakemake is not trying to rerun jobs that are already finished.

You can compare prescaf_rukki.paths.tsv and scaff_rukki.paths.tsv to see the effect of scaffolding.
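A rough shell transcription of steps 1-4 above, run from the assembly directory; the exact contents of snakemake.sh and verkko.yml vary per run, so treat this purely as a sketch:

```bash
# 1) in verkko.yml set: keep_intermediate: True
# 2) run the pipeline only up to the mergeBWA rule via a copy of the driver script
cp snakemake.sh snakemake.until_mergeBWA.sh
#    ...append `--until mergeBWA` to the snakemake command inside the copy, then:
bash snakemake.until_mergeBWA.sh
# 3) refresh timestamps so finished jobs are not re-run
#    (use the same snakemake options as in snakemake.sh, plus --touch)
# 4) resume verkko with the regular parameters, dry-run first:
#    verkko -d . <original options> --snakeopts '--dry-run'
```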

rfinkers commented 2 months ago

> Simplest way to use current run is to skip scaffolding step completely. [...] To do so, you should run verkko --paths <asm-dir>/8-hicPipeline/rukki.paths.gaf --assembly <asm-dir> -d <new-dir> --hifi <> --ont <>

What is the right file for --assembly to use?

Dmitry-Antipov commented 2 months ago

> What is the right file for --assembly to use?

Not a file, but the path to the assembly directory.

rfinkers commented 2 months ago

Both haplotypes are now 0 bytes. No apparent errors in the logs. I'll dive a bit deeper.

skoren commented 2 months ago

I'd guess the issue is that the original assembly (passed in --assembly) was not fully complete, so the consensus (--paths) option didn't know how to partition the new consensus into haplotypes. Do the names in the assembly.fasta file start with haplotype1 or haplotype2 or similar? If so, you can still use the assembly.fasta as your result and just take any sequence with a haplotype1 name as hap1 and a haplotype2 name as hap2.
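If the headers really do carry those prefixes, a hedged one-liner to split the multifasta by name (the exact naming convention, e.g. ">haplotype1-...", is an assumption here):

```bash
awk '/^>/ { out = "unassigned.fasta";
            if ($0 ~ /^>haplotype1/) out = "hap1.fasta";
            else if ($0 ~ /^>haplotype2/) out = "hap2.fasta" }
     { print > out }' assembly.fasta
```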

rfinkers commented 1 month ago

No, they were starting with contig-xxxxxx. I couldn't find (so far) any other reason.

skoren commented 1 month ago

Something must have gone wrong with the paths processing. The names in the paths file contain things like haplotype1/2, correct? What was the exact --paths command?

rfinkers commented 4 weeks ago

verkko --paths ASM3.91/8-hicPipeline/rukki.paths.gaf

skoren commented 3 weeks ago

If the file is coming from the 8-hicPipeline folder, it should be phased and the paths should not be named contig-*. If they are, this implies the whole Hi-C pipeline didn't run properly. Can you post a snippet of the file above?
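A quick hedged way to look at the path names without posting the whole file (assumes the haplotype label, if any, is the prefix of the name in column 1):

```bash
cut -f1 ASM3.91/8-hicPipeline/rukki.paths.tsv | head                    # eyeball a few names
cut -f1 ASM3.91/8-hicPipeline/rukki.paths.tsv | grep -c '^haplotype'    # how many carry a haplotype label
```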

Perhaps, given that your original issue is fixed in v2.2.1 and that you had trouble re-generating the consensus, it is easiest to just run a new assembly from scratch with v2.2.1 to make sure we're not chasing restart/mixing problems in the runs.

rfinkers commented 3 weeks ago

Will do so. At another moment, in another ticket, we'll need to look at the memory allocation for mashmap; it is not sufficient (slurm submission) and I need to restart those jobs manually (not a problem for now).

rfinkers commented 1 week ago

Ok, we are making progress here. The latest version does indeed scaffold more into haplotypes (2x). Of the 7.2 Gb base assembly, 2 Gb could not be assigned; the rest is separated approx. 50/50 over the two haplotypes. I'm getting some orthogonal data as well to confirm some of this. We can discuss things later in more detail; first I need to process this.