Closed diego-rt closed 1 month ago
Is this a targeted assembly or targeted sequencing? Verkko depends on coverage being relatively even in each assembly component (typically a chromosome) so if you have enriched some genomic loci it could confuse the repeat/unique detection and harm resolution.
--max-k 35000
parameter to verkko from the default of 15 kbp. --unitig-abundance
from the default of 2 to 4 or 5. This is the minimum average coverage of the initial MBG unitigs to keep. You can run with defaults and compare the total bases in the MBG graph and node counts to the HiFi-only result. If it is much higher, especially the total bases, then definitely try increasing the abundance parameter.--nano
reads. There are some cases where HERRO can split reads or potentially introduce haplotype errors that the original reads can help address during gap-filling/resolution--hifi
, you could try it out and report back how it well it works.Hi @skoren
Sorry for the late reply but I was testing out your suggestions. In this case it is targeted assembly and I'm trying to patch the gaps in a hifiasm assembly.
--max-k 50000
using a combination of HiFi and dorado corrected ONT reads and in most assembly jobs it gave comparable results as --max-k 15000
and in some satellites it managed to resolve things that were previously unresolved. Note that these are targeted assemblies of specific loci of 5-100 Mb so perhaps the chances of encountering some problematic edge case are lower than in a whole genome assembly. --unitig-abundance 4
in my case. I have 56x ONT reads and 72x HiFi reads.
3-4. I can confirm that it is better to provide the original ONT reads in the --nano
option. I think a bigger problem is that dorado correct is currently discarding reads in satellites because of dropped high frequency minimisers (at least in my genome). So there are plenty of regions missing in the corrected reads.I also wanted to ask, do you have any advice on the easiest way to generate pseudohaplotypes or primary assemblies from the verkko graph? Unfortunately we don't have HiC data for our sample. At the moment this limits us to using verkko to patch up only homozygous assembly gaps, which is a shame because it's managing to resolve a lot of heterozygous ones as well that hifiasm can't handle.
Many thanks!
As for discarded dorado reads, I thought this was addressed though updates to the mapping parameters before correction in a recent version?
Unfortunately not an easy way to obtain pseudohaplotypes from verkko, it's designed to expect some phasing information like trio, hic, or strandseq. You could try using the hifiasm primary/alt assemblies to make marker databases for use with verkko trio but that will lead to issues in areas where there is no assembly in hifiasm or where it made a haplotype choice incompatible with verkko's graph.
Well, in my experience with dorado 0.7.3 I still lose all the reads at satellites, so I don't think this has been addressed yet.
In my case I'm just assembling tangles and gaps in a targeted fashion so the graphs are relatively simple. Would I get a 'correct' contig if I joined a random walk through the graph from one tip to the other (assuming no complex graph topologies occur and its just an alternating succession of homozygous and heterozygous nodes)? I assume there are no overlaps between nodes right?
Thanks!
It should be OK to do the random walk, yes. However, the verkko graphs are in HPC-space so you can't just take the node paths as you'd end up with an HPC sequence. Instead, take a look at the paths option: https://github.com/marbl/verkko?tab=readme-ov-file#consensus-for-user-provided-paths which you can give GAF-formatted random walks and it will generate a new consensus.
Oh that's great to know!! Thanks a lot!
Well, in my experience with dorado 0.7.3 I still lose all the reads at satellites, so I don't think this has been addressed yet.
In my case I'm just assembling tangles and gaps in a targeted fashion so the graphs are relatively simple. Would I get a 'correct' contig if I joined a random walk through the graph from one tip to the other (assuming no complex graph topologies occur and its just an alternating succession of homozygous and heterozygous nodes)? I assume there are no overlaps between nodes right?
Thanks!
Watching this thread as I'm doing something similar to you--definitely switch to Dorado v0.8.0 immediately--as skoren suggests, the new Dorado fixes some of what they broke trying to simplify HERRO in the previous versions of dorado correct. Also, you can do the batching and inference as separate steps, which is very convenient--you can use a large RAM CPU node for all-vs-all batching and only need GPU for inference.
I'm going to give it a try but my understanding was that the change in the number of max alignments was already introduced in v0.7.3 and that has not resolved it for me. The splitting of alignment and inference is certainly cool but I'm not sure it will resolve this issue.
That's what they claimed, but v0.7.3 introduced so many new errors compared to the previous version that I didn't really trust the output.
I can confirm that I get an identical number of reads corrected between dorado 0.7.3 and 0.8.0
Hi @skoren
Sorry, one more thing. Do you have any thoughts or experience with modifying the --min-identity
value now that ONT reads are solidly Q20-Q25 ?
Would you raise it to something like 98% ? Do you think this could help with resolution at satellites or it's better not to modify it?
I got the following numbers in the ONT alignment log with --min-identity 0.98
. Is the number of end-to-end alignments a concern?
Alignment finished
Input reads: 49755 (2125212955bp)
Seeds found: 246383275
Seeds extended: 239162
Reads with a seed: 49755 (2125212955bp)
Reads with an alignment: 47914 (1943849209bp)
Alignments: 52716 (1955775316bp) (751935 additional alignments discarded)
End-to-end alignments: 2429 (89758180bp)
Alignment broke with some reads. Look at stderr output.
Sorry, I missed this comment. Haven't experimented much with changing identity so I'm not sure if it makes a difference. The end-to-end alignments seem similar to what we see with default parameters, here's a recent human run:
Alignment finished
Input reads: 125041 (2104242421bp)
Seeds found: 423713155
Seeds extended: 276352
Reads with a seed: 59513 (2015050941bp)
Reads with an alignment: 55280 (1958629756bp)
Alignments: 63145 (2031584274bp) (75318 additional alignments discarded)
End-to-end alignments: 580 (13508030bp)
Hi @skoren,
From my testing, it seems like raising --min-identity
to high values like 0.98 results in a large number of GraphAligner alignments being fragmented and being output on different lines, rather than threading through multiple nodes. I assume this is bad for resolution and verkko expects a single alignment?
I find this quite unexpected since the identity of the read below is around 99.6% according to the id
tag.
With --min-identity 0.98
:
5a9db5cb-583e-46af-8dae-36e803fe792c 61206 30050 61189 + <utig1-382>utig1-383 283186 181695 212843 31104 31168 60 NM:i:64 AS:f:27939.6 dv:f:0.00205339 id:f:0.997947
5a9db5cb-583e-46af-8dae-36e803fe792c 61206 11365 29345 + <utig1-382 201283 163003 180965 17937 17994 60 NM:i:57 AS:f:15130.6 dv:f:0.00316772 id:f:0.996832
5a9db5cb-583e-46af-8dae-36e803fe792c 61206 0 11074 + <utig1-382 201283 151625 162710 11058 11096 60 NM:i:38 AS:f:9174.38 dv:f:0.00342466 id:f:0.996575
With defaults (same read):
5a9db5cb-583e-46af-8dae-36e803fe792c 61206 0 61206 + <utig1-382>utig1-383 283186 151625 212860 60987 61339 60 NM:i:352 AS:f:58861.7 dv:f:0.0057386 id:f:0.994261
I was hoping that by raising the stringency of this filter one could resolve tangles a bit better. For example, by raising the --max-k
to 50000
I've managed to resolve this satellite to this state. But it seems like the herro quality is not good enough to further raise this threshold and I was hoping the nanopore alignments could still help a bit.
Do you have any suggestions on what I can try next?
It's not surprising that the higher stringency is breaking the alignments. The error rate is an average over a quite long read so 99.4 is about 400 errors. If these are concentrated in the read and not spread over it evenly, I can see that some alignments will get split because you're limiting the bounds off the diagonal.
The tangle looks quite large and 2+ copy so I'm not sure there is much you can do to resolve it, not every tangle can be resolved with the data. You could try to start by removing some lower coverage likely spurious nodes and attempting to re-align the ONT reads to the updated graph to see if it makes a difference. You can also generate some candidate traversals and compare them to each other, if they are all within some very similar identity you could accept a random one.
Hi @skoren ,
Many thanks once again for your feedback and discussion. It is greatly appreciated!
--precise-clipping arg clip the alignment ends with arg as the identity
cutoff between correct / wrong alignments (double)
(default 0.66)
--min-unitig-abundance
flag for purging spurious nodes derived from mistakes in the reads. However, do you have any experience with the -R
flag in MBG (hidden --max-r
in verkko, for any reader passing by) to ditch low coverage kmers? My reasoning is that when combining HiFi reads with a good length profile (good coverage > 20kbp) with HERRO corrected ONT, maybe it is safe to further increase -R
to maybe up to 10k or so? Also, when assembling satellites, maybe one needs somewhat larger kmers to distinguish sequencing mistakes from variants? Along that note, how are "low coverage k-mers" defined? n=1? I assume this is not through the -a
flag?
-R: allow the multiplex DBG resolution to remove low coverage k-mers when k<R. Often low coverage short k-mers are sequencing errors so a low R (around 1000-4000) can improve the assembly by removing sequencing errors.
replaced 0 nodes with 0 nodes
, does that mean that this created a tip that was omitted because of some of the other filters? In this specific case, this would mean that it is detrimental to raise the max kmer length above 49098?try resolve k=46616, replaced 1 nodes with 3 nodes, unitigified 4 nodes to 2 nodes
try resolve k=46617, replaced 1 nodes with 3 nodes, unitigified 4 nodes to 2 nodes
try resolve k=46988, replaced 1 nodes with 3 nodes, unitigified 4 nodes to 2 nodes
try resolve k=46989, replaced 1 nodes with 3 nodes, unitigified 4 nodes to 2 nodes
try resolve k=47025, replaced 0 nodes with 0 nodes
try resolve k=47600, replaced 1 nodes with 3 nodes, unitigified 4 nodes to 2 nodes
try resolve k=47601, replaced 1 nodes with 3 nodes, unitigified 4 nodes to 2 nodes
try resolve k=47693, replaced 0 nodes with 0 nodes
try resolve k=48583, replaced 1 nodes with 3 nodes, unitigified 4 nodes to 2 nodes
try resolve k=48584, replaced 1 nodes with 3 nodes, unitigified 4 nodes to 2 nodes
try resolve k=48822, replaced 1 nodes with 3 nodes, unitigified 4 nodes to 2 nodes
try resolve k=48823, replaced 1 nodes with 3 nodes, unitigified 4 nodes to 2 nodes
try resolve k=48894, replaced 1 nodes with 3 nodes, unitigified 4 nodes to 2 nodes
try resolve k=48895, replaced 1 nodes with 3 nodes, unitigified 4 nodes to 2 nodes
try resolve k=49097, replaced 1 nodes with 3 nodes, unitigified 4 nodes to 2 nodes
try resolve k=49098, replaced 1 nodes with 3 nodes, unitigified 4 nodes to 2 nodes
try resolve k=49189, replaced 0 nodes with 0 nodes
try resolve k=50390, replaced 0 nodes with 0 nodes
try resolve k=50628, replaced 0 nodes with 0 nodes
try resolve k=51371, replaced 0 nodes with 0 nodes
try resolve k=51374, replaced 0 nodes with 0 nodes
try resolve k=52272, replaced 0 nodes with 0 nodes
try resolve k=52402, replaced 0 nodes with 0 nodes
try resolve k=52742, replaced 0 nodes with 0 nodes
try resolve k=53558, replaced 0 nodes with 0 nodes
try resolve k=53769, replaced 0 nodes with 0 nodes
try resolve k=54734, replaced 0 nodes with 0 nodes
try resolve k=54738, replaced 0 nodes with 0 nodes
try resolve k=55023, replaced 0 nodes with 0 nodes
try resolve k=55332, replaced 0 nodes with 0 nodes
try resolve k=55454, replaced 0 nodes with 0 nodes
try resolve k=55963, replaced 1 nodes with 4 nodes, unitigified 8 nodes to 2 nodes
try resolve k=58338, replaced 0 nodes with 0 nodes
try resolve k=58565, replaced 0 nodes with 0 nodes
try resolve k=58611, replaced 0 nodes with 0 nodes
try resolve k=58821, replaced 0 nodes with 0 nodes
try resolve k=59395, replaced 0 nodes with 0 nodes
try resolve k=60554, replaced 0 nodes with 0 nodes
try resolve k=60674, replaced 0 nodes with 0 nodes
try resolve k=61364, replaced 0 nodes with 0 nodes
try resolve k=61967, replaced 0 nodes with 0 nodes
try resolve k=63714, replaced 0 nodes with 0 nodes
try resolve k=64115, replaced 0 nodes with 0 nodes
try resolve k=64184, replaced 0 nodes with 0 nodes
try resolve k=64452, replaced 0 nodes with 0 nodes
try resolve k=64623, replaced 0 nodes with 0 nodes
try resolve k=65403, replaced 0 nodes with 0 nodes
try resolve k=65467, replaced 0 nodes with 0 nodes
try resolve k=67024, replaced 0 nodes with 0 nodes
try resolve k=68422, replaced 0 nodes with 0 nodes
try resolve k=68686, replaced 0 nodes with 0 nodes
try resolve k=68739, replaced 0 nodes with 0 nodes
try resolve k=70121, replaced 0 nodes with 0 nodes
try resolve k=70645, replaced 0 nodes with 0 nodes
try resolve k=70695, replaced 0 nodes with 0 nodes
try resolve k=71391, replaced 0 nodes with 0 nodes
try resolve k=71963, replaced 0 nodes with 0 nodes
try resolve k=76023, replaced 0 nodes with 0 nodes
try resolve k=78135, replaced 0 nodes with 0 nodes
try resolve k=79093, replaced 0 nodes with 0 nodes
try resolve k=81682, replaced 0 nodes with 0 nodes
try resolve k=81898, replaced 0 nodes with 0 nodes
try resolve k=83439, replaced 0 nodes with 0 nodes
try resolve k=85424, replaced 0 nodes with 0 nodes
try resolve k=86250, replaced 0 nodes with 0 nodes
try resolve k=88606, replaced 0 nodes with 0 nodes
try resolve k=93899, replaced 0 nodes with 0 nodes
try resolve k=95515, replaced 0 nodes with 0 nodes
Many thanks once again for your help!
Hey there,
I'm exploring using verkko to assemble some relatively large satellites and segmental duplications in a targeted fashion.
Since the typical HiFi reads + UL reads haven't been sufficient to resolve these tangles, I'm considering using ONT HERRO error corrected reads directly instead of HiFi reads. Do you have any suggestions on the parameters to tune for this use case?
For instance, would you recommend increasing the max k-mer for MBG? And if so, is there any concern in increasing it too much? I don't understand why is there this cap since my understanding is that it would typically be limited by the read length anyway?
I guess its a bad idea to mix together error corrected ONT and HiFi reads given the mix of error profiles right?
Does it make sense to still use the ONT reads in the
--nano
when using error corrected ONT reads in the--hifi
parameter? I guess its rather pointless right?Or would you rather use HiFi reads as HiFi reads and rather use the error corrected ONT reads in the
--nano
parameter but with a much stricter--min-identity
threshold like 0.999 (Q30) or 0.9999 (Q40) ?Many thanks in advance!