marbl / verkko

Telomere-to-telomere assembly of accurate long reads (PacBio HiFi, Oxford Nanopore Duplex, HERRO corrected Oxford Nanopore Simplex) and Oxford Nanopore ultra-long reads.

Verkko using HERRO corrected ONT reads #289

Closed. diego-rt closed this issue 1 month ago.

diego-rt commented 2 months ago

Hey there,

I'm exploring using verkko to assemble some relatively large satellites and segmental duplications in a targeted fashion.

Since the typical HiFi reads + UL reads haven't been sufficient to resolve these tangles, I'm considering using ONT HERRO error corrected reads directly instead of HiFi reads. Do you have any suggestions on the parameters to tune for this use case?

  1. For instance, would you recommend increasing the max k-mer for MBG? And if so, is there any concern in increasing it too much? I don't understand why there is a cap in the first place, since my understanding is that it would typically be limited by the read length anyway.

  2. I guess it's a bad idea to mix error corrected ONT and HiFi reads together given the different error profiles, right?

  3. Does it still make sense to use the ONT reads in --nano when using error corrected ONT reads in the --hifi parameter? I guess it's rather pointless, right?

  4. Or would you instead keep the HiFi reads as --hifi and pass the error corrected ONT reads in the --nano parameter, but with a much stricter --min-identity threshold like 0.999 (Q30) or 0.9999 (Q40)?

Many thanks in advance!

skoren commented 2 months ago

Is this a targeted assembly or targeted sequencing? Verkko depends on coverage being relatively even in each assembly component (typically a chromosome) so if you have enriched some genomic loci it could confuse the repeat/unique detection and harm resolution.

  1. I wouldn't increase the MBG k-mer since the HERRO corrected reads may have more noise than HiFi. You can increase the max k for the multiplex step using the --max-k 35000 parameter to verkko, up from the default of 15 kbp.
  2. We mix HiFi and HERRO reads with good results, it should be fine to use them together. If you end up with very high coverage (>100x) or have older HERRO data (R9, or R10 prior to Q28), it can be useful to increase --unitig-abundance from the default of 2 to 4 or 5. This is the minimum average coverage an initial MBG unitig needs to be kept. You can run with defaults and compare the total bases and node counts in the MBG graph to the HiFi-only result; if they are much higher, especially the total bases, then definitely try increasing the abundance parameter.
  3. Yes, we typically provide HERRO reads as HiFi reads and the uncorrected original ONT reads as the --nano reads (see the command sketch after this list). There are some cases where HERRO can split reads or potentially introduce haplotype errors that the original reads can help address during gap-filling/resolution.
  4. We haven't tried that; usually we combine HiFi + HERRO data as --hifi. You could try it out and report back how well it works.
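
To make the inputs concrete, a minimal command sketch for this setup (file names and output directory are placeholders; --hifi, --nano, --max-k, and --unitig-abundance are the verkko options discussed above):

# HERRO-corrected ONT plus HiFi together as the "HiFi" input,
# raw ultra-long ONT as the --nano input; adjust paths to your data.
verkko -d asm_herro \
  --hifi hifi.fastq.gz herro_corrected.fasta.gz \
  --nano ont_ultralong.fastq.gz \
  --max-k 35000 \
  --unitig-abundance 4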
diego-rt commented 1 month ago

Hi @skoren

Sorry for the late reply, but I was testing out your suggestions. In this case it is a targeted assembly and I'm trying to patch the gaps in a hifiasm assembly.

  1. I tested with up to --max-k 50000 using a combination of HiFi and dorado-corrected ONT reads, and in most assembly jobs it gave results comparable to --max-k 15000, while in some satellites it managed to resolve things that were previously unresolved. Note that these are targeted assemblies of specific loci of 5-100 Mb, so perhaps the chances of encountering some problematic edge case are lower than in a whole genome assembly.
  2. I can confirm that it works quite well to mix both data types. I used --unitig-abundance 4 in my case, with 56x ONT reads and 72x HiFi reads.
  3-4. I can confirm that it is better to provide the original ONT reads in the --nano option. I think a bigger problem is that dorado correct is currently discarding reads in satellites because of dropped high-frequency minimisers (at least in my genome), so there are plenty of regions missing from the corrected reads.

I also wanted to ask: do you have any advice on the easiest way to generate pseudohaplotypes or primary assemblies from the verkko graph? Unfortunately we don't have Hi-C data for our sample. At the moment this limits us to using verkko to patch up only homozygous assembly gaps, which is a shame because it's managing to resolve a lot of heterozygous ones that hifiasm can't handle.

Many thanks!

skoren commented 1 month ago

As for discarded dorado reads, I thought this was addressed through updates to the mapping parameters before correction in a recent version?

Unfortunately there is no easy way to obtain pseudohaplotypes from verkko; it's designed to expect some phasing information like trio, Hi-C, or Strand-seq. You could try using the hifiasm primary/alt assemblies to make marker databases for use with verkko trio, but that will lead to issues in areas where hifiasm has no assembly or where it made a haplotype choice incompatible with verkko's graph.
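
A rough sketch of that workaround, assuming meryl and merqury are available; the k size, file names, and the --hap-kmers invocation are illustrative and should be checked against the verkko and merqury documentation:

# Hypothetical: treat the hifiasm primary/alt assemblies as the two haplotypes,
# build k-mer databases, extract haplotype-specific markers, then run verkko trio.
meryl count k=30 output primary.meryl hifiasm.primary.fasta
meryl count k=30 output alt.meryl hifiasm.alt.fasta
meryl count k=30 output reads.meryl hifi.fastq.gz herro_corrected.fasta.gz

# merqury's hapmers.sh writes haplotype-specific databases
# (primary.hapmer.meryl and alt.hapmer.meryl)
$MERQURY/trio/hapmers.sh primary.meryl alt.meryl reads.meryl

verkko -d asm_pseudohap \
  --hifi hifi.fastq.gz herro_corrected.fasta.gz \
  --nano ont_ultralong.fastq.gz \
  --hap-kmers primary.hapmer.meryl alt.hapmer.meryl trio

As noted above, expect trouble wherever hifiasm left gaps or made haplotype choices that conflict with the verkko graph.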

diego-rt commented 1 month ago

Well, in my experience with dorado 0.7.3 I still lose all the reads at satellites, so I don't think this has been addressed yet.

In my case I'm just assembling tangles and gaps in a targeted fashion, so the graphs are relatively simple. Would I get a 'correct' contig if I joined a random walk through the graph from one tip to the other (assuming no complex graph topologies occur and it's just an alternating succession of homozygous and heterozygous nodes)? I assume there are no overlaps between nodes, right?

Thanks!

skoren commented 1 month ago

It should be OK to do the random walk, yes. However, the verkko graphs are in HPC space, so you can't just take the node paths directly or you'd end up with a homopolymer-compressed sequence. Instead, take a look at the paths option: https://github.com/marbl/verkko?tab=readme-ov-file#consensus-for-user-provided-paths, to which you can give GAF-formatted random walks and it will generate a new consensus.
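
As a sketch of what that can look like (node names borrowed from the GAF excerpts later in this thread; the walk itself and the file names are hypothetical, and the exact path-file format and --paths/--assembly usage should be checked against the README section linked above):

# One walk per line: a name, then the oriented node traversal
# (">" = forward, "<" = reverse), e.g.
#   patch_walk_1    <utig1-382>utig1-383
# saved as my.paths.gaf

verkko -d asm_paths \
  --hifi hifi.fastq.gz herro_corrected.fasta.gz \
  --nano ont_ultralong.fastq.gz \
  --paths my.paths.gaf \
  --assembly asm_herro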

diego-rt commented 1 month ago

Oh that's great to know!! Thanks a lot!

cblazier commented 1 month ago

> Well, in my experience with dorado 0.7.3 I still lose all the reads at satellites, so I don't think this has been addressed yet.
>
> In my case I'm just assembling tangles and gaps in a targeted fashion, so the graphs are relatively simple. Would I get a 'correct' contig if I joined a random walk through the graph from one tip to the other (assuming no complex graph topologies occur and it's just an alternating succession of homozygous and heterozygous nodes)? I assume there are no overlaps between nodes, right?
>
> Thanks!

Watching this thread as I'm doing something similar to you. Definitely switch to Dorado v0.8.0 immediately: as skoren suggests, the new Dorado fixes some of what was broken while trying to simplify HERRO in the previous versions of dorado correct. Also, you can now do the batching and inference as separate steps, which is very convenient: you can use a large-RAM CPU node for the all-vs-all batching and only need a GPU for the inference.
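
A sketch of that two-step workflow, assuming dorado v0.8.0's options for splitting correction into a mapping stage and an inference stage (check dorado correct --help for the exact flags):

# Step 1 (large-RAM CPU node): all-vs-all overlap computation only.
dorado correct reads.fastq.gz --to-paf > overlaps.paf

# Step 2 (GPU node): run the correction inference from the precomputed overlaps.
dorado correct reads.fastq.gz --from-paf overlaps.paf > herro_corrected.fasta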

diego-rt commented 1 month ago

I'm going to give it a try, but my understanding was that the change in the number of max alignments was already introduced in v0.7.3, and that has not resolved it for me. The splitting of alignment and inference is certainly cool, but I'm not sure it will resolve this issue.

cblazier commented 1 month ago

That's what they claimed, but v0.7.3 introduced so many new errors compared to the previous version that I didn't really trust the output.

diego-rt commented 1 month ago

I can confirm that I get an identical number of corrected reads between dorado 0.7.3 and 0.8.0.

diego-rt commented 1 month ago

Hi @skoren

Sorry, one more thing. Do you have any thoughts or experience with modifying the --min-identity value now that ONT reads are solidly Q20-Q25?

Would you raise it to something like 98%? Do you think this could help with resolution at satellites, or is it better not to modify it?

I got the following numbers in the ONT alignment log with --min-identity 0.98. Is the number of end-to-end alignments a concern?

Alignment finished
Input reads: 49755 (2125212955bp)
Seeds found: 246383275
Seeds extended: 239162
Reads with a seed: 49755 (2125212955bp)
Reads with an alignment: 47914 (1943849209bp)
Alignments: 52716 (1955775316bp) (751935 additional alignments discarded)
End-to-end alignments: 2429 (89758180bp)
Alignment broke with some reads. Look at stderr output.
skoren commented 1 month ago

Sorry, I missed this comment. I haven't experimented much with changing identity, so I'm not sure if it makes a difference. The end-to-end alignments seem similar to what we see with default parameters; here's a recent human run:

Alignment finished
Input reads: 125041 (2104242421bp)
Seeds found: 423713155
Seeds extended: 276352
Reads with a seed: 59513 (2015050941bp)
Reads with an alignment: 55280 (1958629756bp)
Alignments: 63145 (2031584274bp) (75318 additional alignments discarded)
End-to-end alignments: 580 (13508030bp)
diego-rt commented 1 month ago

Hi @skoren,

From my testing, it seems that raising --min-identity to high values like 0.98 causes a large number of GraphAligner alignments to be fragmented and output as separate records, rather than threading through multiple nodes. I assume this is bad for resolution and that verkko expects a single alignment?

I find this quite unexpected since the identity of the read below is around 99.6% according to the id tag.

With --min-identity 0.98:

5a9db5cb-583e-46af-8dae-36e803fe792c    61206   30050   61189   +       <utig1-382>utig1-383    283186  181695  212843  31104   31168   60      NM:i:64 AS:f:27939.6    dv:f:0.00205339 id:f:0.997947
5a9db5cb-583e-46af-8dae-36e803fe792c    61206   11365   29345   +       <utig1-382      201283  163003  180965  17937   17994   60      NM:i:57 AS:f:15130.6    dv:f:0.00316772 id:f:0.996832
5a9db5cb-583e-46af-8dae-36e803fe792c    61206   0       11074   +       <utig1-382      201283  151625  162710  11058   11096   60      NM:i:38 AS:f:9174.38    dv:f:0.00342466 id:f:0.996575

With defaults (same read):

5a9db5cb-583e-46af-8dae-36e803fe792c    61206   0   61206   +   <utig1-382>utig1-383    283186  151625  212860  60987   61339   60  NM:i:352    AS:f:58861.7    dv:f:0.0057386  id:f:0.994261
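
To check whether this fragmentation is widespread rather than specific to this read, something like the following (GAF column 1 is the read name; the file name is a placeholder, so run it on both GAFs) counts alignment records per read:

awk -F'\t' '!($1 in n) {total++} {n[$1]++}
     END {for (r in n) if (n[r] > 1) frag++;
          printf "%d of %d reads have fragmented alignments\n", frag, total}' ont-align.gaf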

I was hoping that by raising the stringency of this filter one could resolve tangles a bit better. For example, by raising --max-k to 50000 I've managed to resolve this satellite to the state shown in the image below. But it seems like the HERRO quality is not good enough to raise this threshold further, and I was hoping the nanopore alignments could still help a bit.

Do you have any suggestions on what I can try next?

[attached image: unitigs graph]

skoren commented 1 month ago

It's not surprising that the higher stringency is breaking the alignments. The error rate is an average over a quite long read, so 99.4% identity corresponds to about 400 errors. If these are concentrated in the read and not spread over it evenly, I can see that some alignments will get split because you're limiting the bounds off the diagonal.

The tangle looks quite large and 2+ copy, so I'm not sure there is much you can do to resolve it; not every tangle can be resolved with the data. You could start by removing some low-coverage, likely spurious nodes and re-aligning the ONT reads to the updated graph to see if it makes a difference. You can also generate some candidate traversals and compare them to each other; if they are all within some very similar identity, you could accept a random one.
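
A minimal sketch of that re-alignment step, assuming the cleaned graph is saved as a GFA; note that verkko's internal GraphAligner invocation uses additional seeding parameters, so a standalone run like this is only an approximation (file names are placeholders):

GraphAligner \
  -g cleaned-graph.gfa \
  -f ont_ultralong.fastq.gz \
  -a ont-realigned.gaf \
  -x vg \
  -t 16 \
  --precise-clipping 0.98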

diego-rt commented 1 month ago

Hi @skoren ,

Many thanks once again for your feedback and discussion. It is greatly appreciated!

  1. Regarding the identity cutoff for the alignments, yes that makes perfect sense. I just expected a different behaviour after reading the GraphAligner description for this flag, which led me to believe that it would drop alignments whose average identity was below XX%, so I didn't understand why it would create internal gaps.
  --precise-clipping arg     clip the alignment ends with arg as the identity 
                             cutoff between correct / wrong alignments (double)
                             (default 0.66)
  2. And thanks as well for the suggestion on tangle resolution; I think that's quite reasonable. Just one last thing: you mentioned the usefulness of the --unitig-abundance flag for purging spurious nodes derived from read errors. However, do you have any experience with the -R flag in MBG (the hidden --max-r in verkko, for any reader passing by) for discarding low-coverage k-mers?

My reasoning is that when combining HiFi reads that have a good length profile (good coverage above 20 kbp) with HERRO-corrected ONT reads, it may be safe to increase -R further, perhaps up to 10k or so. Also, when assembling satellites, maybe one needs somewhat larger k-mers to distinguish sequencing mistakes from real variants? On that note, how are "low coverage k-mers" defined? n=1? I assume this is not through the -a flag?

-R: allow the multiplex DBG resolution to remove low coverage k-mers when k<R. Often low coverage short k-mers are sequencing errors so a low R (around 1000-4000) can improve the assembly by removing sequencing errors.
  3. And lastly, I've been playing with the max k-mer values and wanted to ask how I should interpret these results. If I understand correctly, once a k-mer value is listed as "replaced 0 nodes with 0 nodes", does that mean the resolution created a tip that was then discarded by one of the other filters? In this specific case, would that mean it is detrimental to raise the max k-mer length above 49098?
try resolve k=46616, replaced 1 nodes with 3 nodes, unitigified 4 nodes to 2 nodes
try resolve k=46617, replaced 1 nodes with 3 nodes, unitigified 4 nodes to 2 nodes
try resolve k=46988, replaced 1 nodes with 3 nodes, unitigified 4 nodes to 2 nodes
try resolve k=46989, replaced 1 nodes with 3 nodes, unitigified 4 nodes to 2 nodes
try resolve k=47025, replaced 0 nodes with 0 nodes
try resolve k=47600, replaced 1 nodes with 3 nodes, unitigified 4 nodes to 2 nodes
try resolve k=47601, replaced 1 nodes with 3 nodes, unitigified 4 nodes to 2 nodes
try resolve k=47693, replaced 0 nodes with 0 nodes
try resolve k=48583, replaced 1 nodes with 3 nodes, unitigified 4 nodes to 2 nodes
try resolve k=48584, replaced 1 nodes with 3 nodes, unitigified 4 nodes to 2 nodes
try resolve k=48822, replaced 1 nodes with 3 nodes, unitigified 4 nodes to 2 nodes
try resolve k=48823, replaced 1 nodes with 3 nodes, unitigified 4 nodes to 2 nodes
try resolve k=48894, replaced 1 nodes with 3 nodes, unitigified 4 nodes to 2 nodes
try resolve k=48895, replaced 1 nodes with 3 nodes, unitigified 4 nodes to 2 nodes
try resolve k=49097, replaced 1 nodes with 3 nodes, unitigified 4 nodes to 2 nodes
try resolve k=49098, replaced 1 nodes with 3 nodes, unitigified 4 nodes to 2 nodes
try resolve k=49189, replaced 0 nodes with 0 nodes
try resolve k=50390, replaced 0 nodes with 0 nodes
try resolve k=50628, replaced 0 nodes with 0 nodes
try resolve k=51371, replaced 0 nodes with 0 nodes
try resolve k=51374, replaced 0 nodes with 0 nodes
try resolve k=52272, replaced 0 nodes with 0 nodes
try resolve k=52402, replaced 0 nodes with 0 nodes
try resolve k=52742, replaced 0 nodes with 0 nodes
try resolve k=53558, replaced 0 nodes with 0 nodes
try resolve k=53769, replaced 0 nodes with 0 nodes
try resolve k=54734, replaced 0 nodes with 0 nodes
try resolve k=54738, replaced 0 nodes with 0 nodes
try resolve k=55023, replaced 0 nodes with 0 nodes
try resolve k=55332, replaced 0 nodes with 0 nodes
try resolve k=55454, replaced 0 nodes with 0 nodes
try resolve k=55963, replaced 1 nodes with 4 nodes, unitigified 8 nodes to 2 nodes
try resolve k=58338, replaced 0 nodes with 0 nodes
try resolve k=58565, replaced 0 nodes with 0 nodes
try resolve k=58611, replaced 0 nodes with 0 nodes
try resolve k=58821, replaced 0 nodes with 0 nodes
try resolve k=59395, replaced 0 nodes with 0 nodes
try resolve k=60554, replaced 0 nodes with 0 nodes
try resolve k=60674, replaced 0 nodes with 0 nodes
try resolve k=61364, replaced 0 nodes with 0 nodes
try resolve k=61967, replaced 0 nodes with 0 nodes
try resolve k=63714, replaced 0 nodes with 0 nodes
try resolve k=64115, replaced 0 nodes with 0 nodes
try resolve k=64184, replaced 0 nodes with 0 nodes
try resolve k=64452, replaced 0 nodes with 0 nodes
try resolve k=64623, replaced 0 nodes with 0 nodes
try resolve k=65403, replaced 0 nodes with 0 nodes
try resolve k=65467, replaced 0 nodes with 0 nodes
try resolve k=67024, replaced 0 nodes with 0 nodes
try resolve k=68422, replaced 0 nodes with 0 nodes
try resolve k=68686, replaced 0 nodes with 0 nodes
try resolve k=68739, replaced 0 nodes with 0 nodes
try resolve k=70121, replaced 0 nodes with 0 nodes
try resolve k=70645, replaced 0 nodes with 0 nodes
try resolve k=70695, replaced 0 nodes with 0 nodes
try resolve k=71391, replaced 0 nodes with 0 nodes
try resolve k=71963, replaced 0 nodes with 0 nodes
try resolve k=76023, replaced 0 nodes with 0 nodes
try resolve k=78135, replaced 0 nodes with 0 nodes
try resolve k=79093, replaced 0 nodes with 0 nodes
try resolve k=81682, replaced 0 nodes with 0 nodes
try resolve k=81898, replaced 0 nodes with 0 nodes
try resolve k=83439, replaced 0 nodes with 0 nodes
try resolve k=85424, replaced 0 nodes with 0 nodes
try resolve k=86250, replaced 0 nodes with 0 nodes
try resolve k=88606, replaced 0 nodes with 0 nodes
try resolve k=93899, replaced 0 nodes with 0 nodes
try resolve k=95515, replaced 0 nodes with 0 nodes
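
For what it's worth, a quick way to pull out the last k at which the multiplex resolution actually changed the graph, assuming the log lines look like the ones above (the log file name is a placeholder):

grep "try resolve k=" mbg.log | grep -v "replaced 0 nodes with 0 nodes" | tail -n 1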

Many thanks once again for your help!