vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/

Mapping noisy long-reads to assembly graphs #724

Open photocyte opened 7 years ago

photocyte commented 7 years ago

Hi there,

I'd be interested in mapping noisy long-reads to assembly graphs (in GFA format), to untangle paths through repetitive parts of the graphs.

Here is another GitHub issue (https://github.com/jts/sga/issues/92), which states that vg can map reads to GFA graphs, at least conceptually, but I expect the mapping algorithm in vg may not be robust to noisy long-reads.

Is mapping noisy long-reads to vg graphs supported? E.g. by an algorithm akin to https://github.com/isovic/graphmap ?

All the best, -Tim

edawson commented 7 years ago

Hey Tim,

I've mapped some Nanopore reads (6.5-7 kb in some cases, both 1D and 2D) to a graph built from a FASTA + VCF. @shilpagarg has been working to build an assembly graph (using vg) from PacBio reads and then work from that.

We can certainly read GFA with vg, and the mapper has a banded mode that should work with long reads. I'd encourage you to try your pipeline and let us know if you hit any issues.

-Eric

ekg commented 7 years ago

@photocyte

It's definitely possible to align long reads to assembly graphs using vg. When reads have a low error rate, this works almost perfectly. I have seen problems as the noise level increases, but it's not clear to me today which of those problems were generic to the aligner and which were specifically due to working with long noisy sequences. There's been a huge amount of improvement in the mapper recently, but I've only been able to validate that in long acyclic graphs.

I do most of my evaluation using simulations (https://github.com/vgteam/vg/blob/master/scripts/map-sim), which are driven by native tools in vg, so you could definitely try out the same techniques on your target graph, with the read error rates you expect, to get a sense of how well vg will work for your application.

vg sim simulates reads from the graph and expresses each one as the true alignment that the read should take through the graph, including errors. If the output alignments are passed to vg map --compare as GAM input, then vg map will print a description of the overlap between the new and original alignments. This gives a very quick sense of how well the mapper is performing.
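To make the idea of "overlap between the new and old alignments" concrete, here is a minimal Python sketch. It is not vg's implementation and does not read GAM; it assumes each alignment has already been reduced to a list of (node id, bases aligned on that node) steps, and the function name `node_overlap` is hypothetical. It reports the fraction of the true alignment's bases that fall on nodes the new alignment also visits.

```python
from typing import List, Tuple

# Assumption: an alignment is summarized as (node_id, bases_on_node) steps.
# This is a toy representation, not vg's GAM format.
Path = List[Tuple[int, int]]


def node_overlap(true_path: Path, mapped_path: Path) -> float:
    """Fraction of the true alignment's bases on nodes the mapped path also visits."""
    mapped_nodes = {node_id for node_id, _ in mapped_path}
    total = sum(length for _, length in true_path)
    if total == 0:
        return 0.0  # an empty true alignment has nothing to overlap
    shared = sum(length for node_id, length in true_path
                 if node_id in mapped_nodes)
    return shared / total


# The simulated read truly passes through nodes 1, 3, 4;
# the mapper placed it on nodes 1, 2, 4 (wrong allele in the middle).
true_path = [(1, 40), (3, 25), (4, 35)]
mapped_path = [(1, 40), (2, 25), (4, 35)]
print(node_overlap(true_path, mapped_path))  # 0.75
```

Averaging a score like this over many simulated reads, at the error rate you expect, gives one number to track while changing mapper parameters.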

vg map is a kind of generalization of bwa mem to sequence graphs. We implement many similar algorithms and consequently expose similar heuristic parameters, such as a minimum match length, a minimum cluster size, and a maximum hit count for the MEMs. I imagine we can explore using the settings that bwa mem suggests for PacBio as a starting point for getting long noisy read alignment right.
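The three heuristics named above can be illustrated with a toy Python sketch. This is an assumption-laden illustration, not the code in vg or bwa mem: the `Mem` record, the linear `ref_pos` coordinate, the `max_gap` parameter, and the choice to measure cluster size as a count of MEMs are all hypothetical simplifications (a real graph mapper clusters on graph positions, not a single linear coordinate).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Mem:
    """Toy maximal exact match record (hypothetical, not vg's type)."""
    read_start: int  # offset of the match in the read
    length: int      # match length in bases
    hit_count: int   # how many times the match occurs in the index
    ref_pos: int     # linearized position of one hit (simplification)


def filter_mems(mems: List[Mem], min_match_length: int,
                max_hit_count: int) -> List[Mem]:
    """Drop MEMs that are too short or too repetitive to be informative."""
    return [m for m in mems
            if m.length >= min_match_length and m.hit_count <= max_hit_count]


def cluster_mems(mems: List[Mem], max_gap: int,
                 min_cluster_size: int) -> List[List[Mem]]:
    """Group MEMs whose positions are within max_gap; keep large clusters."""
    clusters: List[List[Mem]] = []
    current: List[Mem] = []
    for mem in sorted(mems, key=lambda m: m.ref_pos):
        if current and mem.ref_pos - current[-1].ref_pos > max_gap:
            clusters.append(current)
            current = []
        current.append(mem)
    if current:
        clusters.append(current)
    return [c for c in clusters if len(c) >= min_cluster_size]


mems = [
    Mem(0, 30, 1, 1000),
    Mem(40, 25, 2, 1040),
    Mem(80, 12, 1, 1080),     # too short
    Mem(100, 28, 500, 5000),  # too repetitive
    Mem(130, 22, 1, 9000),    # isolated, forms a cluster of one
]
kept = filter_mems(mems, min_match_length=20, max_hit_count=100)
clusters = cluster_mems(kept, max_gap=200, min_cluster_size=2)
print(len(kept), len(clusters))  # 3 1
```

For noisy long reads, exact matches are shorter and sparser, so the minimum match length and cluster size are the obvious knobs to relax first, at the cost of more spurious seeds.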

Beyond this, there may be issues specific to aligning long reads to assembly graphs. Some of the positional clustering heuristics in the mapper may need to change, or be made configurable.

ekg commented 6 years ago

@photocyte I have tested this on real data. I have noisy reads and a string graph that's been built from them, and I can map them back with the expected level of divergence, as measured by the alignment score/identity. The alignment speed is not good, but from what I can tell the quality of the alignment is. I have a medium-term project to improve long read alignment (and generally greatly improve alignment speed in vg), so hopefully that will improve things.