vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/
Other
1.11k stars 194 forks source link

Unexpected duplicate regions on the graph #2147

Open kevsilva opened 5 years ago

kevsilva commented 5 years ago

Hi,

I am using vg to create a variation graph from 2 E. coli genomes : O104 (NC_018658.1) and IAI39 (NC_011750.1). To validate the graph (and also practice the use of vg), I generated reads (with my own script, not vg sim) without errors from O104 and mapped them on the graph.

One of my reads ("AAAAAAGCCGGCCGTTGGGCCGGCTCTTTTACTTACATCACCAGACGGCGAATGTCAGACAGCGTGTTGTTAATGATGGTAATGAAACGGGCACCATCAGCACCGTCGATCACGCGGTGGTCGAAGGAGAGAGAAATCGGCAGCATCAGACGCGGCACGAACTCTTTACCATTCCACACCGGCTCCATCGCGGACTTGGA") matched perfectly on two regions of the graph. By looking more closely onto these two regions extremities, it seems like it is exactly the same region. I was then expecting that the two genomes would align there and that the graph would represent this region as a single region on the graph and not duplicate it like there was no alignment.

Am I missing something? or is this an actual issue? I feel like part of the issue is because of the length of the genomes, since I tried to create the same graph but with truncated versions of the genomes and here I had only one match crossing the two "paths" (for this specific read, I did not check for the other reads and other potential unexpected duplications).

At first, I used a very simple code:

vg msga -f ecoli_O104_IAI39.fa -a > O104_IAI39_test.vg
vg index -x O104_IAI39_test.xg O104_IAI39_test.vg
vg prune O104_IAI39_test.vg > O104_IAI39_test.pruned.vg
vg index -g O104_IAI39_test.gcsa O104_IAI39_test.pruned.vg
vg map -D -T myread.fasta -x O104_IAI39_test.xg -g O104_IAI39_test.gcsa -M 10 > O104_IAI39_myread.gam

Then I tried to change several parameters of msga (-k, -w, -B, and the local alignment scores), but this read still mapped twice.

Thanks in advance for your help.

ekg commented 5 years ago

It sounds to me that the graph built with msga doesn't have the right structure. This could mean that the constructive alignment is failing over some part of the graph. It's good to get a picture of what's going on. You can check the structure of the graph in two ways. For such small graphs, it is easy enough to use Bandage. You can also run vg dotplot on the graph and then plot the result using a stats package or rendering process like datashader. It's also possible to run vg viz.

I would definitely try to build a circular graph for this genome. You can tell msga to do this:

-> % vg msga -h 2>&1 | grep -i circular
    -Z, --circularize       the input sequences are from circular genomes, circularize them after inclusion

The overall structure of the alignment is inferred by a linear process (a subalignment chain DAG) and this may still result in misalignment. But you should definitely be building a circular graph.

Lately, I have been working a lot on the particular problem of building graphs from whole genome sequences. This has resulted in several new tools. The one that is most applicable here is the alignment to variation graph inducer seqwish. The idea with seqwish is to convert a set of alignments directly to a graph without any filtering or other manipulation of the alignment set. The method uses disk-backed sorted arrays to help it scale to very large problem sizes. You can try this to try/configure a different alignment process and see if that helps.

To test this out, replace your msga call with a minimap2/seqwish process like this:

# run an alignment ... you may need to change these parameters
minimap2 ecoli_O104_IAI39.fa  ecoli_O104_IAI39.fa -c -x asm20 >ecoli_O104_IAI39.paf    
# generate the variation graph implied by the alignments
seqwish -s ecoli_O104_IAI39.fa -p ecoli_O104_IAI39.paf -b ecoli_O104_IAI39.paf -g ecoli_O104_IAI39.gfa
# convert the graph to .vg format
vg view -Fv ecoli_O104_IAI39.gfa >ecoli_O104_IAI39.vg 

As before, you'll probably want to try to visualize the graph to understand what structure it made, and if that matches your expectations.

Let us know what worked, and what you find was causing this problem in your case. It'll be helpful to others. If you can't resolve this problem, I'm happy to check out a test case if you're able to share it. It may indicate a bug in the alignment process.

kevsilva commented 5 years ago

Thanks for the answer. Sadly I could not make it work, except for the circularization.

Bandage: My computer crash when I try to load the graph (it works fine with the example so I am assuming it is because my graph is too large).

vg dotplot: I get an error.

terminate called after throwing an instance of 'xg::XGFormatError'
  what():  Unimplemented XG format version: 0
ERROR: Signal 6 occurred. VG has crashed. Run 'vg bugs --new' to report a bug.
Stack trace path: /tmp/vg_crash_DkJ7X6/stacktrace.txt

vg viz: I cannot open the created svg file. Using chromium I get the following error. error on line 121 at column 18: Extra content at the end of the document

Circularization: Done. However there is no difference (considering the mapping of my specific read, I cannot tell about the general structure since I could not make the visualization work).

seqwish: I could not install it. I get the following error. /usr/bin/ld: cannot find -lz Despite having libz installed. If it can help for further answers, I am in fact not sure I will work with complete genomes, most likely just with reads/contigs/long-reads. I used complete genomes thinking it would be easier and to be sure to have a complete graph that I know well. Now I am wondering if I shouldn't simulate contigs of these genomes and construct the graph with these? However I am still curious about this issue, to understand where does it come from and be assured that I will not experience it again even with reads/contigs/long-reads in the future.

Again thanks for your time. As a test case you can use exactly the one I present, the 2 E. coli complete genomes available on the NCBI which accession numbers are NC_018658.1 and NC_011750.1, and the read I mentioned in my first message.

adamnovak commented 5 years ago

I've made a PR that should fix the issue you found with dotplot: #2152

I'm not sure what's wrong with vg viz.

For getting seqwish working, do you have the development files for zlib (what would be libz-dev on Ubuntu)? We're probably looking for libz.a and not just libz.so.

kevsilva commented 5 years ago

Indeed vg dotplot now works, thank you. I am on Fedora and libz is installed, I will check in more details...

Eventually Bandage works half the time so I could draw my graph. It looked a little bit like a flower with a part with many loops and another part which seemed completely linear, which can be coherent with a missed alignment and why I have a duplicate region.

In the mean time, I decided to split the 2 complete genomes into 3 overlapping parts and run msga on that. It seems like I managed to by-pass my issue, at least for my specific read since there is now only one alignment. And with Bandage I can see a more complex graph with loops almost everywhere. However now instead of having 2 paths, I have 6. Is there a way to merge paths?

And back to the initial issue, would you say that vg (or at least vg msga) is eventually not suited for complete genomes and seqwish should be used instead in those cases?

Thanks!