jltsiren / gcsa2

BWT-based index for graphs
MIT License
71 stars 11 forks source link

Support for non-de-Bruijn graphs? #2

Closed adamnovak closed 9 years ago

adamnovak commented 9 years ago

Looking through the code, it seems that there is only support for loading and indexing graphs composed of kmers of length 16 or less, where each kmer corresponds to a single unique node. There are some interesting tricks to pack that particular graph structure very small.

Is that the only kind of graph that the library is intended to be able to index? No mention of these constraints is made in the readme.

If not, when would there be support for graphs of more general topologies (for example, having the same 16-mer appear in 2 or more nodes)?

jltsiren commented 9 years ago

First a bit of background. Everything you index with the BWT is (almost) a de Bruijn graph. When you're indexing sequences of maximum length n, you're indexing a de Bruijn graph of order n or n+1, depending on how you handle the empty suffixes. There may multiple nodes with the same label, and various structural compression techniques can be used to merge redundant subgraphs, but fundamentally everything looks like a de Bruijn graph.

In the original GCSA, the input graph was a reverse deterministic finite automaton. Index construction started from the edges of the automaton. It joined the edges to build (implicit) de Bruijn graphs of order 1, 2, 4, 8, 16, ..., until it ended up with a graph that was both a de Bruijn graph and a finite automaton equivalent to the input automaton. Various structural compression tricks were used extensively to keep the graph small enough. Once the graph has been indexed, we can support path queries of arbitrary length in it.

In GCSA2, the input can be any labeled graph. Instead of starting with the edges, we start with paths of length k <= 16, which include enough information to reconstruct the input graph. We then use three doubling steps to increase the path length to 8k, and build a (structurally compressed) de Bruijn graph of order 8k. By indexing the de Bruijn graph, we can then support path queries of length up to 8k in the input graph.

The numbers involve various trade-offs. The larger k is, the less doubling steps we need to reach useful path lengths. Alternatively, we can support longer path queries without increasing index construction time. On the other hand, increasing k also increases the sizes of the input, the intermediate graphs, and the final index – sometimes exponentially. Extracting the input paths also requires more time with larger values of k. The limit k <= 16 is a matter of convenience, as it makes it possible to pack most of the path information in a single 64-bit integer. Increasing the number of doubling steps naturally increases the maximum query length, while also increasing memory usage and construction time.

adamnovak commented 9 years ago

So the order-k de Bruijn graph is an internal indexing construction, and the library actually supports operating on arbitrary labeled graphs (i.e. higher-order de Bruijn graphs), which it internally embeds in this order-k Bruijn graph (and associated, implicit order 8k de Bruijn graph of indexed strings)?

On Mon, Jun 22, 2015 at 3:12 PM, Jouni Siren notifications@github.com wrote:

First a bit of background. Everything you index with the BWT is (almost) a de Bruijn graph. When you're indexing sequences of maximum length n, you're indexing a de Bruijn graph of order n or n+1, depending on how you handle the empty suffixes. There may multiple nodes with the same label, and various structural compression techniques can be used to merge redundant subgraphs, but fundamentally everything looks like a de Bruijn graph.

In the original GCSA, the input graph was a reverse deterministic finite automaton. Index construction started from the edges of the automaton. It joined the edges to build (implicit) de Bruijn graphs of order 1, 2, 4, 8, 16, ..., until it ended up with a graph that was both a de Bruijn graph and a finite automaton equivalent to the input automaton. Various structural compression tricks were used extensively to keep the graph small enough. Once the graph has been indexed, we can support path queries of arbitrary length in it.

In GCSA2, the input can be any labeled graph. Instead of starting with the edges, we start with paths of length k <= 16, which include enough information to reconstruct the input graph. We then use three doubling steps to increase the path length to 8k, and build a (structurally compressed) de Bruijn graph of order 8k. By indexing the de Bruijn graph, we can then support path queries of length up to 8k in the input graph.

The numbers involve various trade-offs. The larger k is, the less doubling steps we need to reach useful path lengths. Alternatively, we can support longer path queries without increasing index construction time. On the other hand, increasing k also increases the sizes of the input, the intermediate graphs, and the final index – sometimes exponentially. Extracting the input paths also requires more time with larger values of k. The limit k <= 16 is a matter of convenience, as it makes it possible to pack most of the path information in a single 64-bit integer. Increasing the number of doubling steps naturally increases the maximum query length, while also increasing memory usage and construction time.

— Reply to this email directly or view it on GitHub https://github.com/jltsiren/gcsa2/issues/2#issuecomment-114283814.

ekg commented 9 years ago

@adamnovak We're actually indexing non de Bruijn graphs with GCSA2, so practically there is not an intrinsic limitation. What happens is that the input is de Brujin like, with the major difference being that kmers can appear multiple times. If you're interested, the input to GCSA2 can be built like this vg kmers -g -k 8 x.vg >x.gcsa2. You then build the GCSA using ./build_gcsa x. The input matches kmers to positions in an arbitrary graph, in this case a variation graph or simple string graph with variable-length nodes.

adamnovak commented 9 years ago

Hello,

Thanks for the information; I have gotten vg and gcsa2 (f472c8a28) both running, and it does indeed look like there is proper support for the sort of graphs I like.

However, the given example doesn't seem to work. I can get the kmers file x.gcsa2 from vg's test/small/x.fa, and it looks like this. But when I do build_gcsa x, it segfaults trying to index it. Here's what it looks like in gdb:

[anovak@kolossus gcsa2]$ gdb ./build_gcsa 
GNU gdb (GDB) 7.8.50.20140829-cvs
Copyright (C) 2014 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./build_gcsa...done.
(gdb) run x
Starting program: /cluster/home/anovak/build/gcsa2/build_gcsa x
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
GCSA builder

Input: x.gcsa2 (text format)
Doubling steps: 3

[New Thread 0x7ffff74eb700 (LWP 68308)]
GCSA::prefixDoubling(): Initial path length 8
  mergePaths(): 1010 -> 1010 paths
  mergePaths(): 964 unique, 46 unsorted, 0 nondeterministic paths
GCSA::prefixDoubling(): Step 1 (path length 8 -> 16)
  joinPaths(): Reserving space for 1010 paths
  joinPaths(): 1010 -> 1010 paths
  mergePaths(): 1010 -> 1010 paths
  mergePaths(): 1010 unique, 0 unsorted, 0 nondeterministic paths
GCSA::mergeByLabel(): 1010 -> 1010 paths

Program received signal SIGSEGV, Segmentation fault.
0x000000000043c9e8 in sdsl::sd_vector<sdsl::int_vector<(unsigned char)1>, sdsl::select_support_mcl<(unsigned char)1, (unsigned char)1>, sdsl::select_support_mcl<(unsigned char)0, (unsigned char)1> >::sd_vector<__gnu_cxx::__normal_iterator<unsigned long*, std::vector<unsigned long, std::allocator<unsigned long> > > >(__gnu_cxx::__normal_iterator<unsigned long*, std::vector<unsigned long, std::allocator<unsigned long> > >, __gnu_cxx::__normal_iterator<unsigned long*, std::vector<unsigned long, std::allocator<unsigned long> > >)
    ()
(gdb) bt
#0  0x000000000043c9e8 in sdsl::sd_vector<sdsl::int_vector<(unsigned char)1>, sdsl::select_support_mcl<(unsigned char)1, (unsigned char)1>, sdsl::select_support_mcl<(unsigned char)0, (unsigned char)1> >::sd_vector<__gnu_cxx::__normal_iterator<unsigned long*, std::vector<unsigned long, std::allocator<unsigned long> > > >(__gnu_cxx::__normal_iterator<unsigned long*, std::vector<unsigned long, std::allocator<unsigned long> > >, __gnu_cxx::__normal_iterator<unsigned long*, std::vector<unsigned long, std::allocator<unsigned long> > >) ()
#1  0x000000000043d04e in gcsa::ValueIndex<std::pair<unsigned long, unsigned long>, gcsa::FirstGetter>::ValueIndex(std::vector<std::pair<unsigned long, unsigned long>, std::allocator<std::pair<unsigned long, unsigned long> > > const&) ()
#2  0x0000000000427d76 in gcsa::GCSA::sample(std::vector<gcsa::PathNode, std::allocator<gcsa::PathNode> >&, std::vector<std::pair<unsigned long, unsigned long>, std::allocator<std::pair<unsigned long, unsigned long> > >&) ()
#3  0x000000000042acea in gcsa::GCSA::GCSA(std::vector<gcsa::KMer, std::allocator<gcsa::KMer> >&, unsigned long, unsigned long, gcsa::Alphabet const&) ()
#4  0x0000000000408d78 in main ()
(gdb)

I've noticed that for some other kmers files, I can do the indexing with 1 or 2 doubling steps, but not 3.

Am I doing something wrong, or should I open a different issue for this problem?

jltsiren commented 9 years ago

In this case, you have somehow managed to extract a single sequence (a linear graph) from vg. Because there are no long repeats in the sequence, all paths have unique labels after a single doubling step. As a result, we don't have to store multiple id:offset samples for any sampled path. gcsa::ValueIndex apparently can't index an empty vector, because the construction of sdsl::sd_vector from an empty iterator range seems to crash.

The main issue is with vg, but there is also a minor issue in gcsa2, and possibly another minor issue in sdsl.

adamnovak commented 9 years ago

OK, I will try running it on some more complex test cases and/or have a go at empty-proofing it.

I made a linear "graph" with vg intentionally, because it's the simplest test case I could come up with; I would say vg is working correctly here.

On Thu, Jul 2, 2015 at 3:27 PM, Jouni Siren notifications@github.com wrote:

In this case, you have somehow managed to extract a single sequence (a linear graph) from vg. Because there are no long repeats in the sequence, all paths have unique labels after a single doubling step. As a result, we don't have to store multiple id:offset samples for any sampled path. gcsa::ValueIndex apparently can't index an empty vector, because the construction of sdsl::sd_vector from an empty iterator range seems to crash.

The main issue is with vg, but there is also a minor issue in gcsa2, and possibly another minor issue in sdsl.

— Reply to this email directly or view it on GitHub https://github.com/jltsiren/gcsa2/issues/2#issuecomment-118184523.

ekg commented 9 years ago

This is a nice test. I'm actually surprised it doesn't work. I guess this (paths all have unique labels) is an untested edge? On Jul 3, 2015 12:15 AM, "adamnovak" notifications@github.com wrote:

OK, I will try running it on some more complex test cases and/or have a go at empty-proofing it.

I made a linear "graph" with vg intentionally, because it's the simplest test case I could come up with; I would say vg is working correctly here.

On Thu, Jul 2, 2015 at 3:27 PM, Jouni Siren notifications@github.com wrote:

In this case, you have somehow managed to extract a single sequence (a linear graph) from vg. Because there are no long repeats in the sequence, all paths have unique labels after a single doubling step. As a result, we don't have to store multiple id:offset samples for any sampled path. gcsa::ValueIndex apparently can't index an empty vector, because the construction of sdsl::sd_vector from an empty iterator range seems to crash.

The main issue is with vg, but there is also a minor issue in gcsa2, and possibly another minor issue in sdsl.

— Reply to this email directly or view it on GitHub https://github.com/jltsiren/gcsa2/issues/2#issuecomment-118184523.

— Reply to this email directly or view it on GitHub https://github.com/jltsiren/gcsa2/issues/2#issuecomment-118190341.

jltsiren commented 9 years ago

Simple inputs are often the hardest to handle, because they contain special cases that don't happen with more realistic inputs. Most data structure implementations probably fail with an empty dataset. (And including runs of Ns in an assembled genome breaks absolutely everything.)