lh3 / gfatools

Tools for manipulating sequence graphs in the GFA and rGFA formats

Standalone libraries to work with GFA #3

Open lh3 opened 5 years ago

lh3 commented 5 years ago

Another discussion thread. It is probably too early to implement libraries now, but it would be good to start thinking about the topic.

Currently, gfatools comes with very preliminary APIs to read rGFA into memory. The memory layout is described in gfa.h. It largely follows the model of string graphs. I quite like this model and will stick with it. However, I guess general devs will feel uncomfortable with this representation. I won't have the bandwidth to implement the more general path model any time soon, either. In addition, it is preferable to have two independent implementations (e.g. samtools vs picard vs bamtools). I wonder if you (@ekg and @benedictpaten) are interested in implementing a standalone library to work with GFA. You already have a GFA parser, an in-memory model and a serialization format in vg. You could isolate the relevant code and expose stable C and C++ APIs to other devs. I know vg has APIs, but I guess other devs will prefer a more focused, lightweight library that is easier to build.
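To make the string-graph-style model more concrete, here is a minimal Python sketch. It is not the layout in gfa.h: the function name `parse_gfa` and the encoding (vertex = segment index shifted left by one, plus an orientation bit) are assumptions for illustration. Each segment contributes two oriented vertices, and each link is stored together with its complement arc.

```python
def parse_gfa(text):
    """Parse S and L lines of GFA text into oriented vertices and arcs.

    Hypothetical helper for illustration only; tags, overlaps and all
    other record types are ignored.
    """
    names = []   # segment index -> segment name
    seqs = []    # segment index -> sequence
    index = {}   # segment name -> segment index
    links = []   # raw (from, from_orient, to, to_orient) records
    for line in text.splitlines():
        fields = line.split("\t")
        if fields[0] == "S":
            index[fields[1]] = len(names)
            names.append(fields[1])
            seqs.append(fields[2])
        elif fields[0] == "L":
            links.append((fields[1], fields[2], fields[3], fields[4]))

    arcs = set()
    for src, src_orient, dst, dst_orient in links:
        # Assumed encoding: low bit 0 = forward strand, 1 = reverse strand.
        v = (index[src] << 1) | int(src_orient == "-")
        w = (index[dst] << 1) | int(dst_orient == "-")
        arcs.add((v, w))
        # A link v -> w implies the complement arc w' -> v',
        # where x' is x with the orientation bit flipped.
        arcs.add((w ^ 1, v ^ 1))
    return names, seqs, sorted(arcs)


example = "S\ts1\tACGT\nS\ts2\tGG\nL\ts1\t+\ts2\t-\t0M\n"
names, seqs, arcs = parse_gfa(example)
print(arcs)  # [(0, 3), (2, 1)]
```

Links are resolved only after all segments are read, so the sketch does not depend on S lines preceding L lines.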

ekg commented 5 years ago

@lh3, we have developed a library to provide a standard interface to sequence graphs with embedded paths, https://github.com/vgteam/libhandlegraph.

The idea with this interface hierarchy is to expose something based on a few primitive types without needing to implement the data structure using those types. For instance, we often represent graphs using fully succinct data structures, but this means that entities in the graph can't be represented as pointers to nodes or as atomic IDs. The handle concept refers to the bidirectional identifier used by a particular implementation to refer to a node (S line) in the graph.
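As an illustration of the handle idea, one simple encoding (an assumption here, not necessarily what any particular libhandlegraph implementation does) packs a node ID and an orientation bit into a single integer. Callers treat the value as opaque and only go through accessor functions; the function names below are chosen for this sketch.

```python
def pack_handle(node_id, is_reverse):
    """Build an opaque handle from a node ID and an orientation."""
    return (node_id << 1) | int(is_reverse)


def get_id(handle):
    """Recover the node ID from a handle."""
    return handle >> 1


def get_is_reverse(handle):
    """Return True if the handle refers to the reverse strand."""
    return bool(handle & 1)


def flip(handle):
    """Return the handle for the same node in the opposite orientation."""
    return handle ^ 1


h = pack_handle(7, True)
print(get_id(h), get_is_reverse(h), get_is_reverse(flip(h)))  # 7 True False
```

Because callers never inspect the integer directly, a backend is free to swap in a different internal representation, such as a rank into a succinct structure.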

The class hierarchy includes immutable sequence graphs, graphs with paths (VG model), and mutable versions of them. It also exposes a positional index based on the embedded paths.

Two implementations are based on reading GFA files into a self index and exposing aspects of this API on top of them (xg and odgi). We have a study in progress to compare implementations.

It should be easy enough to add a simpler fixed C and C++ interface on top of these. I don't think the semantics become radically different. There is a mismatch with the number of coordinate spaces. There are some semantic mismatches with rGFA, but they can be resolved.

lh3 commented 5 years ago

An important question is about the scope of the library. vg is too large. I think in its current form, libhandlegraph is too small. My preference is to include at least a GFA parser and an in-memory data structure like handle graph. I don't have a strong opinion on serialization, indexing and other things.

Another question is about the terminology. The use of "(sequence) segment" and "link" can be traced back to the discussion on the FASTG format. Richard and I wanted to avoid "vertex", "node", "edge" and "arc" because in the assembly world, people always have different opinions. In a de Bruijn graph, "vertex" and "edge" are interchangeable to some extent, and as a result, a graph simplified from a de Bruijn graph is more often represented in the "edge way", with sequences put on edges instead of nodes. Adopting the GFA terminology will help to avoid such confusion.
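The node/edge ambiguity in de Bruijn graphs can be seen in a small Python sketch. The function names are hypothetical, and the sketch handles a single sequence without collapsing repeated k-mers. The same k-mers appear as nodes in one view and as sequence-carrying edges between (k-1)-mers in the other.

```python
def kmers(seq, k):
    """List the k-mers of seq in order of occurrence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def node_centric(seq, k):
    """k-mers are nodes; an edge joins k-mers that are adjacent in seq."""
    nodes = kmers(seq, k)
    edges = list(zip(nodes, nodes[1:]))
    return nodes, edges


def edge_centric(seq, k):
    """(k-1)-mers are nodes; each k-mer is an edge carrying the sequence."""
    return [(kmer[:-1], kmer[1:], kmer) for kmer in kmers(seq, k)]


print(node_centric("ACGTA", 3))
# (['ACG', 'CGT', 'GTA'], [('ACG', 'CGT'), ('CGT', 'GTA')])
print(edge_centric("ACGTA", 3))
# [('AC', 'CG', 'ACG'), ('CG', 'GT', 'CGT'), ('GT', 'TA', 'GTA')]
```

"Segment" and "link" sidestep the question of which of these two views a tool uses internally.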

ekg commented 5 years ago

For clarity, we are rewriting all of vg to be based around libhandlegraph. Version 2 will arrive when this transition is done.

I think we should consider extending the HandleGraph interfaces to match what you are thinking of. Then we can peg a C interface to it. The benefit is that the backend piece that stores and allows manipulation of the graph can be changed. We have the impression that there is not one best solution on this side, but it has helped a lot to specify a small API to these graphs.
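A rough Python sketch of the "small API, swappable backend" idea follows. The class and method names (`SequenceGraph`, `DictGraph`, `ArrayGraph`, `spell_first_walk`) are invented for illustration and are not the libhandlegraph interface; orientation is left out for brevity. The generic function only uses the abstract interface, so it works unchanged on either backend.

```python
from abc import ABC, abstractmethod
from bisect import bisect_left


class SequenceGraph(ABC):
    """Hypothetical minimal interface; handles are opaque to callers."""

    @abstractmethod
    def get_handle(self, name): ...

    @abstractmethod
    def get_sequence(self, handle): ...

    @abstractmethod
    def successors(self, handle): ...


class DictGraph(SequenceGraph):
    """Backend 1: handles are the segment names themselves."""

    def __init__(self, seqs, edges):
        self._seqs = dict(seqs)
        self._out = {}
        for a, b in edges:
            self._out.setdefault(a, []).append(b)

    def get_handle(self, name):
        return name

    def get_sequence(self, handle):
        return self._seqs[handle]

    def successors(self, handle):
        return sorted(self._out.get(handle, []))


class ArrayGraph(SequenceGraph):
    """Backend 2: handles are integer ranks into sorted arrays."""

    def __init__(self, seqs, edges):
        self._names = sorted(seqs)
        self._rank = {n: i for i, n in enumerate(self._names)}
        self._seqs = [seqs[n] for n in self._names]
        self._edges = sorted((self._rank[a], self._rank[b]) for a, b in edges)

    def get_handle(self, name):
        return self._rank[name]

    def get_sequence(self, handle):
        return self._seqs[handle]

    def successors(self, handle):
        # Binary search for the block of edges leaving this handle.
        lo = bisect_left(self._edges, (handle, 0))
        hi = bisect_left(self._edges, (handle + 1, 0))
        return [b for _, b in self._edges[lo:hi]]


def spell_first_walk(graph, start_name, max_steps):
    """Follow the first successor repeatedly and spell the sequence."""
    handle = graph.get_handle(start_name)
    out = [graph.get_sequence(handle)]
    for _ in range(max_steps):
        nxt = graph.successors(handle)
        if not nxt:
            break
        handle = nxt[0]
        out.append(graph.get_sequence(handle))
    return "".join(out)


seqs = {"a": "AC", "b": "G", "c": "TT"}
edges = [("a", "b"), ("b", "c")]
print(spell_first_walk(DictGraph(seqs, edges), "a", 5))   # ACGTT
print(spell_first_walk(ArrayGraph(seqs, edges), "a", 5))  # ACGTT
```

A C interface could, in the same spirit, expose only the handle type and these few functions while leaving the storage choice to the backend.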

Libhandlegraph is missing anything to do with alignment. It might make sense to mix this in somehow. In vg we had other primitives but we shouldn't be stuck on them.

(In reply to Heng Li's comment of Thu, Jul 18, 2019, 19:11, above.)

bricoletc commented 2 years ago

How about https://github.com/edawson/gfakluge? Though I don't think it supports rGFA.