[Alignment] Multiple Sequence Alignment (MSA) and Alignment Graph

cartoonist commented 4 years ago

Question

I have just started using seqan3. I have worked with seqan2 though. I want to ask about multiple sequence alignment and alignment graphs in seqan3. I could not find anything about MSA in the documentation. Are they missing in the documentation or they are not implemented yet?

marehr commented 4 years ago

Hello @cartoonist!

We unfortunately haven't implemented a multiple sequence alignment, yet. But, we already had some discussions how to implement it in seqan3 and we definitely want to implement this feature (we also have some projects that need them). Currently, we are focusing on making the features that seqan3 already offers stable, e.g. the search and pairwise alignment.

Since you already have experience with seqan2, maybe you could help us design this feature by answering these questions:

What use-case do you have and how did you implement it with seqan2? I think we had several ways how to use the MSA so it would be interesting to see what you need!
Do you have publicly available code that we can look at? Or maybe you can provide some snippets?
How did you call the MSA algorithm and was there something that you thought was exceedingly hard to understand, or unnecessary verbose, or hard to use (code-wise with regards to user friendliness)?
How do you think should the interface look like? Maybe you have an idea of a slick and easy interface. Maybe you already peaked into seqan3 and have an idea how you expect it to look like. Maybe some expectation from other libraries or programming languages, like bio-python?

Thank you!

cartoonist commented 4 years ago

Thanks for your prompt reply.

I am currently looking for a way to get a "partial-order graph" representing MSA of a set of sequences. I came across seqan2 implementation of T-Coffee algorithm resulting an AlignmentGraph which is what I need. I was curious to see if seqan3 also provides it or not.

For implementing it in seqan2, I mostly followed the documentation to call globalMsaAlignment over an alignment graph which was initialised by a string set and I found it quite seqanic (I mean if one has worked with seqan2, it was quite straightforward and consistent with other parts of the library):

  using string_type = seqan::String< seqan::AminoAcid >;
  using stringset_type = seqan::StringSet< string_type >;
  using dependentset_type = seqan::StringSet< string_type, seqan::Dependent<> >;
  using alignment_type = seqan::Alignment< dependentset_type, void >;
  using graph_type = seqan::Graph< alignment_type >;

  auto scoring_scheme = seqan::Blosum62( -1, -11 );

  stringset_type sequences;
  //dependentset_type dsequences;
  while ( fasta_file >> record ) {
    appendValue( sequences, record.seq );
    //appendValue( dsequences, back( sequences ) );
  }

  graph_type align_graph( sequences );
  //graph_type align_graph( dsequences );

  globalMsaAlignment( align_graph, scoring_scheme );

The only part that I found user-unfriendly is that the graph type is only defined for the seqan::Dependent<> string set. As far as I understood, I have to store the sequences somewhere (here in a seqan::Owner<> string set sequences) before I could keep their pointers into another dependent string set. In the above snippet, I just passed an owner string set to the graph whereas the graph type defined for dependent string set type. I am not sure if it is a good practice or error-prone. The alternative is appending sequences to two string sets -- one owner and one dependent (commented lines in above example) which seems unintuitive to me.

Since I don't know about implementation details, I might be mistaken. But what I expect here is that the graph type should be defined for all string set types. In order to avoid unnecessary copy, it can internally keep a dependent string set initialised by the given string set which can be either a dependent or an owner.

Regarding the last point, I realised that seqan3 is quite different in design. Since I have not worked with it, I don't have an idea how the interface should look like in seqan3. However, broadly speaking, I prefer those APIs close to the standard library, such as sdsl-lite. For example, heavy use of interface functions makes the library hard to navigate and, consequently, hard to extend because one has to look for the overloaded functions through lots of header files to figure out what is happening under the hood. I personally prefer member functions for basic operations such as container.size() over length(container) and interface function, e.g. dfs(graph, ...), for higher level operations. Indeed, this is based on my experience with seqan2, not 3.

I hope I could answer your points.

marehr commented 4 years ago

Thank you for the snippet, can you show me how you process the constructed align_graph? And how you use the partial-ordering? Do you need the real graph for your use case, or would a certain order of the output suffice?

Just some answers from the points that I skimmed over:

We got rid of a lot of those copy avoiding structures by using ranges; a new standard library feature since C++20 (https://en.cppreference.com/w/cpp/ranges)
Free functions (or function objects) are unfortunately the current direction the standard is going. We made sure that most of our new data types and data structures have member functions, but to allow interoperability with builtin or custom types like char, int and so on, we provide most operations via customization point objects (CPOs), too, which are a new kind of (standardized) function objects since C++20 https://eel.is/c++draft/customization.point.object.
One of our main goals for seqan3 was to be much more close to the standard. Most of our containers are STL-like and thus self-contained in one header file / class.
However, we still kept the semantics that an algorithm is a function invocation, like the STL does.

marehr commented 4 years ago

Hello @cartoonist,

good news! We started to work on a design proposal for multiple sequence alignment (https://github.com/seqan/product_backlog/issues/104). Since we would like to include your use case, it would be awesome if you could answer my last questions:

Thank you for the snippet, can you show me how you process the constructed align_graph?
And how do you use the partial-ordering?
Do you need the real graph for your use case, or would a certain order of the output suffice?

Best regards Marcel

marehr commented 4 years ago

Hi @cartoonist,

it would be awesome if you could take a couple of minutes and answer my questions.

Thank you!

cartoonist commented 4 years ago

Hello @marehr,

Really sorry for my late response! I wanted to get back to you with a concrete example/snippet. But for now, please let me update you with my current objective/use case, later I will post some examples or snippets if needed.

I would like to create a pan-gnome graph from a multiple sequence alignment without flattening the MSA; e.g. dumping it to a FASTA file. This graph itself can be used independently or can be embedded in a larger graph (e.g. a whole genome graph).

I still could not figure out how to construct a partial order graph using in seqan2. I did try to visualise the alignment graph constructed by seqan. It seems that nodes corresponding to alignment segments are connect together and the resulting graph is a forest of complete graphs. That was different from what I expected as a partial order graph. Right now, as temporary workaround, I just dump the MSA in FASTA format then use an external Python script converting FASTA to GFA.

Assume that I have a partial order graph, what I need is, external or internal (for_each_*), iterators over nodes, edges, and (ideally) paths (original sequences) in the graph.

P.S. Thanks for the pointers to seqan3 in your previous reply.

marehr commented 4 years ago

Awesome!

I still could not figure out how to construct a partial order graph using in seqan2. I did try to visualise the alignment graph constructed by seqan. It seems that nodes corresponding to alignment segments are connect together and the resulting graph is a forest of complete graphs.

We have those two images in our seqan2 docs

and

The title says: An alignment graph with 3 sequences.

From what I understood the "alignment graph" stores all shared text segments (according to the alignment) in a node and has multiple edges to nodes (It compresses nodes of characters into nodes of sub-sequences). I never worked with it, so I'm not to sure of the internal representation. Does this resemble your partial order graph, or do you have some drawing of 2 / 3 sequences that depicts your use case?

Assume that I have a partial order graph, what I need is, external or internal (foreach*), iterators over nodes, edges, and (ideally) paths (original sequences) in the graph.

You say "ideally", does this mean that all enumerated sub-paths would suffice for you, or do you want that as a convenience function that gives you the sub-sequence up until that node?

rrahn commented 4 years ago

AFAIK, the graph is a partially ordered with respect to a sequence. This is resembled by the directed edges connecting nodes one one level (representing one sequence). In addition, you have the undirected edges connecting nodes (sequence segments) between the sequences (IIRC than PO graph would make them to one node, right?). Our graph is kind of a generalisation where one node represents exactly one character. Our SeqAn2 graph implementation offers all kind of traversals. So I think it should be possible to do what you want. But I have to be honest, I am also not familiar with the interfaces and would need to dig deeper to figure out how you can traverse along specific paths. Here is a tutorial that might offer some help with this for now. https://seqan.readthedocs.io/en/develop/Tutorial/DataStructures/Graphs.html#graph-iterators. Another interesting topic that might be relevant is that there was a publication that describes how the alignment graph can be converted into another pan-genome graph representation. https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-15-99 Not sure if you knew this already, but maybe that also offers some more information. Then we could use these insights to derive the design for the new implementation. I think, pan-genome analysis together with our work on compressive genomics will be quite the nice addition to our library.

rlorigro commented 3 years ago

Hi, I am also interested in Partial Order Alignment, or just progressive sequence to graph alignment. For the current version of Seqan3, is the path information stored for each sequence aligned to the graph?

Of course it is possible to traverse the graph and chose any arbitrary path that matches the sequence ad hoc, but because POA is order dependent (aka progressive or "greedy"), there can be cases where the addition of branches in the graph can change successive alignments of the same sequence. This means the ad hoc method is not sufficient to find the actual path an alignment took.

Alternatively, if any MSA algorithm can be converted to a graph with path information, that would be nice too.

smehringer commented 2 years ago

We are also tracking MSA design decisions here: https://github.com/seqan/product_backlog/issues/104

seqan / seqan3

[Alignment] Multiple Sequence Alignment (MSA) and Alignment Graph #1763

Question