[MSA] Multiple Sequence Alignment using Tcoffee

smehringer commented 4 years ago

Description

We want to port the SeqAn2 code for computing a multiple sequence alignment (MSA) to SeqAn3.

Based on the structures in seqan2, this could be the structure for seqan3:

[Image: MSA2]

Easy user interface:

    std::vector<dna4_vector> sequences{...};
    auto result = seqan3::align_multiple(sequences, config); // result = range of aligned sequences
    // print result to .aln

To avoid a huge configuration object, the easy user interface should come with a lot of defaults. If the user wants to use a highly customizable MSA, they need to do the steps themselves.

Advanced user interface:

    std::vector<dna4_vector> sequences{...};

    // this is also basically what the body of align_multiple will do
    // -------------------------------------------------------------------------------
    seqan3::fragment_library library{};
    seqan3::symmetric_matrix distances{0.0};

    // "segment match generation", e.g. global (also add local the same way)
    for (auto res : seqan3::align_pairwise(seqan3::views::pairwise_combine(sequences)))
    {
        library.add(res);
        distances[res.seq1_id(), res.seq2_id()] = res.score();
    }

    auto graph = seqan3::refine_and_build_graph(library); // one function in seqan2, maybe two for seqan3?

    seqan3::extend_triplets(graph); // refine graph

    auto tree = seqan3::upgma(distances);
    auto alignment_graph = seqan3::graph_align_progressively(graph, tree);    
    // -------------------------------------------------------------------------------
    auto aligned_sequences = seqan3::to_aligned_sequences(alignment_graph); // ?

Alternative with the graph as a class instead of free functions:

    std::vector<dna4_vector> sequences{...};

    // this is also basically what the body of align_multiple will do
    // -------------------------------------------------------------------------------
    seqan3::fragment_library library{};
    seqan3::symmetric_matrix distances{0.0};

    // "segment match generation", e.g. global (also add local the same way)
    for (auto res : seqan3::align_pairwise(seqan3::views::pairwise_combine(sequences)))
    {
        library.add(res);
        distances[res.seq1_id(), res.seq2_id()] = res.score();
    }

    seqan3::some_graph graph{library}; // one function in seqan2, maybe two for seqan3?
    graph.extend_triplets(); // refine graph, may be done directly in constructor

    auto tree = seqan3::upgma(distances);
    auto alignment_graph = seqan3::graph_align_progressively(graph, tree);    
    // -------------------------------------------------------------------------------
    auto aligned_sequences = seqan3::to_aligned_sequences(alignment_graph); // ?

Dependencies

These need to be present in seqan3 before T-Coffee can be implemented:

[ ] lemon as a graph library
[x] global seqan3::align_pairwise with traceback
[ ] local (Waterman-Eggert) seqan3::align_pairwise
[ ] Alignment I/O for alignments of more than 2 sequences (clustal format, aligned fasta)
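To illustrate what output for more than two sequences could look like, here is a minimal sketch of an aligned FASTA writer. The function name `write_aligned_fasta` and the row type (pairs of id and gapped sequence as plain strings) are assumptions for this example, not existing seqan3 API.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: writes every (id, gapped sequence) pair as one FASTA record.
// All gapped sequences of an MSA are expected to have the same length.
void write_aligned_fasta(std::ostream & out,
                         std::vector<std::pair<std::string, std::string>> const & rows)
{
    for (auto const & row : rows)
        out << '>' << row.first << '\n' << row.second << '\n';
}
```

Called with a `std::ostringstream` or `std::ofstream`, this writes one record per aligned sequence, with gaps kept as `-` characters.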

New components that are independent of the actual T-Coffee code:

[ ] seqan3::symmetric_matrix (stores only upper/lower triangle of the matrix as a string and maps indices to correct position on access)
[ ] seqan3::upgma (see Wikipedia). Input: symmetric_matrix of distances. Output: guide tree (either a lemon graph or just a std::set? needs thinking). The output tree needs to be traversable in reverse BFS order; lemon ListDigraph supports BFS, so it should support the reverse (see lemon API). At each node it needs to tell us which two sequences (sets) to align next.
[ ] seqan3::fragment_library. Basically a string or set of fragments/segments, where a fragment/segment is defined as <seq1_id, seq2_id, seq1_begin_pos, seq2_begin_pos, len, score> (struct or tuple). It should have a member function that adds fragments based on a pairwise alignment result, and it should be iterable (iterate over fragments). TODO: Check what other requirements the list of fragments must meet. In seqan2 it is just stored as String<Fragment<>>, but maybe we can make this more efficient by storing an unordered_map.
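A minimal sketch of how the triangle storage and the guide tree construction could fit together. Everything here is an assumption for illustration: the class `symmetric_matrix`, the struct `tree_node` and the function `upgma` are stand-ins, not existing seqan3 API, and the tree is a flat `std::vector` instead of a lemon graph. The sketch also assumes real distances (smaller means closer), whereas the loops above store alignment scores.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical symmetric matrix: only the lower triangle (with diagonal) is stored.
class symmetric_matrix
{
public:
    symmetric_matrix(std::size_t n, double init) : dim{n}, data(n * (n + 1) / 2, init) {}

    std::size_t size() const { return dim; }

    // (i, j) and (j, i) refer to the same cell.
    double & operator()(std::size_t i, std::size_t j) { return data[index(i, j)]; }
    double operator()(std::size_t i, std::size_t j) const { return data[index(i, j)]; }

private:
    static std::size_t index(std::size_t i, std::size_t j)
    {
        if (i < j)
            std::swap(i, j);
        return i * (i + 1) / 2 + j; // row i of the lower triangle starts at i * (i + 1) / 2
    }

    std::size_t dim;
    std::vector<double> data;
};

// Guide tree node; leaves have left == right == -1.
struct tree_node
{
    int left;
    int right;
    std::size_t leaf_count;
};

// Plain UPGMA. Nodes 0..n-1 are the leaves, every merge appends one inner node,
// so the inner nodes appear in the order in which sequence sets get aligned.
std::vector<tree_node> upgma(symmetric_matrix const & distances)
{
    std::size_t const n = distances.size();
    std::vector<tree_node> tree(n, tree_node{-1, -1, 1});

    // Working copy with room for the n - 1 inner nodes.
    symmetric_matrix d(n == 0 ? 0 : 2 * n - 1, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j <= i; ++j)
            d(i, j) = distances(i, j);

    std::vector<std::size_t> active(n); // ids of clusters that are not merged yet
    std::iota(active.begin(), active.end(), std::size_t{0});

    while (active.size() > 1)
    {
        // Find the closest pair of active clusters.
        std::size_t best_a = 0, best_b = 1;
        double best = std::numeric_limits<double>::infinity();
        for (std::size_t a = 0; a < active.size(); ++a)
            for (std::size_t b = a + 1; b < active.size(); ++b)
                if (d(active[a], active[b]) < best)
                {
                    best = d(active[a], active[b]);
                    best_a = a;
                    best_b = b;
                }

        std::size_t const x = active[best_a], y = active[best_b];
        std::size_t const size_x = tree[x].leaf_count, size_y = tree[y].leaf_count;
        std::size_t const merged = tree.size();
        tree.push_back(tree_node{static_cast<int>(x), static_cast<int>(y), size_x + size_y});

        // Average linkage: weight the old distances by cluster size.
        for (std::size_t c : active)
            if (c != x && c != y)
                d(merged, c) = (size_x * d(x, c) + size_y * d(y, c)) / (size_x + size_y);

        // best_b > best_a, so erase the larger position first.
        active.erase(active.begin() + static_cast<std::ptrdiff_t>(best_b));
        active.erase(active.begin() + static_cast<std::ptrdiff_t>(best_a));
        active.push_back(merged);
    }
    return tree;
}
```

Because children always have smaller ids than their parent, walking the inner nodes from index n upwards visits every node after both of its children. That is the property the progressive alignment needs from the reverse BFS traversal, and `left`/`right` tell it which two sets to align at each step.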

Notes

Yes, we do want the extra data structure seqan3::fragment_library, because implementing the functionality directly within a graph would be very complex. With seqan3::fragment_library we can derive the code more easily from the existing SeqAn2 code, which will save us a lot of time. We prefer this over having one data structure less. The extra data structure also gives us two individual steps that can be tested separately (seqan2 tests). If we did it directly in the graph structure, we would need to do segment generation and segment refinement at once, which is more complex (more possible errors) and harder to test.
Before we discuss which graph structures we expose (seqan3 wrappers or just lemon graphs), we really need to see what functionality the functions extend_triplets and progressive_alignment require from the graph. Those are well over 3000 lines of code, so this will take time.
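As a rough illustration of the separate segment generation step, here is a minimal sketch of a fragment library that can be tested on its own. The names `fragment`, `fragment_library` and the signature of `add` are assumptions, and the pairwise alignment is passed as two gapped strings rather than a seqan3 alignment result. Each maximal gap-free run of columns becomes one fragment and simply inherits the score of the whole alignment, which is a simplification.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical fragment: <seq1_id, seq2_id, seq1_begin_pos, seq2_begin_pos, len, score>.
struct fragment
{
    std::size_t seq1_id;
    std::size_t seq2_id;
    std::size_t seq1_begin;
    std::size_t seq2_begin;
    std::size_t len;
    int score;
};

class fragment_library
{
public:
    // Splits a gapped pairwise alignment ('-' marks a gap) into maximal gap-free fragments.
    // Mismatch columns stay inside a fragment; only gaps end it.
    void add(std::size_t id1, std::size_t id2,
             std::string const & row1, std::string const & row2, int score)
    {
        std::size_t pos1 = 0, pos2 = 0; // positions in the ungapped sequences
        std::size_t run = 0;            // length of the current gap-free run

        auto flush = [&]()
        {
            if (run > 0)
                fragments.push_back(fragment{id1, id2, pos1 - run, pos2 - run, run, score});
            run = 0;
        };

        std::size_t const columns = std::min(row1.size(), row2.size());
        for (std::size_t col = 0; col < columns; ++col)
        {
            bool const gap1 = row1[col] == '-';
            bool const gap2 = row2[col] == '-';

            if (!gap1 && !gap2)
                ++run;
            else
                flush(); // a gap ends the current fragment

            if (!gap1)
                ++pos1;
            if (!gap2)
                ++pos2;
        }
        flush(); // the alignment may end inside a fragment
    }

    // Iteration over the stored fragments.
    std::vector<fragment>::const_iterator begin() const { return fragments.begin(); }
    std::vector<fragment>::const_iterator end() const { return fragments.end(); }
    std::size_t size() const { return fragments.size(); }

private:
    std::vector<fragment> fragments;
};
```

Keeping this step apart from the graph means it can be checked with small hand-made alignments before any refinement code exists.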

smehringer commented 4 years ago

There is also a first feature request from an external user: seqan/seqan3#1763. Based on what I read, he needs the alignment graph as output, not just the aligned sequences.

smehringer commented 2 years ago

From @joergi-w (https://github.com/seqan/seqan3/issues/2281)

Feature request

Support the full specification of the T-COFFEE_LIB_FORMAT_01 such that the t-coffee application can read in the t-coffee library format with ranges of residues. Note that at the moment only alignment information given as single residue pairs is supported, which is quite limiting, as many programs can write out the extended format version.

To the developers: In SeqAn2 there exists only a partial implementation of this format, as the ranges are not supported (BadLexicalCast error).

We received a feature request for this from a user of SeqAn2's T-Coffee app, and I'm posting it here so that it will be considered for SeqAn3.

smehringer commented 2 years ago

See also feature request https://github.com/seqan/seqan3/issues/2674
