Open smehringer opened 4 years ago
There is also a first feature request from an external user: seqan/seqan3#1763 Based on what I read, he needs the alignment graph as output, not just the aligned sequences.
From @joergi-w (https://github.com/seqan/seqan3/issues/2281)
Support the full specification of the T-COFFEE_LIB_FORMAT_01 so that the T-Coffee application can read the T-Coffee library format with ranges of residues. Note that at the moment only alignment information given as single residue pairs is supported, which is quite limiting because many programs write out the extended format version.
To the developers: In SeqAn2 there exists only a partial implementation of this format, as the ranges are not supported (BadLexicalCast error).
We received a feature request for this by a user of SeqAn2's T-Coffee app and I'm posting it here such that it will be considered for SeqAn3.
See also feature request https://github.com/seqan/seqan3/issues/2674
Description
We basically want to port the SeqAn2 code that lets us compute an MSA to SeqAn3.
Based on the structures in seqan2, this could be the structure for seqan3:
Easy user interface
To avoid a huge configuration object, the easy user interface should come with a lot of defaults. If the user wants to use a highly customizable MSA, they need to do the steps themselves.
Advanced user interface:
Dependencies
These need to be present in SeqAn3 before T-Coffee can be implemented.
New components that are independent of the actual T-Coffee code:
seqan3::symmetric_matrix
(stores only the upper/lower triangle of the matrix as a string and maps indices to the correct position on access)
seqan3::upgma
(see Wikipedia)
Input: symmetric_matrix of distances
Output: guide tree (either a lemon graph or just a std::set? needs thinking)
The output tree needs to be traversable in reverse BFS order. lemon ListDigraph supports BFS, so it should support the reverse as well (see lemon API). At each node it needs to tell us which two sequences (sets) to align next.
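To make the two items above more concrete, here is a minimal sketch of how they could fit together. Everything in it is an assumption for discussion: the names symmetric_matrix_sketch, merge_step and upgma_sketch are hypothetical and not existing SeqAn3 API, the matrix stores the lower triangle including the diagonal in a std::vector, and the guide tree is returned as a plain list of merges instead of a lemon graph. The merge list is ordered bottom-up, which is the order a progressive alignment would need.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <utility>
#include <vector>

// Hypothetical stand-in for the proposed seqan3::symmetric_matrix:
// stores only the lower triangle (diagonal included) in one flat vector.
template <typename value_t>
class symmetric_matrix_sketch
{
public:
    explicit symmetric_matrix_sketch(std::size_t const n) : dim{n}, data(n * (n + 1) / 2) {}

    // (i, j) and (j, i) map to the same stored element.
    value_t & operator()(std::size_t i, std::size_t j)
    {
        assert(i < dim && j < dim);
        if (i < j)
            std::swap(i, j);
        return data[i * (i + 1) / 2 + j];
    }

    std::size_t size() const { return dim; }

private:
    std::size_t dim;
    std::vector<value_t> data;
};

// One inner node of the guide tree: the two clusters that are joined and the node height.
struct merge_step
{
    std::size_t left;
    std::size_t right;
    double height;
};

// Plain UPGMA. Leaves have ids 0..n-1, the k-th merge creates node n+k.
// The matrix is taken by value because it is overwritten during clustering.
std::vector<merge_step> upgma_sketch(symmetric_matrix_sketch<double> dist)
{
    std::size_t const n = dist.size();
    std::vector<std::size_t> node_id(n), members(n, 1);
    std::vector<bool> active(n, true);
    for (std::size_t i = 0; i < n; ++i)
        node_id[i] = i;

    std::vector<merge_step> tree;
    for (std::size_t step = 0; step + 1 < n; ++step)
    {
        // Find the closest pair of clusters that are still active.
        std::size_t a = 0, b = 0;
        double best = std::numeric_limits<double>::infinity();
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = 0; j < i; ++j)
                if (active[i] && active[j] && dist(i, j) < best)
                {
                    best = dist(i, j);
                    a = j;
                    b = i;
                }

        tree.push_back({node_id[a], node_id[b], best / 2});

        // Merge b into a: the new distance is the size-weighted average.
        for (std::size_t k = 0; k < n; ++k)
            if (active[k] && k != a && k != b)
                dist(a, k) = (members[a] * dist(a, k) + members[b] * dist(b, k)) / (members[a] + members[b]);

        members[a] += members[b];
        active[b] = false;
        node_id[a] = n + step;
    }
    return tree;
}
```

Iterating over the returned vector front to back visits the merges bottom-up, and each entry names the two sequences (or sets) to align next. Whether a flat list like this is enough or a real graph type is needed is still open.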
seqan3::fragment_library
Basically a string or set of fragments/segments, where a fragment/segment is defined as <seq1_id, seq2_id, seq1_begin_pos, seq2_begin_pos, len, score> (struct or tuple). It should have a member function that internally adds fragments based on a pairwise alignment result. It should be iterable (iterate over fragments). TODO: Check what other requirements there are on the list of fragments. In SeqAn2 it is just stored as String<Fragment<>>, but maybe we can make this more efficient by storing an unordered_map.
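A rough sketch of what such a fragment library could look like, just to have something to discuss. All names here (fragment, fragment_library_sketch, add_alignment) are hypothetical. The sketch assumes that a pairwise alignment is handed over as two gapped rows of equal length with '-' as the gap character, that every maximal gap-free run becomes one fragment, and that each fragment simply inherits the score of the whole alignment. None of this is taken from the SeqAn2 code.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical fragment type with the fields listed above.
struct fragment
{
    std::size_t seq1_id;
    std::size_t seq2_id;
    std::size_t seq1_begin_pos;
    std::size_t seq2_begin_pos;
    std::size_t len;
    int score;
};

class fragment_library_sketch
{
private:
    std::vector<fragment> fragments;

public:
    // Adds one fragment per maximal gap-free run of the given pairwise alignment.
    // Assumption: both rows have the same length and use '-' for gaps.
    void add_alignment(std::size_t id1, std::size_t id2, std::string_view row1, std::string_view row2, int score)
    {
        assert(row1.size() == row2.size());

        // pos1/pos2 count the residues of each sequence seen before the current column.
        std::size_t pos1 = 0, pos2 = 0, run = 0;
        for (std::size_t col = 0; col < row1.size(); ++col)
        {
            bool const gap1 = row1[col] == '-';
            bool const gap2 = row2[col] == '-';

            if (!gap1 && !gap2)
            {
                ++run; // extend the current gap-free run
            }
            else if (run > 0)
            {
                // A gap ends the run, so store it as a fragment.
                fragments.push_back({id1, id2, pos1 - run, pos2 - run, run, score});
                run = 0;
            }

            pos1 += !gap1;
            pos2 += !gap2;
        }

        if (run > 0) // run that reaches the end of the alignment
            fragments.push_back({id1, id2, pos1 - run, pos2 - run, run, score});
    }

    // Iteration over the stored fragments.
    auto begin() const { return fragments.begin(); }
    auto end() const { return fragments.end(); }
    std::size_t size() const { return fragments.size(); }
};
```

A std::vector keeps the sketch close to the String<Fragment<>> storage in SeqAn2. Switching to an unordered_map keyed by the sequence pair would only change the private member and the iteration functions.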
Notes
Yes, we do want the extra data structure seqan3::fragment_library, because implementing the functionality directly within a graph would be very complex. If we use the seqan3::fragment_library, we can derive the code more easily from the existing SeqAn2 code, which will save us lots of time! So we prefer this over "There is one less data structure". The extra data structure will also allow us to do 2 individual steps, which can be tested (SeqAn2 tests). If we do it directly in the graph structure, we need to do the segment generation and segment refinement at once, which is more complex (more possible errors) and harder to test.
Before we discuss the graph structures that we expose (SeqAn3 wrappers or just lemon graphs), we really need to see what functionality is needed from the graph in the functions extend_triplets and progressive_alignment. Those are way over 3000 lines of code, so this will take time.