seqan / product_backlog

This repository is used as the product backlog for all SeqAn-relevant backlog items. It is intended to organise the work for the team.

[MSA] Multiple Sequence Alignment using Tcoffee #104

Open smehringer opened 4 years ago

smehringer commented 4 years ago

Description

We basically want to port the SeqAn2 code that computes an MSA to SeqAn3.

Based on the structures in SeqAn2, this could be the structure for SeqAn3:

[Figure: MSA2]

Easy user interface

    std::vector<dna4_vector> sequences{...};
    auto result = seqan3::align_multiple(sequences, config); // result = range of aligned sequences
    // print result to .aln

To avoid a huge configuration object, the easy user interface should come with a lot of defaults. If the user wants to use a highly customizable MSA, they need to do the steps themselves.

Advanced user interface:

    std::vector<dna4_vector> sequences{...};

    // this is also basically what the body of align_multiple will do
    // -------------------------------------------------------------------------------
    seqan3::fragment_library library{};
    seqan3::symmetric_matrix distances{0.0};

    // "segment match generation", e.g. global (also add local the same way)
    for (auto res : seqan3::align_pairwise(seqan3::views::pairwise_combine(sequences)))
    {
        library.add(res);
        distances[res.seq1_id(), res.seq2_id()] = res.score();
    }

    auto graph = seqan3::refine_and_build_graph(library); // one function in seqan2, maybe two for seqan3?

    seqan3::extend_triplets(graph); // refine graph

    auto tree = seqan3::upgma(distances);
    auto alignment_graph = seqan3::graph_align_progressively(graph, tree);    
    // -------------------------------------------------------------------------------
    auto aligned_sequences = seqan3::to_aligned_sequences(alignment_graph); // ?

Alternative with the graph as a class instead of free functions:

    std::vector<dna4_vector> sequences{...};

    // this is also basically what the body of align_multiple will do
    // -------------------------------------------------------------------------------
    seqan3::fragment_library library{};
    seqan3::symmetric_matrix distances{0.0};

    // "segment match generation", e.g. global (also add local the same way)
    for (auto res : seqan3::align_pairwise(seqan3::views::pairwise_combine(sequences)))
    {
        library.add(res);
        distances[res.seq1_id(), res.seq2_id()] = res.score();
    }

    seqan3::some_graph graph{library}; // one function in seqan2, maybe two for seqan3?
    graph.extend_triplets(); // refine graph, may be done directly in constructor

    auto tree = seqan3::upgma(distances);
    auto alignment_graph = seqan3::graph_align_progressively(graph, tree);    
    // -------------------------------------------------------------------------------
    auto aligned_sequences = seqan3::to_aligned_sequences(alignment_graph); // ?
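
To make the `symmetric_matrix` and `upgma` steps above more concrete, here is a minimal, self-contained sketch in plain C++. Nothing in it is existing SeqAn3 API: the class `symmetric_matrix`, the function `upgma`, the helper `example_tree` and the string-based tree output are all hypothetical stand-ins. It also assumes the matrix holds distances (smaller means closer), so alignment scores would have to be converted first.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the proposed seqan3::symmetric_matrix.
// Only the upper triangle (i < j) of the n x n matrix is stored.
class symmetric_matrix
{
public:
    symmetric_matrix(std::size_t n, double init) : n_{n}, data_(n * (n - 1) / 2, init) {}

    // Access the entry for the unordered pair {i, j}; the diagonal is not stored.
    double & operator()(std::size_t i, std::size_t j)
    {
        assert(i != j && i < n_ && j < n_);
        if (i > j)
            std::swap(i, j);
        return data_[i * n_ - i * (i + 1) / 2 + (j - i - 1)];
    }

    std::size_t size() const { return n_; }

private:
    std::size_t n_;
    std::vector<double> data_;
};

// Hypothetical upgma(): builds a guide tree by repeatedly merging the two
// closest clusters and returns it in Newick-like notation without branch lengths.
std::string upgma(symmetric_matrix distances)
{
    std::size_t const n = distances.size();
    std::vector<std::string> label(n);
    std::vector<std::size_t> leaves(n, 1); // number of sequences per cluster
    std::vector<bool> active(n, true);
    for (std::size_t i = 0; i < n; ++i)
        label[i] = std::to_string(i);

    for (std::size_t merges = 1; merges < n; ++merges)
    {
        // Find the closest pair of active clusters.
        std::size_t a = 0, b = 0;
        double best = std::numeric_limits<double>::infinity();
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = i + 1; j < n; ++j)
                if (active[i] && active[j] && distances(i, j) < best)
                {
                    best = distances(i, j);
                    a = i;
                    b = j;
                }
        assert(a != b); // requires finite distances

        // Merge b into a: new distance is the size-weighted average.
        for (std::size_t k = 0; k < n; ++k)
            if (active[k] && k != a && k != b)
                distances(a, k) = (distances(a, k) * static_cast<double>(leaves[a])
                                   + distances(b, k) * static_cast<double>(leaves[b]))
                                  / static_cast<double>(leaves[a] + leaves[b]);

        label[a] = "(" + label[a] + "," + label[b] + ")";
        leaves[a] += leaves[b];
        active[b] = false;
    }

    for (std::size_t i = 0; i < n; ++i)
        if (active[i])
            return label[i];
    return {};
}

// Usage: four sequences where 0/1 and 2/3 are close to each other.
std::string example_tree()
{
    symmetric_matrix distances(4, 10.0);
    distances(0, 1) = 2.0;
    distances(2, 3) = 4.0;
    return upgma(distances);
}
```

A real implementation would probably return a proper tree type rather than a string, since `graph_align_progressively` needs to traverse the tree.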

Dependencies

that need to be present in seqan3 before Tcoffee can be implemented.

new stuff that is independent from the actual tcoffee stuff:

Notes

smehringer commented 4 years ago

There is also a first feature request from an external user: seqan/seqan3#1763. Based on what I read, he needs the alignment graph as output, not just the aligned sequences.

smehringer commented 2 years ago

From @joergi-w (https://github.com/seqan/seqan3/issues/2281)

Feature request

Support the full specification of the T-COFFEE_LIB_FORMAT_01 such that the t-coffee application can read in the t-coffee library format with ranges of residues. Note that at the moment only alignment information given as single residue pairs is supported, which is quite limiting as many programs can write out the extended format version.

To the developers: In SeqAn2 there exists only a partial implementation of this format, as the ranges are not supported (BadLexicalCast error).

We received a feature request for this from a user of SeqAn2's T-Coffee app, and I'm posting it here so that it will be considered for SeqAn3.
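
To illustrate what supporting ranges could mean for the reader, here is a rough sketch of parsing one line inside a `#i j` section. It rests on an assumption that should be checked against the T-Coffee documentation: that a range is written as `+BLOCK+ <length> <pos1> <pos2> <weight>`, while a single pair is `<pos1> <pos2> <weight>`. The names `residue_pair` and `parse_pair_line` are hypothetical and not part of SeqAn2 or SeqAn3.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record for one weighted residue match between two sequences.
struct residue_pair
{
    std::size_t pos1;
    std::size_t pos2;
    int weight;
};

// Parses one line of a "#i j" section (hypothetical helper).
// A "+BLOCK+" line (assumed syntax, see above) is expanded into one pair
// per residue; a malformed line yields an empty vector.
std::vector<residue_pair> parse_pair_line(std::string const & line)
{
    std::vector<residue_pair> pairs;
    std::istringstream in{line};

    std::string first;
    if (!(in >> first))
        return pairs;

    std::size_t length = 1, pos1 = 0, pos2 = 0;
    int weight = 0;

    if (first == "+BLOCK+")
    {
        if (!(in >> length >> pos1 >> pos2 >> weight))
            return pairs;
    }
    else
    {
        // Single residue pair: the first token is already pos1.
        std::istringstream number{first};
        if (!(number >> pos1) || !(in >> pos2 >> weight))
            return pairs;
    }

    // Every residue pair inside a block gets the same weight.
    for (std::size_t k = 0; k < length; ++k)
        pairs.push_back({pos1 + k, pos2 + k, weight});

    return pairs;
}
```

Expanding a block into single pairs keeps the rest of the pipeline unchanged; storing the block as one segment match would be more compact and might fit the fragment library better.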

smehringer commented 2 years ago

See also feature request https://github.com/seqan/seqan3/issues/2674