seqan / seqan3

The modern C++ library for sequence analysis. Contains version 3 of the library and API docs.
https://www.seqan.de
Other
406 stars 82 forks source link

several same-score pairwise alignments #3198

Open notestaff opened 11 months ago

notestaff commented 11 months ago

Platform

Linux bpb23-acc 5.4.0-136-generic #153-Ubuntu SMP Thu Nov 24 15:56:58 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Question

In pairwise_align(), when several possible alignments have the same score, how does the code pick which one to return? Can it be asked to return all of them (lazily), like BioPython's Bio.Align.PairwiseAligner does? Thanks!

@eseiler

eseiler commented 11 months ago

Hey @notestaff,

if there are multiple valid alignments, the traceback will choose in this order:

  1. Diagonal (match)
  2. Vertical (gap in the first sequence)
  3. Horizontal (gap in the second sequence)


The alignments are always evaluated lazily, i.e., they are only computed if you access them.

As far as I am aware, there is no way to force an output of all possible (alternative) alignments.

Click to show example ```cpp #include #include #include #include #include #include int main() { using namespace seqan3::literals; std::vector const sequence_1{"ACGT"_dna4}; std::vector const sequence_2{"AGCT"_dna4}; auto config = seqan3::align_cfg::method_global{} | seqan3::align_cfg::edit_scheme; for (auto const & result : seqan3::align_pairwise(std::tie(sequence_1, sequence_2), config)) seqan3::debug_stream << result << '\n'; } ``` Will output: ``` {sequence1 id: 0, sequence2 id: 0, score: -2, begin: (0,0), end: (4,4), alignment: 0 . A-CGT | | | AGC-T } ``` An alternative is: ``` ACG-T | | | A-GCT ``` but the algorithm will prefer putting the gap in the first sequence first.


@rrahn will also have a look at your questions and correct me/expand on my answers if necessary :)

What would the use case be for iterating over all alternative alignments? Please don't hesitate to ask more questions if something is unclear.

notestaff commented 11 months ago

Hi @eseiler and @rrahn,

Thanks for the response (and for seqan generally). My current use case involves converting some code from python/biopython to C++/seqan3, and trying to ensure that the new code exactly reproduces the old results. I've realized now that the code path which used alignment enumeration is inactive, so don't have an immediate need for that function. However, it would help a lot if there was a systematic way to exactly reproduce, in seqan3, biopython's choice of alignment from among the top-scoring alignments. Maybe, the order in which the traceback chooses directions could be explicitly parameterized (with compile-time choices)? Then I could choose the order corresponding to biopython's implementation.

rrahn commented 11 months ago

Hi @notestaff,

thanks for writing us. If I understand you correctly, then you are interested in all cooptimal alignments that yield an optimal alignment score? The short answer is: It is indirectly possible by forcing align_pairwise(...) to output the trace matrix:

#include <seqan3/alignment/configuration/align_config_debug.hpp>
#include <seqan3/alignment/configuration/align_config_gap_cost_affine.hpp>
#include <seqan3/alignment/configuration/align_config_method.hpp>
#include <seqan3/alignment/configuration/align_config_scoring_scheme.hpp>
#include <seqan3/alignment/matrix/detail/debug_matrix.hpp>
#include <seqan3/alignment/matrix/detail/trace_directions.hpp>
#include <seqan3/alignment/pairwise/align_pairwise.hpp>
#include <seqan3/alignment/scoring/nucleotide_scoring_scheme.hpp>
#include <seqan3/alphabet/nucleotide/dna4.hpp>

auto align_config = seqan3::align_cfg::method_global{} | 
                    seqan3::align_cfg::scoring_scheme{
                        seqan3::nucleotide_scoring_scheme{seqan3::match_score{4},
                                                          seqan3::mismatch_score{-5}}
                    } |
                    seqan3::align_cfg::gap_cost_affine{seqan3::align_cfg::open_score{-10},
                                                       seqan3::align_cfg::extension_score{-1}
                    } |
                    seqan3::align_cfg::detail::debug{};

auto source = "AACCGGTTAACCGGTT"_dna4;
auto target = "ACGTACGTA"_dna4;

using trace_matrix_t = seqan3::detail::two_dimensional_matrix<std::optional<seqan3::detail::trace_directions>>; 
auto alignments = seqan3::align_pairwise(std::tie(source, target), align_config);

for (auto && alignment : alignments) {
    trace_matrix_t trace_matrix{alignment.trace_matrix()};
    // in case you are using local alignments you can use 
    // alignment.sequence1_end_position()/alignment.sequence2_end_position() to find the column and row index where the alignment starts in the trace matrix and 
    // alignment.sequence1_begin_position()/alignment.sequence2_begin_position() to find the column and respectively row index where the alignment ends in the trace matrix.
    // you can then implement a traversal over the trace matrix to enumerate all paths between the end and begin point. 
}

Note that this soulution, however, is not implemented with efficiency in mind but thought as a debug feature to evaluate the correctness of the computed alignments. Still it would give you an entry point to replicate the desired BioPython functionality. You can find the corresponding API documentation of the trace_matrix in the developer API: http://docs.seqan.de/seqan3/3-master-dev/classseqan3_1_1detail_1_1two__dimensional__matrix.html.

If anything is unclear or does not work as expected, just contact us.

ChristopherAdelmann commented 4 months ago

Hi @rrahn and @eseiler,

Thanks for that suggestion. I am trying to extract all cooptimal alignments for local alignments from the debug trace matrix. Here, I try to utilize the seqan3::detail::trace_iterator to construct the trace paths from all max score indices obtained from the score matrix using customized methods:

auto tracePath(const seqan3::detail::matrix_coordinate &trace_begin,
                   const seqan3::detail::two_dimensional_matrix<seqan3::detail::trace_directions>
                       &complete_matrix) {
        using matrix_t = seqan3::detail::two_dimensional_matrix<seqan3::detail::trace_directions>;
        using matrix_iter_t = std::ranges::iterator_t<matrix_t const>;
        using trace_iterator_t = seqan3::detail::trace_iterator<matrix_iter_t>;
        using path_t = std::ranges::subrange<trace_iterator_t, std::default_sentinel_t>;

        seqan3::detail::matrix_offset const offset{trace_begin};

        return path_t{trace_iterator_t{complete_matrix.begin() + offset}, std::default_sentinel};
    };

template <typename sequence_pair_t, typename score_t, typename matrix_coordinate_t>
AlignmentResult makeResult(const sequence_pair_t &sequencePair, score_t score,
                               matrix_coordinate_t endPositions, auto const &alignmentMatrix) {
        const size_t elementsN = alignmentMatrix.rows() * alignmentMatrix.cols();
        std::vector<seqan3::detail::trace_directions> traceDirections;
        traceDirections.reserve(elementsN);

        for (const auto &direction : alignmentMatrix) {
            traceDirections.push_back(direction.value_or(seqan3::detail::trace_directions::none));
        }

        seqan3::detail::number_rows rowsN{alignmentMatrix.rows()};
        seqan3::detail::number_cols colsN{alignmentMatrix.cols()};

        seqan3::detail::two_dimensional_matrix<seqan3::detail::trace_directions> traceMatrix(
            rowsN, colsN, traceDirections);

        using std::get;
        seqan3::detail::aligned_sequence_builder builder{get<0>(sequencePair),
                                                         get<1>(sequencePair)};
        auto trace_result = builder(CoOptimalPairwiseAligner::tracePath(endPositions, traceMatrix));

        return {score,
                std::make_pair(trace_result.first_sequence_slice_positions.first,
                               trace_result.second_sequence_slice_positions.first),
                std::make_pair(endPositions.row, endPositions.col)};
    };

This works sometimes, but at some point I got the error: Assertion failed: (current_direction == trace_directions::diagonal), function operator++, file trace_iterator_base.hpp, line 169..

After digging, I discovered that the internal trace matrix is modified before being returned as a debug matrix. This seems to cause the issue. After removing these lines, it worked fine: https://github.com/seqan/seqan3/blob/bcc142e425fd6d201c207827e4bec2a072c47d06/include/seqan3/alignment/pairwise/alignment_algorithm.hpp#L740-L763

So my questions are twofold: is there any way to obtain the unmodified trace matrix without modifying your source code, and are there any plans for outputting the cooptimal alignments in the future?

Thanks again!

rrahn commented 4 months ago

Hi @ChristopherAdelmann thanks for digging into this. Is there any chance you can provide us minimum working example that fails? Right now, it is hard for me to tell what exactly causes the bug you are experiencing. I would need to investigate why this modification of the debug tracematrix causes your assertion. To your first question, I am afraid, there is no simple way to disable this in the current version, except that I might be able to fix this when I can reproduce the bug. To you second question, yes, we are going to have this in a future versio of SeqAn. Not sure when this will be ready, as it comes with some major changes of the library modularisation in the back.

ChristopherAdelmann commented 4 months ago

Hi @rrahn,

of course, here is some sample code that reproduces the issue. I guess the scoring scheme is not strictly necessary, but it was easier for me to reproduce it this way. Additionally, I was wondering whether it is necessary to return the trace matrix as a vector of optionals or if it is possible to return the non-optional version.

#include <seqan3/alphabet/nucleotide/dna5.hpp>
#include <seqan3/alignment/configuration/all.hpp>
#include <seqan3/alignment/matrix/detail/aligned_sequence_builder.hpp>
#include <seqan3/alignment/matrix/detail/debug_matrix.hpp>
#include <seqan3/alignment/matrix/detail/matrix_coordinate.hpp>
#include <seqan3/alignment/matrix/detail/trace_directions.hpp>
#include <seqan3/alignment/pairwise/align_pairwise.hpp>
#include <seqan3/alignment/scoring/nucleotide_scoring_scheme.hpp>

void testCooptimalAlignments() {
        using namespace seqan3::literals;
        auto source = "GTTCGCTAGTTTCGCGCGTAGTTTCTCGTTCG"_dna5;
        auto target = "CTTCTACTTTGTTCACTTTGATAGCAAACTTAGCCGCTTTA"_dna5;

        seqan3::nucleotide_scoring_scheme scheme{seqan3::match_score{1},
                                                 seqan3::mismatch_score{-1}};

        scheme.score('A'_dna5, 'T'_dna5) = 1;
        scheme.score('T'_dna5, 'A'_dna5) = 1;
        scheme.score('G'_dna5, 'C'_dna5) = 1;
        scheme.score('C'_dna5, 'G'_dna5) = 1;
        scheme.score('G'_dna5, 'T'_dna5) = 1;
        scheme.score('T'_dna5, 'G'_dna5) = 1;

        scheme.score('T'_dna5, 'T'_dna5) = -1;
        scheme.score('A'_dna5, 'A'_dna5) = -1;
        scheme.score('C'_dna5, 'C'_dna5) = -1;
        scheme.score('G'_dna5, 'G'_dna5) = -1;

        scheme.score('A'_dna5, 'G'_dna5) = -1;
        scheme.score('A'_dna5, 'C'_dna5) = -1;
        scheme.score('G'_dna5, 'A'_dna5) = -1;
        scheme.score('C'_dna5, 'A'_dna5) = -1;

        scheme.score('T'_dna5, 'C'_dna5) = -1;
        scheme.score('C'_dna5, 'T'_dna5) = -1;

        auto alignConfig =
            seqan3::align_cfg::method_local{} | seqan3::align_cfg::scoring_scheme{scheme} |
            seqan3::align_cfg::gap_cost_affine{seqan3::align_cfg::open_score{-2},
                                               seqan3::align_cfg::extension_score{-2}} |
            seqan3::align_cfg::output_score{} | seqan3::align_cfg::output_alignment{} |
            seqan3::align_cfg::detail::debug{};

        auto alignments = seqan3::align_pairwise(std::tie(source, target), alignConfig);

        for (auto &&alignment : alignments) {
            const int score = alignment.score();

            using TraceMatrix = seqan3::detail::two_dimensional_matrix<
                std::optional<seqan3::detail::trace_directions>>;
            using ScoreMatrix = seqan3::detail::row_wise_matrix<std::optional<int>>;
            const TraceMatrix traceMatrixOptional{alignment.trace_matrix()};
            const ScoreMatrix scoreMatrix{alignment.score_matrix()};

            seqan3::debug_stream << "Score: " << score << std::endl;
            seqan3::debug_stream << scoreMatrix << std::endl;

            auto it = std::ranges::find(scoreMatrix, score);
            while (it != scoreMatrix.end()) {
                size_t row = std::distance(scoreMatrix.begin(), it) / scoreMatrix.cols();
                size_t col = std::distance(scoreMatrix.begin(), it) % scoreMatrix.cols();

                seqan3::detail::matrix_coordinate traceBegin{
                    seqan3::detail::row_index_type{row}, seqan3::detail::column_index_type{col}};

                std::vector<seqan3::detail::trace_directions> traceDirections;

                std::transform(
                    traceMatrixOptional.begin(), traceMatrixOptional.end(),
                    std::back_inserter(traceDirections), [](const auto &direction) {
                        return direction.value_or(seqan3::detail::trace_directions::none);
                    });

                seqan3::detail::number_rows rowsN{traceMatrixOptional.rows()};
                seqan3::detail::number_cols colsN{traceMatrixOptional.cols()};

                seqan3::detail::two_dimensional_matrix<seqan3::detail::trace_directions>
                    traceMatrix(rowsN, colsN, traceDirections);

                seqan3::detail::aligned_sequence_builder builder{source, target};

                using matrix_t =
                    seqan3::detail::two_dimensional_matrix<seqan3::detail::trace_directions>;
                using matrix_iter_t = std::ranges::iterator_t<matrix_t const>;
                using trace_iterator_t = seqan3::detail::trace_iterator<matrix_iter_t>;
                using path_t = std::ranges::subrange<trace_iterator_t, std::default_sentinel_t>;

                seqan3::detail::matrix_offset const offset{traceBegin};

                auto cooptimalTracePath =
                    path_t{trace_iterator_t{traceMatrix.begin() + offset}, std::default_sentinel};

                seqan3::debug_stream << "Trace path: " << cooptimalTracePath << std::endl;

                it = std::ranges::find(it + 1, scoreMatrix.end(), score);
            }
        }
    };