seqan / seqan3

The modern C++ library for sequence analysis. Contains version 3 of the library and API docs.
https://www.seqan.de
Other
410 stars 82 forks source link

[Alignment] Meta-issue #314

Open rrahn opened 6 years ago

rrahn commented 6 years ago

This is a meta-issue collecting tasks which are transformed into single issues:

The implementation is blocked by #295

General

align seems to be the best fit for the namespace in order to distinguish the configurations from other modules. Accordingly, I suggest to call the final invocation functions align_pairwise(...) and align_multiple(...) to distinguish between pairwise and multiple alignment, which differ in the input (2 sets vs 1 set of sequences). If both function remain in the namespace of align, I would suggest only the shorter version align::pairwise(...) and align::multiple(...).

Things to make better in seqan3 than in seqan2

Required Configurations:

Optional Configuration Elements

Execution

Test functionality

Parallelization

This is splitted into epics: seqan/product_backlog#54 and seqan/product_backlog#57

Examples:

auto cfg = align_cfg::gap_affine(gap_open_cost{-10}, align::gap_extend_cost{-1}) | 
            align_cfg::scoring(matrix) |
            align_cfg::result(trace::alignment | trace::gaps_left)

// Delegate interface.
align_pairwise(input_rng, cfg) | on_hit([](auto & res) { std::cout << res.score << "\n" << res.alignment | view::to_cigar; });

// Range based interface
for (auto && res : align_pairwise(input_rng, cfg))
{
    std::cout << res.score << "\n" << res.alignment;
}
h-2 commented 6 years ago

A few things come to my mind scanning over all the new issues:

In general there seem to be three dimensions of configuration right now:

  1. pairwise function or msa function call
  2. local config or global config (what about semi-global...?)
  3. the actual config elements

To be honest, I think that's a bit much. Why again did global VS local have to be separate configs and not config elements? We could reduce the complexity of the configuration by one dimension if users could just do auto cfg = align::with_gap(align::affine, align::gap_open{10}, align::gap_extend{1}) | align::as_global_alignment or | align::with_align_type{align::local | align::global | align::semi_global}.

Myers. Why is it a different align_config? That's seems strange, because myers is not orthogonal to local or global, it's just a different algorithm for doing either, or not? Ideally, I would like to not expose this at all and just pick myers implementation in the background for all pairwise alignments where the config allows this and we expect myers to be faster. If this cannot be reasonably assumed (what are the corner cases?), I would rather implement the switch at the top-level, i.e. have align::pairwise, align::multi and align::pairwise_bitvector functions or something like that. This would underline that it is another type of pairwise alignment (with a more restricted set of allowed config options), but not a third option besides local and global. This option is also the easiest, if one wants to adapt an "automatic" dispatching to the optimal pairwise later on (based on config choices, sequence lengths...).

Concerning naming: as we discussed before, we don't use literal algorithm names (or author names) anywhere else in the library which is good, because it's technical detail. We don't want our users to have to know who Myers is, that's why I proposed "pairwise_bitvector" above, but maybe you can think of something better?

Also concerning the actual function-call being align::pairwise. I think you proposed yourself that functions should be verbs, right? Because right now it is very confusable with the configarations, after all "pairwise" could also be a config option...And I am also not sure that the actual function call should be in the namespace. We didn't do that for anything else, yet, and it also adds to confusion between the configs, config elements and function calls, and it will be callable without the namespace specifier anyway, because of ADL. I would propose instead to have the functions in the regular namespace and only put the config stuff into the extra namespace. This would be in line with other things in the alignment module also not being in the align namespace like aligned_sequence_concept, the scoring schemes et cetera. (or where you planning on putting those into seqan3::align::, too?)

We could name the functions:

seqan3::align_pairwise();
seqan3::align_pairwise_bitvector();
seqan3::align_multi();

So it would just be pulling "align" into the function name. Note that this also turns the function name into a verb again.

What do you think?

h-2 commented 6 years ago

One more thing concerning the delegate interface:

// Delegate interface.
align::pairwise(input_rng, cfg) | on_hit([](auto & res) { std::cout << res.score << "\n" << res.alignment | view::to_cigar; });

If it is implemented as above, the delegate can just work on the range, and therefore the delegate cannot be executed in the thread that computes the result, but only work serially. I would either put the on_hit back into the config (and make the function not return a range in that case), or not implement the interface at all right now (if it doesn't provide a benefit over the range interface, it's not worth it, I think.

rrahn commented 6 years ago

To be honest, I think that's a bit much. Why again did global VS local have to be separate configs and not config elements?

They don't necessarily, but given that there will be more implementations, such as xdrop or split break point, and each of them having different constraints, e.g. not all config_elements can be used arbitrarily with any configuration. On this level, we can a) document well that it is a different algorithm, and b) better check on the constraints. Also consider, that usually you don't switch between global or local within the application, but have one of them executed.

Myers. Why is it a different align_config? That's seems strange, because myers is not orthogonal to local or global, it's just a different algorithm for doing either, or not? Ideally, I would like to not expose this at all and just pick myers implementation in the background for all pairwise alignments where the config allows this and we expect myers to be faster. If this cannot be reasonably assumed (what are the corner cases?), I would rather implement the switch at the top-level, i.e. have align::pairwise, align::multi and align::pairwise_bitvector functions or something like that. This would underline that it is another type of pairwise alignment (with a more restricted set of allowed config options), but not a third option besides local and global. This option is also the easiest, if one wants to adapt an "automatic" dispatching to the optimal pairwise later on (based on config choices, sequence lengths...).

The problem with the pick and choose is, that some configurations would automatically switch the runtime behaviours in the background, which in case of the Myers' is tremendous. It might be possible to integrate a configuration that suggests to the user that the most efficient implementation is used if applicable (nicely, if it would expand on SIMD implementations as well - I will think about it), but I at least want to have a specific configuration to give the specific algorithm. IMO since Myers' is extremely limited it does not make sense to have them as a configuration element. And I don't see the benefit for the top level function, as the configuration is already top level.

Concerning naming: as we discussed before, we don't use literal algorithm names (or author names) anywhere else in the library which is good, because it's technical detail. We don't want our users to have to know who Myers is, that's why I proposed "pairwise_bitvector" above, but maybe you can think of something better?

I don't think another top-level function is fitting in this case. I do agree of giving it another name but I think, bitvector is also insignificant. I could imagine a cfg as global_fast_config.

Also concerning the actual function-call being align::pairwise. I think you proposed yourself that functions should be verbs, right? Because right now it is very confusable with the configarations, after all "pairwise" could also be a config option...And I am also not sure that the actual function call should be in the namespace. We didn't do that for anything else, yet, and it also adds to confusion between the configs, config elements and function calls, and it will be callable without the namespace specifier anyway, because of ADL. I would propose instead to have the functions in the regular namespace and only put the config stuff into the extra namespace. This would be in line with other things in the alignment module also not being in the align namespace like aligned_sequence_concept, the scoring schemes et cetera. (or where you planning on putting those into seqan3::align::, too?)

I in general agree, that's why added the other possibility as a different strategy so we can discuss this. I at least didn't want to have something like, align::align_pairwise.

Well, at the end we need to find a proper design that is less confusing for the people to find things, or where they expect things to have.

So for me putting the configurations and such into align namespace makes it clearer to look for the corresponding sources and elements. An aligned_sequence_concept is used in different places, and thus not bound to the namespace seqan3::align I guess. I would use the same argument for the high-level invocation functions. I think it feels natural, as we would explain first the algorithms, and then how to configure them, and there it is clear that the configurations are subsets of the associated algorithms.

// Delegate interface. align::pairwise(input_rng, cfg) | on_hit([](auto & res) { std::cout << res.score << "\n" << res.alignment | view::to_cigar; }); If it is implemented as above, the delegate can just work on the range, and therefore the delegate cannot be executed in the thread that computes the result, but only work serially. I would either put the on_hit back into the config (and make the function not return a range in that case), or not implement the interface at all right now (if it doesn't provide a benefit over the range interface, it's not worth it, I think.

This is exactly something I need to discuss with @marehr. In general on_hit knows all about the range, and can manage the parallelisation of the range. I can even imagine to further specify the execution mode via the on_hit adaptor (the name could also be different if necessary) and/or some other executors. This design has the advantage, that I don't need to repeatedly implement wrapper interfaces for the algorithms as long as they expose a range interface over the results. It is more like a parallel for_each. The other option is to always pass the executor and put extra stuff into the interface.

But honestly, I can't say anything right now about the limitations/usage scenarios so this needs to be further specified. It is not on the immediate todo anyway. We need to learn more from the push_mi design and the executor design and see which direction the standard is going.

marehr commented 6 years ago

@rrahn I added https://github.com/seqan/seqan/issues/355 as a design consideration.

cc: @smehringer, @h-2

h-2 commented 6 years ago

@rrahn and I already agreed on doing it the "right way" :wink:

xiamaz commented 2 years ago

@rrahn Is xdrop already implemented in seqan3?

smehringer commented 2 years ago

Hi @xiamaz,

Thanks for reaching out. Unfortunately, as far as I know, xdrop hasn't been implemented yet. To be sure we have to wait until @rrahn answers because I know he has been working on the alignment module.