Closed rrahn closed 4 years ago
I think instead of giving the view a kmer size and a minimizer size, we should use seqan3::shape. Furthermore, a minimizer view should have an optional argument named seed, which determines which value is used for randomization.
Good points. Can you sketch how you think or how you have planned to use the view as a small snippet here?
My idea would be something like that:
// o for seed, shape, window_size (maybe in a different order, if seed and window_size are optional.
auto test = "ACGGCGACGTTTAG" | seqan3::views::minimizer(0, 0b1001_shape, 8);
// Should result in A--G, C--C, A--T, a--g, a--c - "-" for gap
auto test = "ACGGCGACGTTTAG" | seqan3::views::minimizer(0,seqan3::ungapped{4}, 8);
// Should result in:
//ACGG, CGAC, ACGT, aacg, aaac - lowercase for reverse complement
// If k=w and user does not care which seed is used, only shape is given. Resulting minimizers
// depend on the default seed (some random value we pick)
auto test = "ACGGCGACGTTTAG" | seqan3::views::minimizer(seqan3::ungapped{4});
Ok I think we should introduce a minimiser_shape
object:
minimiser_shape min_shape{.shape = ungapped{5}, .window_size = 8}
Then the view could be (I would call it minimise):
auto test = std::string_view{"ACGGCGACGTTTAG"} | seqan3::views::minimise(minimiser_shape{...}, seed);
In this case we can avoid adding a strong type for the window size and the seed. Also the minimiser_shape now groups the relevant information necessary to describe such a scheme.
In that case we should think about the properties that a minimiser_shape should offer.
Given that we have a view called kmer_hash it might be better to call the view:
minimiser_hash
Are we using minimiser with an s or an z?
@smehringer and I discussed to have a view named minimizer, which gets a range of kmer_hash values and a view named minimizer_hash, which applies to a given text first kmer_hash and then minimizer.
I do not really know, what the advantages or disadvantages of using a minimizer shape are, so I would just implement whatever you prefer. :)
@rrahn Does the tag "needs refinement" now mean I should not work on this until it is tagged "ready for sprint" again?
Are we using minimiser with an s or an z?
Since we made sure we document everyting in british english it would be weird to start with american english in the name of the entities, wouldn't it?
No please continue to work on that. But for me not everything is clear yet, e.g.
@smehringer and I discussed to have a view named minimizer, which gets a range of kmer_hash values and a view named minimizer_hash, which applies to a given text first kmer_hash and then minimizer.
What is the use case for the first one? You only depicted the use case for the latter in your example.
@rrahn Does the tag "needs refinement" now mean I should not work on this until it is tagged "ready for sprint" again?
No, please continue. I meant this to be a visual flag for saying this is not completely clear from the design. Like the arguments to the view. That should be discussed. But the view implementation is clear (at least the expected one working on a text) But I agree, that is irretating to remove ready for sprint
.
Since we made sure we document everyting in british english it would be weired to start with american english in the name of the entities, wouldn't it?
True.
@smehringer and I discussed to have a view named minimizer, which gets a range of kmer_hash values and a view named minimizer_hash, which applies to a given text first kmer_hash and then minimizer.
What is the use case for the first one? You only depicted the use case for the latter in your example.
I am not sure, could be that @smehringer divided task like that, so it was easier for me to understand and implement...
No, please continue. I meant this to be a visual flag for saying this is not hunder prercent clear from the design. Like the arguments to the view. That should be discussed. But the view implementation is clear (at least the expected one working on a text) But I agree, that is irretating to remove
ready for sprint
.
Ok. :)
One use case I forgot to mention, is the case where we don't want to consider the reverse complements for defining minimizers. So:
auto test = "ACGGCGACGTTTAG" | seqan3::views::minimizer(0,seqan3::ungapped{4}, 8);
// Should result in:
//ACGG, CGAC, ACGT
That would perfectly fit into a class like the minimiser_shape
as an additional property wouldn't it?
The reason behind 2 views (names), was that computing the minimizer from a range of kmers is something unrelated to hashing. So I propose the following (no matter the naming and parameters!):
// version 1, computes the minimizers directly from a text
text | seqan3::views::minimizer_hash(...);
// version 2, computes the minimizers from a kmer range
text | seqan3::views::kmer_hash | seqan3::views::minimize(...);
(The first will just be implemented by the combination of the second)
This has the advantage that
kmer_hash
(which we would do anyway internally)This would also have implications on the " seed
". For me, this is part of hashing and not of computing the minimizer. Passing the seed as a parameter would not be necessary with my proposal as it can be done with:
auto scew = std::views::transform([] (auto n) { return n ^ some_seed_constant; });
text | seqan3::views::kmer_hash | scew | seqan3::views::minimize(...);
EDIT:
The proposal above only considers one kmer range as input. This is a use case if you want to compute minimizer on a e.g. genome. If you want to compute minimizers on reads you can not be sure if they are of the forward or reverse strand so it proved to work well if you compute the minimizer regarding both directions. In my proposal, this would implicate that (at least) the seqan3::views::minimize
would need an optional second kmer range as an input parameter.
auto reverse_kmers = test | seqan3::views::reverse | seqan3::views::kmer_hash;
// computes the minimimal kmer in a window considering two input ranges of the same size
text | seqan3::views::kmer_hash | seqan3::views::minimize(reverse_kmers, ...);
But if the seed is not part of the minimizer implementation, than using just minimise_hash or minimize will lead to the usage of the lexicographical ordering by default and because that is proven to be bad, it should not be our default. I mean I could add the transformation only to minimal hash, when I have to transform the text to hash values anyways but than minimize will continue to use the bad ordering scheme.
I mean I could add the transformation only to minimal hash, when I have to transform the text to hash values anyways
Yes, that's what I would propose.
but than minimize will continue to use the bad ordering scheme.
Well, If someone does not use our views::minimizer_hash
(that by default has the best possible behaviour), I would assume that someone knows what (s)he's doing. I would, therefore, prefer to have the most generic version of "computing a minimum hash per window" to be most flexible.
And I do have an actual use case for this: In my application, I would like to store the position of the kmer with every hash value. For this, I can implement my own hashing, with the result of a struct holding both information and being comparable by the kmer-hash value. I could then pipe these hash values to the minimizer_view that just compares "hashes" without losing my information of the position.
If we want two views, does a minimiser_shape still make sense? And if so, which view would be using it? Minimise_hash will be build on shape, window_size and maybe seed, minimise needs only kmer size and window size.
I don't think the views::minimize
needs the kmer size, if the window size will be given as the "number of hash values" within a window, instead of the "number of bases of the text" within a window (This is the only part in your code where the kmer size is needed). Then we also don't need a minimizer_shape
in my opinion.
@rrahn
If the window size is bigger than the sequence, then the window is shrinked to the size of the sequence
Does that make sense? If window_size == text_length, we have maximal text_length - kmer_size +1 many minimizers... I would assume the user made a mistake when this happens and throw.
@MitraDarja is there not only one minimiser per window? In that case if |text| <= |window| I would have exactly one minimiser, wouldn't I. If the minimiser is bigger than the text than I would have an empty range. Maybe I am not seeing something.
@rrahn Yes, sorry, we have maximal text_length - kmer_size +1 but only one minimizer, but having only one minimizer does not real have a meaning, does it? That is why I would rather throw and let the user decide, if that is what he actually wants and then he has to enter window_size=text_size or (and I think that is more likely) he will realize he entered the wrong window_size and corrects it to a smaller value.
No I think that is not the expected behavior. I mean that would mean we guarantee for the view that it always contains at least x minimisers. How should we decide what x
really is. If the user has multiple texts some are large and some are small then it is the valid behvior to see only one minimiser or even an empty range if the shape is bigger than the text as well. In this case throwing is the wrong behavior. This also does not happen in other views for example the k-mer hash view, the slice view or the chunk view.
Currently, we use std::min_element
to determine the minimum value within a window which chooses the first minimum if several exist within the same window. We should choose the last minimum that exists because it results in fewer minimizers.
Best example (k = 4, w = 20, doing 6 window shifts): First minimum in window (6 window shifts, 6 minimizers)
seq: AAAAAAAAAAAAAAAAAAAAAAAAAA
win:|--------------------|
|--------------------|
|--------------------|
|--------------------|
|--------------------|
|--------------------|
...
min: AAAA
AAAA
AAAA
AAAA
AAAA
AAAA
...
Last minimum in window (6 window shifts, 1 minimizers)
seq: AAAAAAAAAAAAAAAAAAAAAAAAAA
win:|--------------------|
|--------------------|
|--------------------|
|--------------------|
|--------------------|
|--------------------|
...
min: AAAA
AAAA
AAAA
AAAA
AAAA
AAAA
...
this should be easily done by giving a custom Compare that returns <=
instead of the default <
EDIT: This only improves the computation if we only change the minimzier if
I removed the needs refinement. But how long would you think this will take to finish? Can we assume this to be done in this iteration? Otherwise I suggest getting first a working version before adding improvements. That we can do later. Note, that we need at least a micro-benchmark to validate the improvement, also it sounds valid.
It was not meant to be done within the current PR from @MitraDarja but just as an idea dump.
I am wondering if should make a separate issue for this then. The reason is that balanaced binning directories should be prepared and tackled soon as Virtor depends on it for Ganon2. So we can optimise while he is integrataing this into the tool.
@smehringer can you please make another issue for the suggested optimisation.
We create new backlog tickets:
So, this can be closed because it is done.
Description
Acceptance Criteria
Tasks
- [x] write a view that only iterates over the minimizers inside of a given window over a underlying range - [x] add a view closure object that is configured with the window size, minimizer size and the k-mer size. How are these provided? Possibly via a strong type. What are the strong types for this - [x] test the view in different scenarios. - [x] document the views properties accordingly - [ ] write a tutorial showing a use case for the minimizer view (will be done in new backlog item) ### Definition of Done - [x] Implementation and design approved - [x] Unit tests pass - [x] Test coverage = 100% - [ ] Microbenchmarks added and/or affected microbenchmarks < 5% performance drop (create new backlog item) - [x] API documentation added - [ ] Tutorial/teaching material added (create new backlog item) - [ ] Test suite compiles in less than 30 seconds (on travis) - [x] Changelog entry added