seqan / product_backlog

This repository is used as product backlog for all SeqAn relevant backlog items. This is intended to organise the work for the team.
2 stars 1 forks source link

Reduce the k-mer content of a sequence using minimizer in a sliding window, such that run time is increased and memory usage is decreased. #35

Closed rrahn closed 4 years ago

rrahn commented 4 years ago

Description

Acceptance Criteria

Tasks

- [x] write a view that only iterates over the minimizers inside of a given window over a underlying range - [x] add a view closure object that is configured with the window size, minimizer size and the k-mer size. How are these provided? Possibly via a strong type. What are the strong types for this - [x] test the view in different scenarios. - [x] document the views properties accordingly - [ ] write a tutorial showing a use case for the minimizer view (will be done in new backlog item) ### Definition of Done - [x] Implementation and design approved - [x] Unit tests pass - [x] Test coverage = 100% - [ ] Microbenchmarks added and/or affected microbenchmarks < 5% performance drop (create new backlog item) - [x] API documentation added - [ ] Tutorial/teaching material added (create new backlog item) - [ ] Test suite compiles in less than 30 seconds (on travis) - [x] Changelog entry added
MitraDarja commented 4 years ago

I think instead of giving the view a kmer size and a minimizer size, we should use seqan3::shape. Furthermore, a minimizer view should have an optional argument named seed, which determines which value is used for randomization.

rrahn commented 4 years ago

Good points. Can you sketch how you think or how you have planned to use the view as a small snippet here?

MitraDarja commented 4 years ago

My idea would be something like that:

// o for seed, shape, window_size (maybe in a different order, if seed and window_size are optional.
auto test = "ACGGCGACGTTTAG" |  seqan3::views::minimizer(0, 0b1001_shape, 8);
// Should result in A--G, C--C, A--T, a--g, a--c - "-" for gap
auto test = "ACGGCGACGTTTAG" |  seqan3::views::minimizer(0,seqan3::ungapped{4}, 8);
// Should result in:
//ACGG, CGAC, ACGT, aacg, aaac - lowercase for reverse complement

// If k=w and user does not care which seed is used, only shape is given. Resulting minimizers 
// depend on the default seed (some random value we pick)
auto test = "ACGGCGACGTTTAG" |  seqan3::views::minimizer(seqan3::ungapped{4});
rrahn commented 4 years ago

Ok I think we should introduce a minimiser_shape object:

minimiser_shape min_shape{.shape = ungapped{5}, .window_size = 8}

Then the view could be (I would call it minimise):

auto test = std::string_view{"ACGGCGACGTTTAG"} |  seqan3::views::minimise(minimiser_shape{...}, seed);

In this case we can avoid adding a strong type for the window size and the seed. Also the minimiser_shape now groups the relevant information necessary to describe such a scheme.

In that case we should think about the properties that a minimiser_shape should offer.

Given that we have a view called kmer_hash it might be better to call the view:

minimiser_hash
MitraDarja commented 4 years ago

Are we using minimiser with an s or an z?

@smehringer and I discussed to have a view named minimizer, which gets a range of kmer_hash values and a view named minimizer_hash, which applies to a given text first kmer_hash and then minimizer.

I do not really know, what the advantages or disadvantages of using a minimizer shape are, so I would just implement whatever you prefer. :)

@rrahn Does the tag "needs refinement" now mean I should not work on this until it is tagged "ready for sprint" again?

rrahn commented 4 years ago

Are we using minimiser with an s or an z?

Since we made sure we document everyting in british english it would be weird to start with american english in the name of the entities, wouldn't it?

No please continue to work on that. But for me not everything is clear yet, e.g.

@smehringer and I discussed to have a view named minimizer, which gets a range of kmer_hash values and a view named minimizer_hash, which applies to a given text first kmer_hash and then minimizer.

What is the use case for the first one? You only depicted the use case for the latter in your example.

@rrahn Does the tag "needs refinement" now mean I should not work on this until it is tagged "ready for sprint" again?

No, please continue. I meant this to be a visual flag for saying this is not completely clear from the design. Like the arguments to the view. That should be discussed. But the view implementation is clear (at least the expected one working on a text) But I agree, that is irretating to remove ready for sprint.

MitraDarja commented 4 years ago

Since we made sure we document everyting in british english it would be weired to start with american english in the name of the entities, wouldn't it?

True.

@smehringer and I discussed to have a view named minimizer, which gets a range of kmer_hash values and a view named minimizer_hash, which applies to a given text first kmer_hash and then minimizer.

What is the use case for the first one? You only depicted the use case for the latter in your example.

I am not sure, could be that @smehringer divided task like that, so it was easier for me to understand and implement...

No, please continue. I meant this to be a visual flag for saying this is not hunder prercent clear from the design. Like the arguments to the view. That should be discussed. But the view implementation is clear (at least the expected one working on a text) But I agree, that is irretating to remove ready for sprint.

Ok. :)

One use case I forgot to mention, is the case where we don't want to consider the reverse complements for defining minimizers. So:

auto test = "ACGGCGACGTTTAG" |  seqan3::views::minimizer(0,seqan3::ungapped{4}, 8);
// Should result in:
//ACGG, CGAC, ACGT
rrahn commented 4 years ago

That would perfectly fit into a class like the minimiser_shape as an additional property wouldn't it?

smehringer commented 4 years ago

The reason behind 2 views (names), was that computing the minimizer from a range of kmers is something unrelated to hashing. So I propose the following (no matter the naming and parameters!):

// version 1, computes the minimizers directly from a text
text | seqan3::views::minimizer_hash(...); 

// version 2, computes the minimizers from a kmer range
text | seqan3::views::kmer_hash | seqan3::views::minimize(...); 

(The first will just be implemented by the combination of the second)

This has the advantage that

  1. We just reuse the kmer_hash (which we would do anyway internally)
  2. The user can plug-in a different method of hashing

This would also have implications on the " seed". For me, this is part of hashing and not of computing the minimizer. Passing the seed as a parameter would not be necessary with my proposal as it can be done with:

auto scew = std::views::transform([] (auto n) { return n ^ some_seed_constant; });
text | seqan3::views::kmer_hash | scew | seqan3::views::minimize(...); 

EDIT: The proposal above only considers one kmer range as input. This is a use case if you want to compute minimizer on a e.g. genome. If you want to compute minimizers on reads you can not be sure if they are of the forward or reverse strand so it proved to work well if you compute the minimizer regarding both directions. In my proposal, this would implicate that (at least) the seqan3::views::minimize would need an optional second kmer range as an input parameter.

auto reverse_kmers = test | seqan3::views::reverse | seqan3::views::kmer_hash;
// computes the minimimal kmer in a window considering two input ranges of the same size
text | seqan3::views::kmer_hash | seqan3::views::minimize(reverse_kmers, ...); 
MitraDarja commented 4 years ago

But if the seed is not part of the minimizer implementation, than using just minimise_hash or minimize will lead to the usage of the lexicographical ordering by default and because that is proven to be bad, it should not be our default. I mean I could add the transformation only to minimal hash, when I have to transform the text to hash values anyways but than minimize will continue to use the bad ordering scheme.

smehringer commented 4 years ago

I mean I could add the transformation only to minimal hash, when I have to transform the text to hash values anyways

Yes, that's what I would propose.

but than minimize will continue to use the bad ordering scheme.

Well, If someone does not use our views::minimizer_hash (that by default has the best possible behaviour), I would assume that someone knows what (s)he's doing. I would, therefore, prefer to have the most generic version of "computing a minimum hash per window" to be most flexible.

And I do have an actual use case for this: In my application, I would like to store the position of the kmer with every hash value. For this, I can implement my own hashing, with the result of a struct holding both information and being comparable by the kmer-hash value. I could then pipe these hash values to the minimizer_view that just compares "hashes" without losing my information of the position.

MitraDarja commented 4 years ago

If we want two views, does a minimiser_shape still make sense? And if so, which view would be using it? Minimise_hash will be build on shape, window_size and maybe seed, minimise needs only kmer size and window size.

smehringer commented 4 years ago

I don't think the views::minimize needs the kmer size, if the window size will be given as the "number of hash values" within a window, instead of the "number of bases of the text" within a window (This is the only part in your code where the kmer size is needed). Then we also don't need a minimizer_shape in my opinion.

MitraDarja commented 4 years ago

@rrahn

If the window size is bigger than the sequence, then the window is shrinked to the size of the sequence

Does that make sense? If window_size == text_length, we have maximal text_length - kmer_size +1 many minimizers... I would assume the user made a mistake when this happens and throw.

rrahn commented 4 years ago

@MitraDarja is there not only one minimiser per window? In that case if |text| <= |window| I would have exactly one minimiser, wouldn't I. If the minimiser is bigger than the text than I would have an empty range. Maybe I am not seeing something.

MitraDarja commented 4 years ago

@rrahn Yes, sorry, we have maximal text_length - kmer_size +1 but only one minimizer, but having only one minimizer does not real have a meaning, does it? That is why I would rather throw and let the user decide, if that is what he actually wants and then he has to enter window_size=text_size or (and I think that is more likely) he will realize he entered the wrong window_size and corrects it to a smaller value.

rrahn commented 4 years ago

No I think that is not the expected behavior. I mean that would mean we guarantee for the view that it always contains at least x minimisers. How should we decide what x really is. If the user has multiple texts some are large and some are small then it is the valid behvior to see only one minimiser or even an empty range if the shape is bigger than the text as well. In this case throwing is the wrong behavior. This also does not happen in other views for example the k-mer hash view, the slice view or the chunk view.

smehringer commented 4 years ago

Currently, we use std::min_element to determine the minimum value within a window which chooses the first minimum if several exist within the same window. We should choose the last minimum that exists because it results in fewer minimizers.

Best example (k = 4, w = 20, doing 6 window shifts): First minimum in window (6 window shifts, 6 minimizers)

seq: AAAAAAAAAAAAAAAAAAAAAAAAAA
win:|--------------------|
     |--------------------|
      |--------------------|
       |--------------------|
        |--------------------|
         |--------------------|
          ...
min: AAAA
      AAAA
       AAAA
        AAAA
         AAAA
          AAAA
          ...

Last minimum in window (6 window shifts, 1 minimizers)

seq: AAAAAAAAAAAAAAAAAAAAAAAAAA
win:|--------------------|
     |--------------------|
      |--------------------|
       |--------------------|
        |--------------------|
         |--------------------|
          ...
min:                 AAAA
                     AAAA
                     AAAA
                     AAAA
                     AAAA
                     AAAA
          ...

this should be easily done by giving a custom Compare that returns <= instead of the default <

EDIT: This only improves the computation if we only change the minimzier if

  1. After a window shift, the new value (and the right end of the window) is strictly smaller than the current minimum.
  2. After the current minimum has been removed from the window (on the left side) with the next window shift.
rrahn commented 4 years ago

I removed the needs refinement. But how long would you think this will take to finish? Can we assume this to be done in this iteration? Otherwise I suggest getting first a working version before adding improvements. That we can do later. Note, that we need at least a micro-benchmark to validate the improvement, also it sounds valid.

smehringer commented 4 years ago

It was not meant to be done within the current PR from @MitraDarja but just as an idea dump.

rrahn commented 4 years ago

I am wondering if should make a separate issue for this then. The reason is that balanaced binning directories should be prepared and tackled soon as Virtor depends on it for Ganon2. So we can optimise while he is integrataing this into the tool.

rrahn commented 4 years ago

@smehringer can you please make another issue for the suggested optimisation.

MitraDarja commented 4 years ago

We create new backlog tickets:

So, this can be closed because it is done.