lemire / rollinghashcpp

Rolling Hash C++ Library
186 stars 33 forks source link

Moving from N-gram to (N+1)-gram #2

Closed iminkin closed 9 years ago

iminkin commented 9 years ago

Hi,

Is it possible to get quickly hash of an (N+1)-gram given that we have a hash function object, representing an N-gram of that (N+1)-gram? I.e. given hash value of "ABCD", can I have value of "ABCDE", without computing the whole hash value?

lemire commented 9 years ago

Yes. I will write up an example later.

iminkin commented 9 years ago

Looking forward to see how :+1:

lemire commented 9 years ago

@ilyaminkin

Here is a short example that does it:

https://github.com/lemire/rollinghashcpp/blob/master/example2.cpp#L14-L32

I am closing this issue, reopen if you disagree.

iminkin commented 9 years ago

@lemire

Is it a good idea to use the ngram hash values as they are, modulo 2^n, not modulo some prime number?

lemire commented 9 years ago

@ilyaminkin Yes. It should be fine. I would not apply modulo a prime number.

E.g., in the case where the function is strongly universal, then the function modulo 2^n will still be strongly universal.

In general, modulo 2^n with a family of hash function is fine, as long as they are almost delta-universal, or some related property.

iminkin commented 9 years ago

@lemire

A few more questions:

1) Is it possible to roll the hash not only left to right, but in the "reverse" direction? My use case is the following. Say I have the string "ABCDEFG". I consider pairs of substrings: ABC/CBA, BCD/DCB, CDE/EDC and so on. If know hashes of both ABC and CBA, can I then quickly compute hashes of both BCD and DCB? Practically, I can precompute hash values in both direct and reverse directions, but I would like to avoid that. [This example is motivated by reverse-complementary sequences in biology]

2) Given hash, say, of ABC, can I quickly know the hash of ZABC, where Z is a single character?

lemire commented 9 years ago

@ilyaminkin The answer is yes in both instances. I would just need some time to come up with working examples.

iminkin commented 9 years ago

@lemire

Cool, thank you for your support.

lemire commented 9 years ago

@ilyaminkin

I have coded an example that solves the first problem :

https://github.com/lemire/rollinghashcpp/blob/master/example3.cpp#L9-L33

The second problem can be solved easily with code... but we hit one limitation of the library. It is really meant to be used for fixed length strings that you update. That is, the length of the string is a hard-coded parameter set by the constructor and it is not supposed to change. We make an exception to allow you to grow the string initially so you can reach this maximum length... But then if you are allowed to grow the length of the strings from both end, it is going to get really confusing.

It is not an algorithmic limitation... it is just a limitation with the design of the source code as it stands now.

To fix the problem, I would need to do a redesign, but this would take more time than I have right now.

If you have a bit of time and you are interested in helping out, we could collaborate toward something... but otherwise, I am afraid you will have to wait for me to have a lot more free time than I do currently.

I hope this helps.

iminkin commented 9 years ago

@lemire,

Thank you so much.

That is, the length of the string is a hard-coded parameter set by the constructor and it is not supposed to change.

A solution to this could be creating a new hash function object given an existing one with smaller length.

If you have a bit of time and you are interested in helping out, we could collaborate toward something...

I do have time and be glad to help.

lemire commented 9 years ago

@ilyaminkin

Great. I will get back to you soon.

lemire commented 9 years ago

@ilyaminkin

What we could do, if you are interested, is start by fully working out the problem that you (and I) want to solve. Feel free to email me if you want (lemire@gmail.com). For now, here are some thoughts :

  1. What does one seeks to achieve? I have the impression that you are working in genomics or something related. You do not have to reveal to me anything that would be confidential, but the more you can share about the algorithmics of the problem, the better. Can we come up, for example, with specific benchmarks? Given benchmarks, then we can seek to optimize results, it is always more fun that way...
  2. Can we come up with a relatively complete set of functions that one should support?
  3. What kind of hardware is expected? Can we assume a recent Intel processor? If so, this opens up possibly interesting doors. Haswell processors and better have fast finite field multiplication functions. See for example http://arxiv.org/abs/1503.03465
  4. If we can narrow down the characteristics of the data, that would be excellent too. For example, if it is genomic data, then presumably each "character" is not a byte.
  5. Is this something that could lead to a research paper? Would you be interested in pursuing something of the sort?
iminkin commented 9 years ago

@lemire,

I realized that for my application there is no need to redesign the library. I would benefit from two methods:

hash_extend(char Y) -- for an n-gram X it returns hash value of (n + 1)-gram XY without changing the object X. For example, if X = "ABC", then X.hash_extend("D") returns value of "ABCD" without changing the state of X

hash_prepend(char Y) -- the same, but with prepending the n-gram with character Y. If X = "ABC", then X.hash_prepend("D") returns value of "DABC" without changing the state of X

My application looks like this:

X = rolling_window(N)
while (...)
{
    hv1 = X.hash_extend(Y) //X still has length N
    //Do something using hv1 
    hv2 = X.hash_prepend(Y) //X still has length N
    //Do something using hv2
    X.update(...) 
}
lemire commented 9 years ago

@ilyaminkin

Ok.

lemire commented 9 years ago

@ilyaminkin

See last checkin. It is not tested, but should do what you want.

iminkin commented 9 years ago

Doesn't seem to work...

#include <string>
#include <memory>
#include <cassert>
#include <iostream>

#include "ngramhashing/cyclichash.h"

int main(int argc, char * argv[])
{
    CyclicHash<uint64_t>  hf(5, 19);
    string input = "XABCDY";
    string base(input.begin() + 1, input.end() - 1);
    string extend(input.begin() + 1, input.end());
    string prepend(input.begin(), input.end() - 1);

    for (char ch : base)
    {
        hf.eat(ch);
    }

    std::cout << base << " "  << hf.hash(base) << std::endl;
    std::cout << prepend << " " << hf.hash_prepend(input[0]) << " " << hf.hash(prepend) << std::endl;
    std::cout << extend << " " << hf.hash_extend(input.back()) << " " << hf.hash(extend) << std::endl;

    assert(hf.hashvalue == hf.hash(base));
    assert(hf.hash_prepend(input[0]) == hf.hash(prepend));
    assert(hf.hash_extend(input.back()) == hf.hash(extend));

    return 0;

}
lemire commented 9 years ago

@ilyaminkin As I wrote, untested code... Let me have a look.

lemire commented 9 years ago

@ilyaminkin

Fixed. A modified version of your test is now example4.cpp.

Note that you should declare the hashing object as CyclicHash<uint64_t> hf(4, 64).

iminkin commented 8 years ago

@lemire

Can we mention you in the "acknowledgements" section of our paper and presentation?

lemire commented 8 years ago

Sure.

iminkin commented 7 years ago

Our paper is out: https://doi.org/10.1093/bioinformatics/btw609

Again, thank you for cooperation.

lemire commented 7 years ago

Excellent. I will modify the README to cite your paper.