Closed iminkin closed 9 years ago
Yes. I will write up an example later.
Looking forward to seeing how :+1:
@ilyaminkin
Here is a short example that does it:
https://github.com/lemire/rollinghashcpp/blob/master/example2.cpp#L14-L32
I am closing this issue, reopen if you disagree.
@lemire
Is it a good idea to use the ngram hash values as they are, modulo 2^n, not modulo some prime number?
@ilyaminkin Yes. It should be fine. I would not apply modulo a prime number.
E.g., if the function is strongly universal, then the function modulo 2^n is still strongly universal.
In general, reducing modulo 2^n is fine for a family of hash functions, as long as the family is almost delta-universal or has some related property.
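To make the reduction concrete: taking a hash value modulo 2^n simply keeps its n low-order bits, which is a single mask. A minimal sketch follows; the helper name `low_bits` is made up for illustration and is not part of the library.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of rollinghashcpp): reduce a 64-bit hash
// value modulo 2^n by keeping only its n low-order bits.
inline std::uint64_t low_bits(std::uint64_t hash, unsigned n) {
    assert(n <= 64);                        // a 64-bit word has at most 64 bits
    if (n == 64) return hash;               // modulo 2^64 drops nothing
    return hash & ((std::uint64_t{1} << n) - 1);
}
```

For example, `low_bits(h, 20)` maps any 64-bit value `h` to a bucket index in the range 0 to 2^20 - 1 without a division.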
@lemire
A few more questions:
1) Is it possible to roll the hash not only left to right, but also in the "reverse" direction? My use case is the following. Say I have the string "ABCDEFG". I consider pairs of substrings: ABC/CBA, BCD/DCB, CDE/EDC and so on. If I know the hashes of both ABC and CBA, can I then quickly compute the hashes of both BCD and DCB? In practice, I could precompute the hash values in both the forward and reverse directions, but I would like to avoid that. [This example is motivated by reverse-complementary sequences in biology.]
2) Given the hash of, say, ABC, can I quickly compute the hash of ZABC, where Z is a single character?
@ilyaminkin The answer is yes in both instances. I would just need some time to come up with working examples.
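To illustrate why rolling in both directions is possible, here is a toy cyclic-polynomial hash over full 64-bit words. This is only a sketch under stated assumptions: the names `rotl`, `rotr`, `char_hash`, `hash_all`, `roll_forward` and `roll_backward` are invented for this example and are not the library's API, and the per-character hash is a simple mixing function standing in for a table of random values.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Toy cyclic-polynomial hash. Illustrative only: NOT the rollinghashcpp API.

// Rotate left by r bits (r is reduced modulo 64).
inline std::uint64_t rotl(std::uint64_t x, unsigned r) {
    r &= 63;
    return (x << r) | (x >> ((64 - r) & 63));
}

// Rotate right by r bits.
inline std::uint64_t rotr(std::uint64_t x, unsigned r) {
    return rotl(x, 64 - (r & 63));
}

// Stand-in per-character hash (a splitmix64-style mixer). This is an
// assumption of the sketch; any fixed character-to-word map would do.
inline std::uint64_t char_hash(unsigned char c) {
    std::uint64_t z = c + 0x9E3779B97F4A7C15ULL;
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
}

// Hash of a whole string, computed from scratch.
inline std::uint64_t hash_all(const std::string& s) {
    std::uint64_t h = 0;
    for (unsigned char c : s) h = rotl(h, 1) ^ char_hash(c);
    return h;
}

// Slide an n-character window one step right:
// drop `out` from the front, append `in` at the back.
inline std::uint64_t roll_forward(std::uint64_t h, unsigned n, char out, char in) {
    return rotl(h, 1) ^ rotl(char_hash(out), n) ^ char_hash(in);
}

// The mirror-image step: drop `out` from the back, prepend `in` at the front.
inline std::uint64_t roll_backward(std::uint64_t h, unsigned n, char out, char in) {
    assert(n >= 1);
    return rotr(h ^ char_hash(out), 1) ^ rotl(char_hash(in), n - 1);
}
```

With n = 3 on "ABCDEFG", starting from `hash_all("ABC")` and `hash_all("CBA")`, the calls `roll_forward(fwd, 3, 'A', 'D')` and `roll_backward(rev, 3, 'A', 'D')` give the hashes of "BCD" and "DCB" respectively, so the reversed substring never has to be rehashed from scratch.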
@lemire
Cool, thank you for your support.
@ilyaminkin
I have coded an example that solves the first problem:
https://github.com/lemire/rollinghashcpp/blob/master/example3.cpp#L9-L33
The second problem can be solved easily in code... but we hit one limitation of the library. It is really meant to be used for fixed-length strings that you update. That is, the length of the string is a hard-coded parameter set by the constructor and it is not supposed to change. We make an exception to allow you to grow the string initially so you can reach this maximum length... But if you are then allowed to grow the string from both ends, it is going to get really confusing.
It is not an algorithmic limitation... it is just a limitation with the design of the source code as it stands now.
To fix the problem, I would need to do a redesign, but this would take more time than I have right now.
If you have a bit of time and you are interested in helping out, we could collaborate toward something... but otherwise, I am afraid you will have to wait for me to have a lot more free time than I do currently.
I hope this helps.
@lemire,
Thank you so much.
> That is, the length of the string is a hard-coded parameter set by the constructor and it is not supposed to change.

A solution to this could be to create a new hash function object from an existing one with a smaller length.

> If you have a bit of time and you are interested in helping out, we could collaborate toward something...

I do have time and would be glad to help.
@ilyaminkin
Great. I will get back to you soon.
@ilyaminkin
What we could do, if you are interested, is start by fully working out the problem that you (and I) want to solve. Feel free to email me if you want (lemire@gmail.com). For now, here are some thoughts :
@lemire,
I realized that for my application there is no need to redesign the library. I would benefit from two methods:
hash_extend(char Y) -- for an n-gram X, it returns the hash value of the (n + 1)-gram XY without changing the object X. For example, if X = "ABC", then X.hash_extend('D') returns the hash value of "ABCD" without changing the state of X.
hash_prepend(char Y) -- the same, but prepending the character Y to the n-gram. If X = "ABC", then X.hash_prepend('D') returns the hash value of "DABC" without changing the state of X.
My application looks like this:
X = rolling_window(N)
while (...)
{
    hv1 = X.hash_extend(Y)  // X still has length N
    // Do something using hv1
    hv2 = X.hash_prepend(Y) // X still has length N
    // Do something using hv2
    X.update(...)
}
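For reference, the two requested operations can be sketched on a toy cyclic-polynomial hash. This is a minimal sketch under stated assumptions, not the library's implementation: the class `ToyCyclicHash`, its helpers `rotl` and `mix`, and the use of full 64-bit words are all choices made for this example, and both queries assume the window already holds exactly n characters.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Toy stand-in for a cyclic-polynomial rolling hash over 64-bit words.
// Illustrative only: this is NOT the rollinghashcpp implementation.
class ToyCyclicHash {
public:
    explicit ToyCyclicHash(unsigned n) : n_(n) { assert(n >= 1); }

    // Feed one character while the window is still filling up.
    void eat(char c) { value_ = rotl(value_, 1) ^ mix(c); }

    // Slide a full window: drop `out` from the front, append `in`.
    void update(char out, char in) {
        value_ = rotl(value_, 1) ^ rotl(mix(out), n_) ^ mix(in);
    }

    // Hash of the (n + 1)-gram "window + y"; the object is left unchanged.
    std::uint64_t hash_extend(char y) const { return rotl(value_, 1) ^ mix(y); }

    // Hash of the (n + 1)-gram "y + window"; the object is left unchanged.
    std::uint64_t hash_prepend(char y) const { return value_ ^ rotl(mix(y), n_); }

    std::uint64_t value() const { return value_; }

    // Reference hash of an arbitrary string, computed from scratch.
    static std::uint64_t hash(const std::string& s) {
        std::uint64_t h = 0;
        for (char c : s) h = rotl(h, 1) ^ mix(c);
        return h;
    }

private:
    // Rotate left by r bits (r is reduced modulo 64).
    static std::uint64_t rotl(std::uint64_t x, unsigned r) {
        r &= 63;
        return (x << r) | (x >> ((64 - r) & 63));
    }

    // Stand-in per-character hash (splitmix64-style mixer), an assumption
    // of this sketch rather than anything taken from the library.
    static std::uint64_t mix(char c) {
        std::uint64_t z = static_cast<unsigned char>(c) + 0x9E3779B97F4A7C15ULL;
        z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
        z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
        return z ^ (z >> 31);
    }

    unsigned n_;
    std::uint64_t value_ = 0;
};
```

Both queries are `const`, so a loop of the shape above can call them on every iteration and then call `update` to slide the window. Each query costs a rotation and an XOR, independent of n.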
@ilyaminkin
Ok.
@ilyaminkin
See last checkin. It is not tested, but should do what you want.
Doesn't seem to work...
#include <string>
#include <memory>
#include <cassert>
#include <iostream>
#include "ngramhashing/cyclichash.h"

int main(int argc, char * argv[])
{
    CyclicHash<uint64_t> hf(5, 19);
    std::string input = "XABCDY";
    // base = "ABCD", extend = "ABCDY", prepend = "XABCD"
    std::string base(input.begin() + 1, input.end() - 1);
    std::string extend(input.begin() + 1, input.end());
    std::string prepend(input.begin(), input.end() - 1);
    for (char ch : base)
    {
        hf.eat(ch);
    }
    // Print each incremental hash next to the hash computed from scratch.
    std::cout << base << " " << hf.hash(base) << std::endl;
    std::cout << prepend << " " << hf.hash_prepend(input[0]) << " " << hf.hash(prepend) << std::endl;
    std::cout << extend << " " << hf.hash_extend(input.back()) << " " << hf.hash(extend) << std::endl;
    assert(hf.hashvalue == hf.hash(base));
    assert(hf.hash_prepend(input[0]) == hf.hash(prepend));
    assert(hf.hash_extend(input.back()) == hf.hash(extend));
    return 0;
}
@ilyaminkin As I wrote, untested code... Let me have a look.
@ilyaminkin
Fixed. A modified version of your test is now example4.cpp.
Note that you should declare the hashing object as CyclicHash<uint64_t> hf(4, 64).
@lemire
Can we mention you in the "acknowledgements" section of our paper and presentation?
Sure.
Our paper is out: https://doi.org/10.1093/bioinformatics/btw609
Again, thank you for your cooperation.
Excellent. I will modify the README to cite your paper.
Hi,
Is it possible to quickly get the hash of an (N+1)-gram, given a hash function object representing an N-gram of that (N+1)-gram? I.e., given the hash value of "ABCD", can I get the value of "ABCDE" without recomputing the whole hash?