Closed leiless closed 2 years ago
Hi @leiless, the hash algorithm in rolling_hash2_run() is fairly arbitrary but should have good dispersal over 64 bits. It uses a random table defined in rolling_hash2_table.h to calculate the hash over a rolling window by width you define.
hash = 0;
for (i = 0; i < width; i++) {
hash = (hash << 1) | (hash >> (64 - 1));
hash ^= table[in_byte[i]];
}
However, updates are done more efficiently by adding the contribution from each new byte at head of window and subtracting the contribution at end of window. This is much faster than re-computing the whole window each update.
uint64_t hash_update(uint64_t hash, uint8_t new_char, uint8_t old_char)
{
hash = (hash << 1) | (hash >> (64 - 1));
hash ^= table1[new_char] ^ table2[old_char];
return h;
}
We also integrate the checking for hash & mask == trigger
such as when looking for a chunk boundary.
the hash algorithm in rolling_hash2_run() is fairly arbitrary but should have good dispersal over 64 bits
How about the dispersal for 32-bit hash?
Sure, usually you select a mask to select which bits to trigger on based on the average chunk length but no particular set of bits should be better than any other.
Sure, usually you select a mask to select which bits to trigger on based on the average chunk length but no particular set of bits should be better than any other.
Hi, I have no previously rolling hash background, but do you mean when I got an uint64_t
hash, I can do like uint64_t & 0xffffffff
(mask = 0xffffffff
) or uint64_t >> 32
to get a decent uint32_t
hash?
And, also, what does the trigger
, average chunk length
mean particularly?
Another question is that, can I run this library in a 32-bit system?
Hi, I have no previously rolling hash background, but do you mean when I got an uint64_t hash, I can do like uint64_t & 0xffffffff (mask = 0xffffffff) or uint64_t >> 32 to get a decent uint32_t hash?
Yes, you can use just part of the hash if you want.
And, also, what does the trigger, average chunk length mean particularly?
Rolling hash is often used for variable length chunking to find common boundaries from data patterns. The mask, trigger and window size are parameters that can be set to affect the quality and frequency of boundaries found.
Another question is that, can I run this library in a 32-bit system?
Many of the optimizations have not been back-ported to 32-bit but it may be possible to build a limited-functionality library in x86_32.
Hi @gbtucker, I wanna know which underlying algorithm does rolling hash implements.
My company by now uses the Adler-32 algorithm for rolling checksum. We want to migrate to ISAL rolling hash, before that, we need to know the migration costs.