intel / isa-l_crypto

Other
267 stars 80 forks source link

Which underlying algorithm does `rolling_hash` use? #84

Closed leiless closed 2 years ago

leiless commented 2 years ago

Hi @gbtucker, I wanna know which underlying algorithm does rolling hash implements.

My company by now uses the Adler-32 algorithm for rolling checksum. We want to migrate to ISAL rolling hash, before that, we need to know the migration costs.

gbtucker commented 2 years ago

Hi @leiless, the hash algorithm in rolling_hash2_run() is fairly arbitrary but should have good dispersal over 64 bits. It uses a random table defined in rolling_hash2_table.h to calculate the hash over a rolling window by width you define.

        hash = 0;
        for (i = 0; i < width; i++) {
                hash = (hash << 1) | (hash >> (64 - 1));
                hash ^= table[in_byte[i]];
        }

However, updates are done more efficiently by adding the contribution from each new byte at head of window and subtracting the contribution at end of window. This is much faster than re-computing the whole window each update.

uint64_t hash_update(uint64_t hash, uint8_t new_char, uint8_t old_char)
{
        hash = (hash << 1) | (hash >> (64 - 1));
        hash ^= table1[new_char] ^ table2[old_char];
        return h;
}

We also integrate the checking for hash & mask == trigger such as when looking for a chunk boundary.

leiless commented 2 years ago

the hash algorithm in rolling_hash2_run() is fairly arbitrary but should have good dispersal over 64 bits

How about the dispersal for 32-bit hash?

gbtucker commented 2 years ago

Sure, usually you select a mask to select which bits to trigger on based on the average chunk length but no particular set of bits should be better than any other.

leiless commented 2 years ago

Sure, usually you select a mask to select which bits to trigger on based on the average chunk length but no particular set of bits should be better than any other.

Hi, I have no previously rolling hash background, but do you mean when I got an uint64_t hash, I can do like uint64_t & 0xffffffff (mask = 0xffffffff) or uint64_t >> 32 to get a decent uint32_t hash?

And, also, what does the trigger, average chunk length mean particularly?

leiless commented 2 years ago

Another question is that, can I run this library in a 32-bit system?

gbtucker commented 2 years ago

Hi, I have no previously rolling hash background, but do you mean when I got an uint64_t hash, I can do like uint64_t & 0xffffffff (mask = 0xffffffff) or uint64_t >> 32 to get a decent uint32_t hash?

Yes, you can use just part of the hash if you want.

And, also, what does the trigger, average chunk length mean particularly?

Rolling hash is often used for variable length chunking to find common boundaries from data patterns. The mask, trigger and window size are parameters that can be set to affect the quality and frequency of boundaries found.

Another question is that, can I run this library in a 32-bit system?

Many of the optimizations have not been back-ported to 32-bit but it may be possible to build a limited-functionality library in x86_32.