Closed Artoria2e5 closed 4 years ago
I tested with this hash, but it doesn't improve the compression ratio, and it's a little slower than the original one.
Yeah, I can see it is simpler and definitely faster, but I am a bit unsure about the 131 (`0b1000_0011`) thing and how bits are (not) mixed.
Just for the sake of completeness, here are some hash functions used by hash tables in matchers:
The Knuth thing looks okay and we can probably trust it. So this:
```rust
/// Hash the u32 at buf[pos..pos + 4] to a usize (always used modulo
/// LZ_MF_BUCKET_ITEM_HASH_SIZE). u32::from_be_bytes is the safe
/// equivalent of the repo's unsafe read + to_be.
fn hash_dword(buf: &[u8], pos: usize) -> usize {
    let val = u32::from_be_bytes([buf[pos], buf[pos + 1], buf[pos + 2], buf[pos + 3]]);
    // 16 bits are enough: LZ_MF_BUCKET_ITEM_HASH_SIZE (5219) < 1 << 16
    (val.wrapping_mul(2654435761) >> (32 - 16)) as usize
}
```
It may be worse than the 131 thing, since it keeps only the high bits of the product. To mix the lower bits back in we can try `let h = (val as u64) * 2654435761` with a return of `h >> (64 - 16) ^ h >> (32 - 16)`. Or we can just keep 131.
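For completeness, the 64-bit mixing variant above could be sketched like this (`hash_dword_64` is a hypothetical name, and the safe `from_be_bytes` read stands in for the repo's unsafe helper; the result is still meant to be taken modulo LZ_MF_BUCKET_ITEM_HASH_SIZE by the caller):

```rust
/// Sketch of the 64-bit mixing variant: widen to u64 before multiplying
/// (no overflow possible), then fold the top 16 bits into the mid bits.
fn hash_dword_64(buf: &[u8], pos: usize) -> usize {
    let val = u32::from_be_bytes([buf[pos], buf[pos + 1], buf[pos + 2], buf[pos + 3]]);
    let h = (val as u64) * 2654435761;
    // Caller reduces this modulo LZ_MF_BUCKET_ITEM_HASH_SIZE.
    ((h >> (64 - 16)) ^ (h >> (32 - 16))) as usize
}
```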
131 is great. Let's keep it. https://play.rust-lang.org/?version=stable&mode=release&edition=2018&gist=f1fec7ed0a13b41bf07b46d46b3f010a
Using byte-oriented xor is also not a good idea, because xor of identical bytes is always zero, which causes a lot of hash collisions. I wouldn't suggest using xor at all: each bit of a character has its own statistical distribution, and xor just mixes it with one other bit. That's not what we want; instead we should mix each bit with all the other bits, as multiplication/division does.
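The collision problem is easy to see in a toy example (these functions are illustrations, not code from the repo): xor-folding maps every run of one repeated byte to zero, while a multiplicative hash separates them.

```rust
// Toy xor-folding hash: any input of four identical bytes yields 0,
// so "aaaa", "bbbb", "zzzz", ... all collide.
fn xor_fold(bytes: [u8; 4]) -> u8 {
    bytes[0] ^ bytes[1] ^ bytes[2] ^ bytes[3]
}

// Knuth-style multiplicative hash for contrast: repeated-byte inputs
// land in different 16-bit buckets.
fn mul_hash(bytes: [u8; 4]) -> u16 {
    (u32::from_be_bytes(bytes).wrapping_mul(2654435761) >> 16) as u16
}
```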
Coming from https://github.com/richox/orz/commit/c656c073192df135538759cb9e32b978d0612a16#r37659833
We should probably comment here that `hash_dword` is always used modulo LZ_MF_BUCKET_ITEM_HASH_SIZE (5219), or it would not make much sense to do anything here (a u32 always fits in a usize on 32- and 64-bit platforms). Now that log 5219 / log 2 ≈ 12.35, the largest we want would be a 16-bit hash function. A Pearson hash does not look too bad in this case:
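The Pearson snippet that belonged after that colon didn't survive; a 16-bit Pearson hash over the four bytes might look roughly like this (a sketch: the affine permutation table is just a stand-in for a properly shuffled one, and `pearson16` is a hypothetical name):

```rust
// Stand-in permutation of 0..=255 (167 is odd, so i * 167 + 13 mod 256
// is a bijection); a real Pearson table would be a shuffled array.
fn table() -> [u8; 256] {
    let mut t = [0u8; 256];
    for i in 0..256 {
        t[i] = ((i * 167 + 13) & 0xff) as u8;
    }
    t
}

/// 16-bit Pearson hash of buf[pos..pos + 4]: two 8-bit Pearson passes
/// with different seeds, concatenated, then reduced modulo
/// LZ_MF_BUCKET_ITEM_HASH_SIZE by the caller.
fn pearson16(buf: &[u8], pos: usize) -> usize {
    let t = table();
    let mut lo = 0u8;
    let mut hi = 1u8; // different seed so the two halves differ
    for &b in &buf[pos..pos + 4] {
        lo = t[(lo ^ b) as usize];
        hi = t[(hi ^ b) as usize];
    }
    ((hi as usize) << 8) | lo as usize
}
```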
(djb2 looks cool too, if you like the multiplication stuff.)