New hash function - Githubissues

Artoria2e5 commented 4 years ago

Coming from https://github.com/richox/orz/commit/c656c073192df135538759cb9e32b978d0612a16#r37659833

We should probably comment here that hash_dword is always modulo LZ_MF_BUCKET_ITEM_HASH_SIZE (5219), or it would not make enough sense to do anything here (usize always fits a u32 on 32 and 64-bit platforms).

Now that log 5219 / log 2 ≈ 12.35, the largest we want would be a 16-bit hash function. A pearson hash does not look too bad in this case:

let pear: [u8; 256] = /* RFC 3074 table here */;
#[inline]
fn hash_pearson(val: u32) -> u8 {
  let mut h: u8 = pear[val << 24];
  h = pear[h ^ (val << 16) % 256];
  h = pear[h ^ (val << 8)  % 256];
  h = pear[h ^ (val)       % 256];
  h
}

/// Hash a u32 from buf[pos] to a usize (always modulo LZ_MF_BUCKET_ITEM_HASH_SIZE).
unsafe fn hash_dword(buf: &[u8], pos: usize) -> usize {
   let val = buf.read::<u32>(pos).to_be() as u32;
   (hash_pearson(val) << 8) | (hash_pearson(val ^ 0x01000000))
}

(djb2 looks cool too, if you like the multiplication stuff.)

richox commented 4 years ago

i tested with this hash, but it doesn't improve the compression ratio, and it's a little slower then the original one

Artoria2e5 commented 4 years ago

Yeah, I can see it is simpler and definitely faster, but I am a bit unsure about the 131 (0b0100_0011) thing and how bits are (not) mixed.

Just for the sake of completeness, here are some hash functions used by hash tables in matchers:

lzfse uses Knuth's multiplicative hash: https://github.com/lzfse/lzfse/blob/e634ca58b4821d9f3d560cdc6df5dec02ffc93fd/src/lzfse_encode_base.c
zstd uses the same prime number Kunth stuff for 32-bit: https://github.com/facebook/zstd/blob/a880ca239b447968493dd2fed3850e766d6305cc/contrib/linux-kernel/lib/zstd/compress.c#L954-L956
xz uses something that smells like crc32. https://github.com/xz-mirror/xz/blob/c460f6defebc5a81bbca90adc2476154ca244f69/src/liblzma/lz/lz_encoder_hash.h

The Knuth thing looks okay and we can probably trust it. So this:

/// Hash a u32 from buf[pos] to a usize (always used modulo LZ_MF_BUCKET_ITEM_HASH_SIZE)
unsafe fn hash_dword(buf: &[u8], pos: usize) -> usize {
   let val = buf.read::<u32>(pos).to_be() as u32;
   // 16 == (super::LZ_MF_BUCKET_ITEM_HASH_SIZE as f32).log2().ceil() as u32
   (val * 2654435761) >> (32 - 16) as usize
}

It may be worse than the 131 thing, since it eats more higher bits. For that we can try a let h = (val as u64) * 2654435761 with a return of h >> (64 - 16) ^ h >> (32 - 16). Or we can just keep 131.

131 is great. Let's keep it. https://play.rust-lang.org/?version=stable&mode=release&edition=2018&gist=f1fec7ed0a13b41bf07b46d46b3f010a

richox commented 4 years ago

using byte oriented xor is also not a good idea, because xor with same characters always results zero, that will cause a lot of hash conflicts. i don't even suggest using xor at all, each bit of a character has its statistical principles, xor just mix it with another bit. that's not we expect. instead we should mix each bit with all other bits, like mul/div.

richox / orz

New hash function #12