rurban / smhasher

Hash function quality and speed tests
https://rurban.github.io/smhasher/
Other
1.86k stars 179 forks source link

Help to explain behavior of Java string hash code in comparison with e.G. SpookyV2 32 hash code #302

Open korvin2000 opened 1 month ago

korvin2000 commented 1 month ago

It's not a real issue , but a question or strange behaviour I can't explain. 🔢 Sorry ;-/

I have used ported SpookyV2 .3 hash32 bit version library to java and can confirm the results in your tests: for FooXXXXBar, FooBarXXXX, XXXXFooBar with 14776336 keys: I have about 1x number of collisions as expected, so roughly around 25389 +/- (so about (0.99x - 1.00x) as in your tests) . But If I am doing the same test using standard and very very simple java hash code for strings I have for all 3 cases 0 collisions. Simply no collisions at all!!! and java hash code calculated in very simple way: s[0]31^(n-1) + s[1]31^(n-2) + ... + s[n-1]

int hashCode(const std::string& value) {
    int h = 0;
    if (h == 0 && !value.empty()) {
        for (char c : value) {
            h = 31 * h + c;
        }
    }
    return h;
}
int hashCode(const char *value) {
    int h = 0;
    int length = strlen(value);
    if (h == 0 && length > 0) {
        for (int i = 0; i < length; i++) {
            h = 31 * h + value[i];
        }
    }
    return h;
}

for example for "FooBar0001" is "6FA1E30E" as hex why such very simple implementation for a string value beats any over complex implementation in terms of collisions? Many years ago I tried to find an hash code algorithm that would beat this simple implementation in java for text strings in term of collisions and tested about > 20 hash codes implementations, but I didn't find any better alternative ...