quickjs-ng / quickjs

QuickJS, the Next Generation: a mighty JavaScript engine

Set and Map are vulnerable to hash collisions #205

Open bnoordhuis opened 9 months ago

bnoordhuis commented 9 months ago
import * as os from "os"
function go(a) {
    // Time inserting every key in `a` into a fresh Set.
    const s = new Set()
    const t0 = os.now()
    for (let i = a.length; i-- > 0; s.add(a[i]));
    const t1 = os.now()
    return t1-t0
}
function fmt(x) {
    return ("       " + x).slice(-7)
}
function bench(c) {
    const p = ("0000" + c.charCodeAt(0).toString(16)).slice(-4)
    // Build keys "k", c+"k", cc+"k", ... - every suffix of c.repeat(0x1000) + "k"
    const k = c.repeat(0x1000) + "k"
    const a = []
    for (let i = k.length; i-- > 0;) a.push(k.slice(i))
    let b = []
    for (let i = 0; i < 20; i++) b.push(go(a))
    b.sort((x, y) => x-y)
    b = b.map(fmt)
    print(p, b[0], b[9], b[19]); // min, median, max
}
for (let i = 0; i < 128; i++) bench(String.fromCharCode(i))

Prints:

0000  869180  886624  905783  # 8x slower on 64k inserts!
0001   95426  102488  157594
0002  100252  106452  172442
0003   96300  102617  128371
0004  112713  114385  229245
0005   95087  102894  108105
0006  105638  107379  161700
0007  102093  105335  168217
...

I exploit a weakness of map_hash_key() by inserting keys with ever longer zero prefixes - "k", "\0k", "\0\0k", etc. - that all hash to the same value and end up in the same hash bucket.
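For illustration, here is a minimal standalone model of why those keys collide. It assumes a multiply-and-add string hash that starts from a zero state; the multiplier 263 and the starting value are assumptions in the style of QuickJS's string hashing, not a verbatim copy of map_hash_key(). A leading NUL byte maps state 0 back to 0, so any number of them is invisible to the hash.

#include <stdint.h>
#include <stdio.h>

/* Toy multiply-and-add string hash; the multiplier and zero starting
 * state are assumptions for illustration, not the exact QuickJS code. */
static uint32_t toy_hash(const uint8_t *str, size_t len)
{
    uint32_t h = 0;
    for (size_t i = 0; i < len; i++)
        h = h * 263 + str[i];
    return h;
}

int main(void)
{
    /* 0 * 263 + 0 == 0, so leading NUL bytes leave the state untouched
     * and "k", "\0k", "\0\0\0\0k", ... all get the same hash. */
    printf("%08x %08x %08x\n",
           toy_hash((const uint8_t *)"k", 1),
           toy_hash((const uint8_t *)"\0k", 2),
           toy_hash((const uint8_t *)"\0\0\0\0k", 5));
    return 0;
}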

Doubles-as-keys are exploitable too, and maybe ints as well, because they are cast to doubles before hashing. The hash algorithm is:

/* u overlays a double with two 32-bit words */
u.d = d;
h = (u.u32[0] ^ u.u32[1]) * 3163;

But there are literally millions of normal and subnormal doubles whose high and low 32-bit words are bitwise identical; those words cancel each other out when XORed, so every such key hashes to zero.
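As a quick sanity check of that claim (a standalone sketch, not QuickJS code): any 64-bit pattern whose high and low words are equal decodes to some double, and all of them hash to zero under the XOR-and-multiply scheme quoted above.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Build doubles whose high and low 32-bit words are identical; the
     * small word values used here happen to yield subnormal doubles. */
    for (uint32_t w = 1; w <= 5; w++) {
        uint64_t bits = ((uint64_t)w << 32) | w;
        double d;
        memcpy(&d, &bits, sizeof d);       /* reinterpret the bit pattern */
        uint32_t lo = (uint32_t)bits, hi = (uint32_t)(bits >> 32);
        uint32_t h = (lo ^ hi) * 3163;     /* the quoted hash: always 0 here */
        printf("d=%.17g hash=%u\n", d, h);
    }
    return 0;
}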

bnoordhuis commented 9 months ago

Tangential: other JS engines (and Python, Ruby, etc.) added random seeds to their hash tables about 10 years ago to mitigate hash-flooding DoS attacks. We probably want to do likewise.
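A minimal sketch of what that could look like here, assuming a per-runtime secret that seeds the string hash state (rt_hash_seed and the hash itself are illustrative assumptions, not existing QuickJS API):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Illustrative per-runtime hash seed; a real engine would fill this from
 * getentropy()/arc4random() when the runtime is created. */
static uint32_t rt_hash_seed;

static uint32_t seeded_hash(const uint8_t *str, size_t len)
{
    uint32_t h = rt_hash_seed;          /* instead of starting from 0 */
    for (size_t i = 0; i < len; i++)
        h = h * 263 + str[i];
    return h;
}

int main(void)
{
    /* rand() keeps the sketch portable; it is not a substitute for a CSPRNG. */
    srand((unsigned)time(NULL));
    rt_hash_seed = ((uint32_t)rand() << 16) ^ (uint32_t)rand();
    /* With a nonzero seed, "k" and "\0k" no longer hash the same. */
    printf("hash(\"k\")   = %08x\n", seeded_hash((const uint8_t *)"k", 1));
    printf("hash(\"\\0k\") = %08x\n", seeded_hash((const uint8_t *)"\0k", 2));
    return 0;
}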

chqrlie commented 7 months ago

As the /* XXX: better hash ? */ comment in the source indicates, the hash methods are simplistic and easy to defeat.

Combining the string length into the hash value is a simple way to hide the problem; implementing a fast, reliable hash is of course a better approach.
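A standalone sketch of that length-mixing idea (an illustration of the suggestion, not an actual patch): folding the length into the initial state makes the zero-prefixed keys from the report hash differently, even though the NUL bytes themselves still contribute nothing.

#include <stdint.h>
#include <stdio.h>

/* Fold the string length into the starting state of the toy hash. */
static uint32_t hash_with_len(const uint8_t *str, size_t len)
{
    uint32_t h = (uint32_t)len;
    for (size_t i = 0; i < len; i++)
        h = h * 263 + str[i];
    return h;
}

int main(void)
{
    /* "k", "\0k" and "\0\0k" now land in different buckets. */
    printf("%08x %08x %08x\n",
           hash_with_len((const uint8_t *)"k", 1),
           hash_with_len((const uint8_t *)"\0k", 2),
           hash_with_len((const uint8_t *)"\0\0k", 3));
    return 0;
}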

Note, by the way, that the hash performed on BigInt values might pose problems too, as I am not convinced the representation in the array is necessarily normalized. I discussed this with Fabrice and we agree a fresh biginteger package is necessary and would also vastly improve the bigint performance.

bnoordhuis commented 6 months ago

> a fresh biginteger package is necessary and would also vastly improve the bigint performance

FWIW, I thought along similar lines and even made a start but ran out of time/steam.

chqrlie commented 6 months ago

> a fresh biginteger package is necessary and would also vastly improve the bigint performance
>
> FWIW, I thought along similar lines and even made a start but ran out of time/steam.

Fabrice has a few in his treasure chest; let's wait a bit and get one when he is ready.