kofrasa / mingo

MongoDB query language for in-memory objects
MIT License
949 stars · 82 forks

Thoughts on hashing optimizations #62

Closed · Redsandro closed this 7 years ago

Redsandro commented 7 years ago

Your hashing function looks like the hashCode algorithm. Is that correct?

Hypothetically

Would you be willing to have any dependencies? Would you be willing to have different dependencies for the npm/node.js version?
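For context, the Java-style hashCode being referred to is a simple polynomial hash with multiplier 31, kept in the signed 32-bit range. A minimal JavaScript sketch (an illustration of the algorithm, not mingo's actual source):

```javascript
// Classic Java-style hashCode: h = 31*h + charCode, coerced to signed 32-bit.
// (h << 5) - h is an equivalent, cheaper form of 31 * h.
function hashCode(str) {
  let h = 0;
  for (let i = 0; i < str.length; i++) {
    h = ((h << 5) - h + str.charCodeAt(i)) | 0;
  }
  return h;
}
```

For example, `hashCode("ab")` gives 3105, matching Java's `"ab".hashCode()`.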

kofrasa commented 7 years ago

I am not keen to take on a dependency. The current hash function can be optimized a bit more or substituted for a faster smaller alternative, but not as an external dependency.

Keeping the code size small is very important. If you want to use some features from mingo with a more optimal implementation, you can roll your own custom operator. If your dataset is very large and you are not in the browser, then you can just use MongoDB directly by way of the mongoimport command.

kofrasa commented 7 years ago

I am quite curious to know how you came to these numbers. hashCode is generally considered good enough (in both speed and collision resistance) for simple cases.

murmurhash3 (pure JavaScript) is about 140% as fast as hashCode. cityhash (C++ Node.js module) is about 280% as fast as hashCode.

Redsandro commented 7 years ago

@kofrasa said:

I am not keen to take on a dependency. The current hash function can be optimized a bit more or substituted for a faster smaller alternative, but not as an external dependency.

Ok, I can understand that. We differ in perspective a bit and that's fine. Is making the hash function extensible an interesting alternative? Everything would work as it works now, except you could optionally override the function like this:

mingo.setup({
    key    : '_id',
    hashFn : function(val) {
        return murmurHash3.x86.hash32(val, 22)
    }
});

mingo could be used as is, or with a JS library for browser implementations, or with a C++ library for Node.js implementations. Everyone wins.


I am quite curious to know how you came to these numbers.

Honestly, I remember this from previous research, when I wanted to optimize hashing for object comparisons within large arrays of objects. I think I can find something if I google my name.

-update-

Here is a jsperf comparison I did in 2015:

[image: jsperf benchmark comparison, 2015]

This was for javascript solutions, but later on I went to investigate native solutions.

I found this article with the following results:

[image: benchmark results from the linked article]

Speed depends on input size, but for 2 to 32 characters, cityHash is 200 to 500% faster.

Note that cityHash is from Google and MurmurHash3 got someone hired at Google (or the other way around).

-update-

This article on Reddit might be relevant to your interests.
https://www.reddit.com/r/programming/comments/ozodk/cityhash_new_hash_function_by_google_faster_than/

-update-

What is also interesting is that native functions like cityHash might make their way into JavaScript when WebAssembly goes mainstream in a year or two.

kofrasa commented 7 years ago

Thanks for the benchmarks.

While we have numbers, it makes sense to take the fastest and relatively smaller algorithm. string-hash in particular was up for consideration when I chose the hash function, but I ran no benchmark. From the data its speed is uncontested; however, we don't know how collision resistant it is. That may have influenced my choice at the time, but I do not remember.

Exposing an interface to override the hash function would merely create a contract that must be supported in the future. I would argue that is not necessary. It is an implementation detail, not a hard requirement. I might as well compare the stringified inputs directly, which works fine and guarantees better correctness. The only purpose of the hash was to reduce the amount of data required to be in memory during certain operations.

In the future, those details could change in a manner transparent to users.
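The stringify-and-compare alternative mentioned above can be sketched as follows (an illustration, not mingo's actual internals): instead of a 32-bit hash, deduplicate or group by the full serialized value, trading memory for guaranteed correctness.

```javascript
// Deduplicate objects by their full JSON serialization instead of a 32-bit hash.
// No collisions are possible, at the cost of holding full strings in memory.
function uniqueBy(items, serialize = JSON.stringify) {
  const seen = new Set();
  const out = [];
  for (const item of items) {
    const key = serialize(item);
    if (!seen.has(key)) {
      seen.add(key);
      out.push(item);
    }
  }
  return out;
}
```

Note that JSON.stringify is key-order sensitive, so a real implementation would serialize objects canonically before comparing.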

Redsandro commented 7 years ago

Ok, gotcha.

mingo.setup({
    key    : '_id',
    hashFn : val => val
});

:wink:

Am I correct to assume that your preferences for mingo make it unlikely that you'll accept PRs in the future related to small* performance optimizations or extensibility of internals?

*) 'small' as in trivial for smaller datasets (e.g. todo-list), but relevant for larger datasets (e.g. data analysis)

kofrasa commented 7 years ago

PRs are welcome (fixes, new operators, internal improvements, etc.), but like any dev work they are subject to project constraints.

This project is biased towards the browser environment; that means it should have a small footprint, be reasonably fast, and provide the most useful query features.

There are numerous alternatives on the server side, including using an actual MongoDB database with access to full indexing support if needed. You mentioned LokiJS, for example, which says on its site:

LokiJS is an in-memory database which prioritises performance over everything

They also have MongoDB API compatibility on their roadmap, so I presume it is just a matter of time.

Back to your benchmark, I am curious to see how the hashCode implementation performs if written differently.

current

for (i = 0; i < testCount; i++) {
  //from http://goo.gl/lY5ww
  var s = 0;
  for (var ii = 0, l = testStrings[i].length; ii < l; ii++) {
    s = ((s << 5) - s) + testStrings[i][ii].charCodeAt(0);
    s = s & s;
  }
}

This is the slightly different version mingo uses (note the hash must be assigned back to h on each iteration):

for (i = 0; i < testCount; i++) {
  var h = 0, s = testStrings[i];
  var x = s.length;
  while (x) {
    h = (h << 5) - h + s.charCodeAt(--x);
    h |= 0;
  }
}

Would it be possible to add this to your benchmark suite?
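For anyone who wants to try this locally without jsperf, a minimal harness along these lines times both variants (a sketch; absolute numbers will vary by machine and engine):

```javascript
// Forward variant: iterates from the first character.
function hashForward(s) {
  let h = 0;
  for (let i = 0; i < s.length; i++) {
    h = ((h << 5) - h + s.charCodeAt(i)) | 0;
  }
  return h;
}

// mingo's variant: iterates from the last character.
function hashReverse(s) {
  let h = 0, x = s.length;
  while (x) {
    h = ((h << 5) - h + s.charCodeAt(--x)) | 0;
  }
  return h;
}

// Time a hash function over a set of inputs; XOR-accumulate results
// so the work cannot be optimized away as dead code.
function bench(fn, inputs) {
  const start = Date.now();
  let acc = 0;
  for (const s of inputs) acc ^= fn(s);
  return { ms: Date.now() - start, acc };
}

const inputs = Array.from({ length: 100000 }, (_, i) => "key-" + i);
const fwd = bench(hashForward, inputs);
const rev = bench(hashReverse, inputs);
```

Because the two variants consume characters in opposite order, they produce different hash values for the same string; only the timings are comparable.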

Redsandro commented 7 years ago

Would it be possible to add this to your benchmark suite?

Yes, although jsperf has been 502 all day. I'll try when it's up again.

Redsandro commented 7 years ago

@kofrasa

jsperf is finally up again. I've added the test. I also found that I made some mistakes in the string-hash test in 2015; it's so fast because it breaks early.

Here are the new tests: https://jsperf.com/murmurhash3-vs-xxhash

And you're right. Your implementation of hashCode is faster.

[image: updated jsperf benchmark results]

Meanwhile, I read that hashCode is likely to collide at least once when using more than 77,000 hashes.

[image: collision probability figures]

That's more than I expected. It's not so much about the algorithm as about the 32-bit output size.

[image: 32-bit hash collision chart]

With this information, I think it's more serious now. It is perhaps even crucial to (have the option to) disable hashing of match fields in, for example, $lookup, or at least use 64-bit hashes (which are slower across the board).

MurmurHash can do 64 bit and 128 bit hex.
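The ~77,000 figure follows from the birthday bound: for an ideal 32-bit hash, the probability of at least one collision among n values is approximately 1 − exp(−n²/(2·2³²)). A quick sketch of the calculation:

```javascript
// Birthday-bound approximation for an ideal b-bit hash:
// P(at least one collision among n values) ≈ 1 - exp(-n^2 / (2 * 2^b))
function collisionProbability(n, bits = 32) {
  const space = Math.pow(2, bits);
  return 1 - Math.exp(-(n * n) / (2 * space));
}

// At roughly 77,000 values, a 32-bit hash crosses 50% collision probability;
// a 64-bit hash at the same count is still effectively collision-free.
const p32 = collisionProbability(77163);      // ≈ 0.5
const p64 = collisionProbability(77163, 64);  // vanishingly small
```

This is why moving to 64-bit output helps far more than switching between 32-bit algorithms of similar quality.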

kofrasa commented 7 years ago

I had doubts about this finding, so I decided to write my own benchmark. It follows closely what you have in jsperf but can be run locally.

All the algorithms compare well; however, hashCode gives the most preferred results for time and collisions. It is also well behaved, with less variation between runs for different inputs.

The benchmark excludes the actual first run for each algorithm, which is not shown in the output. See the script here

I suspect the data you have is either for a different algorithm or for a specific kind of input (e.g. serial numbers, names, zipcodes etc).

first

data size: 1000000
====================
hashcode:
{"time":[4887,5275,5381,5973],"collisions":109,"avg":5379}

djb2:
{"time":[4662,5390,5107,5286],"collisions":129,"avg":5111.25}

murmurhash3:
{"time":[5663,6046,5917,5966],"collisions":129,"avg":5898}

sdbm:
{"time":[6254,6391,6799,6457],"collisions":122,"avg":6475.25}

second

data size: 1000000
====================
hashcode:
{"time":[5294,5582,5611,5666],"collisions":115,"avg":5538.25}

djb2:
{"time":[5615,5625,5414,5574],"collisions":129,"avg":5557}

murmurhash3:
{"time":[8046,6030,5985,5916],"collisions":104,"avg":6494.25}

sdbm:
{"time":[6368,6521,6480,8156],"collisions":118,"avg":6881.25}

third

data size: 1000000
====================
hashcode:
{"time":[5399,5329,5419,5408],"collisions":122,"avg":5388.75}

djb2:
{"time":[5266,5348,5367,7305],"collisions":122,"avg":5821.5}

murmurhash3:
{"time":[5833,5890,5712,7132],"collisions":106,"avg":6141.75}

sdbm:
{"time":[6343,6336,6293,6416],"collisions":108,"avg":6347}
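The collision counts above can be reproduced locally with a simple counter (a sketch in the spirit of the benchmark, not the actual script): hash a large set of distinct inputs and count how many map to an already-seen hash value.

```javascript
// Java-style hashCode, as discussed earlier in the thread.
function hashCode(s) {
  let h = 0;
  for (let i = 0; i < s.length; i++) {
    h = ((h << 5) - h + s.charCodeAt(i)) | 0;
  }
  return h;
}

// Count collisions: distinct inputs whose hash was already produced
// by a different input.
function countCollisions(hashFn, inputs) {
  const seen = new Set();
  let collisions = 0;
  for (const s of inputs) {
    const h = hashFn(s);
    if (seen.has(h)) collisions++;
    else seen.add(h);
  }
  return collisions;
}

const inputs = Array.from({ length: 200000 }, (_, i) => "record:" + i);
const collisions = countCollisions(hashCode, inputs);
```

Collision counts depend heavily on the shape of the inputs (serial strings vs. random text), which may explain differences between benchmark runs.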
kofrasa commented 7 years ago

After some tweaks and a few more runs, it turns out the XOR version of the djb2 algorithm provides better performance per collision. [UPDATE] hashCode still performs better

BenjD90 commented 3 years ago

Here is a fork that avoids hash collisions, at the cost of performance: https://github.com/BenjD90/mingo

Here is an example of a processing error that it avoids:

import mingo from 'mingo'

const collection = [
  {
    "codes": ["KNE_OC42-midas"]
  },
  {
    "codes": ["KNE_OCS3-midas"]
  }
];

const criteria = {
  "codes": { "$in": ["KNE_OCS3-midas"] }
};

console.log(`---- START ----`);

let query = new mingo.Query(criteria)
const cursor = query.find(collection)

// prints all matches; only the 2nd document should match
while (cursor.hasNext()) {
  console.log(cursor.next())
}

console.log(`---- END ----`)

/* Output :
---- START ----
{ codes: [ 'KNE_OC42-midas' ] }
{ codes: [ 'KNE_OCS3-midas' ] }
---- END ----
*/

I'll try to make the PR soon, to allow users to provide their own hash function, and document it.
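The override being proposed could look something like the sketch below. The option name is illustrative only, not a confirmed mingo API; the "hash" here is simply the full serialized value, which cannot produce false matches.

```javascript
// Hypothetical collision-free "hash": keep the full serialized value.
// Slower and more memory-hungry than a 32-bit hash, but exact.
const safeHash = (value) => JSON.stringify(value);

// Hypothetical usage, assuming an options hook like the one discussed
// in this thread (option name is an assumption):
// const query = new mingo.Query(criteria, { hashFunction: safeHash });
```

With such a hook, users with small datasets keep the fast default, while users who cannot tolerate collisions opt into an exact comparison.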

kofrasa commented 3 years ago

Hi Ben,

I think your request sounds reasonable. Different use cases may need different hashing guarantees.

I will take a look at the PR soon.

Regards, Francis
