Closed: Redsandro closed this issue 7 years ago.
I am not keen to take on a dependency. The current hash function can be optimized a bit more or substituted for a faster, smaller alternative, but not as an external dependency.
Keeping the code size small is very important. I think if you want to use some features from mingo with a very optimized implementation, you can roll your own custom operator. If your dataset is very large and you aren't in the browser, then you can just use MongoDB directly by way of the mongoimport command.
I am quite curious to know how you came to these numbers. hashCode is generally considered good enough (in both speed and collision resistance) for simple cases.
murmurhash3 (pure javascript) is about 140% as fast as hashCode. cityhash (c++ node.js module) is about 280% as fast as hashCode
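For reference, the 31-based hashCode being compared here can be sketched in JavaScript like this (a minimal sketch of the classic Java-style string hash; mingo's own variant appears later in the thread):

```javascript
// Classic Java-style string hash: h = c0*31^(n-1) + c1*31^(n-2) + ... + c(n-1),
// truncated to a 32-bit signed integer after every step.
function hashCode(str) {
  var hash = 0;
  for (var i = 0; i < str.length; i++) {
    hash = ((hash << 5) - hash) + str.charCodeAt(i); // hash * 31 + char
    hash |= 0; // force back into 32-bit signed range
  }
  return hash;
}

console.log(hashCode(""));      // 0
console.log(hashCode("hello")); // 99162322, same as Java's "hello".hashCode()
```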
@kofrasa said:
I am not keen to take on a dependency. The current hash function can be optimized a bit more or substituted for a faster smaller alternative, but not as an external dependency.
Ok, I can understand that. We differ in perspective a bit and that's fine. Is making the hash function extensible an interesting alternative? Everything would work as it works now, except you could optionally override the function like this:
mingo.setup({
  key: '_id',
  hashFn: function (val) {
    return murmurHash3.x86.hash32(val, 22);
  }
});
mingo could be used as-is, or with a JS library for browser implementations, or with a C++ library for node.js implementations. Everyone wins.
I am quite curious to know how you came to these numbers.
Honestly, I remember from previous research when I wanted to optimize hashing for object comparisons within large arrays of objects. I think I can find something when I google my name.
-update-
Here is a jsperf comparison I did in 2015:
string-hash (removed from screenshot for brevity) is the clear winner, but seemed very unreliable because of hash collisions. This was for JavaScript solutions; later on I went on to investigate native solutions.
I found this article with the following results:
Speed depends on input size, but for 2 to 32 characters, cityHash is 200 to 500% faster.
Note that cityHash is from Google, and MurmurHash3 got someone hired at Google (or the other way around).
-update-
This article on Reddit might be relevant to your interests.
https://www.reddit.com/r/programming/comments/ozodk/cityhash_new_hash_function_by_google_faster_than/
-update-
What is also interesting is that native functions like cityHash might make their way into JavaScript when WebAssembly makes its way into the mainstream in a year or two.
Thanks for the benchmarks.
Now that we have numbers, it makes sense to take the fastest and relatively smaller algorithm.
string-hash in particular was up for consideration when I chose the hash function, but I ran no benchmark. From the data, the speed is uncontested; however, we don't know how collision-resistant it is. This may have influenced my choice at the time, but I do not remember.
Exposing an interface to override the hash function will merely create a contract that must be supported in the future. I would argue that it is not necessary. It is an implementation detail, not a hard requirement. I might as well compare the stringified results of the inputs directly, which will work fine and guarantee better correctness. The only purpose of the hash was to reduce the footprint of the data required to be in memory during certain operations.
In the future, those details could change in a manner transparent to users.
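For illustration, the stringify-and-compare alternative mentioned above could look something like this (a hypothetical sketch, not mingo's actual code; `encode` and `isEqual` are made-up names):

```javascript
// Hypothetical: compare values by their full serialized form instead of a hash.
// Correctness is guaranteed (no collisions possible), at the cost of keeping
// the complete strings in memory during the operation.
function encode(value) {
  // Note: assumes objects being compared have consistent property order,
  // since JSON.stringify serializes keys in insertion order.
  return JSON.stringify(value);
}

function isEqual(a, b) {
  return encode(a) === encode(b);
}

console.log(isEqual({ a: 1 }, { a: 1 })); // true
console.log(isEqual({ a: 1 }, { a: 2 })); // false
```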
Ok, gotcha.
mingo.setup({
  key: '_id',
  hashFn: val => val
});
:wink:
Am I correct to assume that your preferences for mingo make it unlikely that you'll accept PRs in the future related to small* performance optimizations or extensibility of internals?
*) 'small' as in trivial for smaller datasets (e.g. todo-list), but relevant for larger datasets (e.g. data analysis)
PRs are welcome (fixes, new operators, internal improvements, etc.), but like any dev work, they are subject to project constraints.
This project environment is biased towards the browser; that means it should have a small footprint, be reasonably fast, and provide the most useful query features.
There are numerous alternatives on the server side of things, including using an actual MongoDB database with access to full indexing support if needed. You mentioned LokiJS for example, which says on its site: "LokiJS is an in-memory database which prioritises performance over everything". They also have MongoDB API compatibility on their roadmap, so I presume it is just a matter of time.
Back to your benchmark, I am curious to see how the hashCode implementation performs if written differently.
current:
for (i = 0; i < testCount; i++) {
  // from http://goo.gl/lY5ww
  var s = 0;
  for (var ii = 0, l = testStrings[i].length; ii < l; ii++) {
    s = ((s << 5) - s) + testStrings[i][ii].charCodeAt(0);
    s = s & s;
  }
}
This is the slightly different version mingo uses:
for (i = 0; i < testCount; i++) {
  var h = 0, s = testStrings[i];
  var x = s.length;
  while (x) {
    h = (h << 5) - h + s.charCodeAt(--x);
    h |= 0;
  }
}
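Note that the two loops are not equivalent hash functions: the first walks the string front to back, while the mingo-style variant walks it back to front, so they generally produce different values for the same input. A runnable side-by-side sketch:

```javascript
// Forward traversal: h = c0*31^(n-1) + ... + c(n-1)*31^0
function hashForward(s) {
  var h = 0;
  for (var i = 0; i < s.length; i++) {
    h = ((h << 5) - h) + s.charCodeAt(i);
    h = h & h; // truncate to 32-bit signed
  }
  return h;
}

// Backward traversal (mingo-style): processes characters in reverse order,
// equivalent to hashing the reversed string with the forward version.
function hashBackward(s) {
  var h = 0, x = s.length;
  while (x) {
    h = ((h << 5) - h) + s.charCodeAt(--x);
    h |= 0;
  }
  return h;
}

console.log(hashForward("hello"));  // 99162322
console.log(hashBackward("hello")); // 105835282, same as hashForward("olleh")
```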
Would it be possible to add this to your benchmark suite?
Would it be possible to add this to your benchmark suite?
Yes, although jsperf has been returning 502 errors all day. I'll try when it's up again.
@kofrasa
jsperf is finally up again. I've added the test. I also found that I made some mistakes in the string-hash test in 2015; it's so fast because it breaks early.
Here are the new tests: https://jsperf.com/murmurhash3-vs-xxhash
And you're right. Your implementation of hashCode is faster.
Meanwhile, I did read that hashCode is likely to collide at least once when using more than 77,000 hashes. That's more than I expected. It's not so much about the algorithm as it is about the 32-bit output size.
With this information, I think it's now more serious (perhaps even crucial) to have the option to disable hashing of match fields in, for example, $lookup, or at least use 64-bit hashes (which are slower across the board).
MurmurHash can produce 64-bit and 128-bit hex digests.
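The 77,000 figure follows from the birthday bound: for a hash with m = 2^32 possible values, the probability of at least one collision reaches 50% at roughly √(2m·ln 2) ≈ 77,163 inputs. A quick check, using the standard birthday-bound approximation (not specific to any particular hash algorithm):

```javascript
// Birthday-bound approximation: the number of uniformly distributed n-bit
// hashes at which the probability of at least one collision reaches p.
function birthdayBound(bits, p) {
  var m = Math.pow(2, bits);
  return Math.sqrt(2 * m * Math.log(1 / (1 - p)));
}

console.log(Math.round(birthdayBound(32, 0.5))); // ≈ 77163
console.log(birthdayBound(64, 0.5));             // ≈ 5.06e9 — why 64 bits help
```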
I had doubts about this finding, so I decided to write my own benchmark. It follows closely what you have in jsPerf but can be run locally.
All the algorithms compare well; however, hashCode gives the best results for time and collisions. It is also well behaved, with less variation between runs for different inputs.
The benchmark excludes the actual first run for each algorithm, which is not shown in the output. See the script here
I suspect the data you have is either for a different algorithm or for a specific kind of input (e.g. serial numbers, names, zipcodes etc).
data size: 1000000
====================
hashcode:
{"time":[4887,5275,5381,5973],"collisions":109,"avg":5379}
djb2:
{"time":[4662,5390,5107,5286],"collisions":129,"avg":5111.25}
murmurhash3:
{"time":[5663,6046,5917,5966],"collisions":129,"avg":5898}
sdbm:
{"time":[6254,6391,6799,6457],"collisions":122,"avg":6475.25}
data size: 1000000
====================
hashcode:
{"time":[5294,5582,5611,5666],"collisions":115,"avg":5538.25}
djb2:
{"time":[5615,5625,5414,5574],"collisions":129,"avg":5557}
murmurhash3:
{"time":[8046,6030,5985,5916],"collisions":104,"avg":6494.25}
sdbm:
{"time":[6368,6521,6480,8156],"collisions":118,"avg":6881.25}
data size: 1000000
====================
hashcode:
{"time":[5399,5329,5419,5408],"collisions":122,"avg":5388.75}
djb2:
{"time":[5266,5348,5367,7305],"collisions":122,"avg":5821.5}
murmurhash3:
{"time":[5833,5890,5712,7132],"collisions":106,"avg":6141.75}
sdbm:
{"time":[6343,6336,6293,6416],"collisions":108,"avg":6347}
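A minimal version of such a collision-counting benchmark might look like this (a hypothetical sketch, not the actual linked script): generate distinct random strings, hash them all, and count duplicate hash values with a set.

```javascript
// Hypothetical harness: hash `count` distinct random strings and report
// elapsed time plus the number of duplicate hash values observed.
function benchmark(hashFn, count) {
  var inputs = [];
  for (var i = 0; i < count; i++) {
    // appending the index guarantees the inputs themselves are distinct
    inputs.push(Math.random().toString(36).slice(2) + i);
  }
  var seen = new Set();
  var collisions = 0;
  var start = Date.now();
  for (var j = 0; j < inputs.length; j++) {
    var h = hashFn(inputs[j]);
    if (seen.has(h)) collisions++;
    else seen.add(h);
  }
  return { time: Date.now() - start, collisions: collisions };
}

// 31-based hash as used elsewhere in the thread
function hashCode(s) {
  var h = 0;
  for (var i = 0; i < s.length; i++) {
    h = (((h << 5) - h) + s.charCodeAt(i)) | 0;
  }
  return h;
}

console.log(benchmark(hashCode, 100000)); // e.g. { time: ..., collisions: ... }
```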
After some tweaks and a few more runs, it turns out the XOR version of the djb2 algorithm provides better performance per collision.
[UPDATE]
hashCode still performs better.
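For reference, the two djb2 variants mentioned above differ only in how each character is mixed in; a sketch of the standard formulations, constrained to 32 bits as in the benchmark:

```javascript
// Classic djb2: h = h * 33 + c, seeded with 5381.
function djb2(s) {
  var h = 5381;
  for (var i = 0; i < s.length; i++) {
    h = ((h << 5) + h + s.charCodeAt(i)) | 0; // h * 33 + char
  }
  return h >>> 0; // return as unsigned 32-bit
}

// XOR variant: h = (h * 33) ^ c, same seed.
function djb2xor(s) {
  var h = 5381;
  for (var i = 0; i < s.length; i++) {
    h = (((h << 5) + h) ^ s.charCodeAt(i)) | 0;
  }
  return h >>> 0;
}

console.log(djb2("hello"));
console.log(djb2xor("hello"));
```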
Here is a fork that avoids hash collisions, at the cost of performance: https://github.com/BenjD90/mingo
Here is an example of a processing error that it avoids:
import mingo from 'mingo'
const collection = [
{
"codes": ["KNE_OC42-midas"]
},
{
"codes": ["KNE_OCS3-midas"]
}
];
const criteria = {
"codes": { "$in": ["KNE_OCS3-midas"] }
};
console.log(`---- START ----`);
let query = new mingo.Query(criteria)
const cursor = query.find(collection)
// prints every document, though only the 2nd should match
while (cursor.hasNext()) {
console.log(cursor.next())
}
console.log(`---- END ----`)
/* Output :
---- START ----
{ codes: [ 'KNE_OC42-midas' ] }
{ codes: [ 'KNE_OCS3-midas' ] }
---- END ----
*/
I'll try to make the PR soon, to allow users to provide their own hash function, and to document it.
Hi Ben,
I think your request sounds reasonable. Different use cases may need different hashing guarantees.
I will take a look at the PR soon.
Regards, Francis
Your hashing function looks like the hashCode algorithm. Is that correct?
Hypothetically: would you be willing to have any dependencies? Would you be willing to have different dependencies for the npm/node.js version?
murmurhash3 (pure javascript) is about 140% as fast as hashCode. cityhash (c++ node.js module) is about 280% as fast as hashCode.