StephanSchmidt opened this issue 4 years ago:

I've been playing with HyperMinHash and get around 0.5% error at a cardinality of 10,000 when storing SHA256 hashes. If I add 100,000 hashes, the error increases to 10%.
Is there a way to reduce the error for larger cardinalities?
Hm, increasing the number of registers (https://github.com/ocadaruma/redis-hyperminhash/blob/master/src/hyperminhash/mod.rs#L7) could work, but since HyperMinHash is a probabilistic algorithm, the possibility of such a large estimation error still remains.
As an extreme example, we can construct an adversarial data set such that all elements hash to the same PatLen (https://github.com/ocadaruma/redis-hyperminhash/blob/master/src/hyperminhash/sketch.rs#L200).
In that case, the estimated cardinality will be 1 regardless of the true cardinality.
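To make that concrete, here is a minimal sketch of the adversarial case (the precision `P = 14` and the bit layout are assumptions for illustration, not the module's actual code):

```rust
// Sketches in the HLL family store, per register, the longest run of
// leading zeros ("PatLen") seen in the non-index bits of each hash.
const P: u32 = 14; // assumed precision: 2^14 registers

fn register_index(hash: u64) -> usize {
    (hash >> (64 - P)) as usize // top P bits select the register
}

fn pat_len(hash: u64) -> u32 {
    // Leading zeros of the remaining 64 - P bits, plus one.
    (hash << P).leading_zeros().min(64 - P) + 1
}

fn main() {
    // Four distinct hashes: different register indices (top bits) but
    // identical low bits, hence the same PatLen everywhere. An adversary
    // feeding only such values fills every register with one value, so
    // the estimate no longer tracks the true cardinality.
    for i in 0u64..4 {
        let h = (i << (64 - P)) | (1u64 << 41);
        println!("idx={} pat_len={}", register_index(h), pat_len(h));
    }
}
```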
By the way, I ran benchmark/accuracy.rb with the cardinality changed to 100,000, and HyperMinHash's precision looks almost the same as Redis's built-in HLL:
============== HyperMinHash ==============
97974- : ****
98172- : ****
98370- : ***********
98569- : ***********
98767- : **********************
98965- : ********************************
99164- : *******************************
99362- : ************************************
99560- : ******************************************
99759- : ***********************************************
99957- : *****************************************************
100155- : ************************************************************
100354- : ******************************************
100552- : ***********************************************
100750- : *************************
100949- : ***************
101147- : ************
101345- : ***
101544- : *
101742- : *
101941- : *
============== built-in HyperLogLog ==============
97890- : **
98086- : ***
98282- : *
98478- : **********
98674- : **************
98870- : *************************
99066- : ****************************
99263- : *****************************************
99459- : **************************************
99655- : *******************************************************
99851- : *************************************************************************
100047- : **********************************************
100243- : ******************************************
100439- : ***************************************
100636- : **************************
100832- : ************************
101028- : *************
101224- : **********
101420- : *****
101616- : ****
101813- : *
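For what it's worth, the spread of both histograms is consistent with standard HyperLogLog theory: with m registers, the relative standard error of the estimate is roughly 1.04 / sqrt(m). A quick check, assuming 2^14 registers (an assumption; see the mod.rs line linked above for the actual value):

```rust
// Sanity check against standard HLL theory: relative standard error
// of an HLL-style estimate is approximately 1.04 / sqrt(m).
fn main() {
    let m = (1u32 << 14) as f64; // 16384 registers (assumed)
    let rse = 1.04 / m.sqrt();
    println!("expected relative error ~ {:.2}%", rse * 100.0); // ~0.81%
    // At a true cardinality of 100,000 that is roughly +/-800 for one
    // standard deviation, matching the width of both histograms above.
}
```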
Thank you, I'm new to this and my understanding isn't very deep.
Could it be that my error gets worse because I insert SHA256 strings?
I do think it might be a bug in my code. Using a short Java program with LiveRamp HyperMinHash and 1M entries, adding the numbers from 1 to 1M results in:

SHA256 Bytes    995637   -0.4363%
SHA256 String   997009   -0.2991%
Int            1006880   +0.6880%

So there is no degradation of the error at 1M entries.
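The third column here is the signed relative error in percent; a minimal sketch of that calculation (the output format is my assumption, not LiveRamp's actual code):

```rust
// Reproduces the relative-error column above, assuming it is
// 100 * (estimate - true) / true for a true cardinality of 1M.
fn main() {
    let true_card = 1_000_000.0_f64;
    let rows = [("SHA256 Bytes", 995_637.0),
                ("SHA256 String", 997_009.0),
                ("Int", 1_006_880.0)];
    for (label, est) in rows {
        let err_pct = 100.0 * (est - true_card) / true_card;
        println!("{:13} {:>9} {:+.4}%", label, est, err_pct);
    }
}
```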
Oh, I see. By the way, there's a slight difference in the estimation algorithms: LiveRamp's HyperMinHash is based on LogLog-Beta, while redis-hyperminhash's is based on https://arxiv.org/abs/1702.01284 .
Though I don't expect this to cause such a large accuracy difference.
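For context, both are refinements of the classic HLL raw estimate and differ mainly in how they correct its bias at small and large cardinalities. A sketch of that shared baseline (not either implementation's actual code):

```rust
// Classic HLL raw estimate: alpha * m^2 over the harmonic sum of
// 2^register. LogLog-Beta and the paper's estimators start from this
// and add bias corrections, which this sketch omits.
fn raw_estimate(registers: &[u8]) -> f64 {
    let m = registers.len() as f64;
    let alpha = 0.7213 / (1.0 + 1.079 / m); // standard constant for large m
    let z: f64 = registers.iter().map(|&r| 2f64.powi(-(r as i32))).sum();
    alpha * m * m / z
}

fn main() {
    // Toy input: 2^14 registers that each recorded a run length of 1.
    let regs = vec![1u8; 1 << 14];
    println!("raw estimate ~ {:.0}", raw_estimate(&regs)); // ~23634
}
```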
Ok, I'll find the problem in my code :-) Thanks for writing this code!
I might want to use this in a SaaS startup for analytics. Would you be confident enough in your code for me to use it that way?
Oh, glad to hear that!
I think my code would work at a certain level, as I have run benchmark and accuracy tests so far, but redis-hyperminhash is basically a hobby project whose purpose is to study how to write a Redis module in Rust. Please note that my code hasn't been tested in any production environment.
Thanks, will take your points into account.