ocadaruma / redis-hyperminhash

A HyperMinHash implementation for Redis.
Apache License 2.0
17 stars 0 forks source link

Increasing capacity #1

Open StephanSchmidt opened 4 years ago

StephanSchmidt commented 4 years ago

I've been playing with HyperMinHash and get around 0.5% error with an cardinality of 10.000 storing sha256 hashes. If I add 100.000 hashes the error increases to 10%.

Is there a way to reduce the error for larger cardinalities?

ocadaruma commented 4 years ago

Hm, increasing number of registers (https://github.com/ocadaruma/redis-hyperminhash/blob/master/src/hyperminhash/mod.rs#L7) could work but as hyperminhash is a probabilistic algorithm, possibility of such large estimation error still remains.

In an extreme example, we can construct an adversarial data set such that all elements calculate to same PatLen (https://github.com/ocadaruma/redis-hyperminhash/blob/master/src/hyperminhash/sketch.rs#L200)

In such case, estimated cardinality will be 1 regardless true cardinality.

By the way, I executed benchmark/accuracy.rb with changing cardinality to 100,000 then it looks like hyperminhash's precision is almost same as Redis's built-in HLL.

============== HyperMinHash ==============
97974- : ****
98172- : ****
98370- : ***********
98569- : ***********
98767- : **********************
98965- : ********************************
99164- : *******************************
99362- : ************************************
99560- : ******************************************
99759- : ***********************************************
99957- : *****************************************************
100155- : ************************************************************
100354- : ******************************************
100552- : ***********************************************
100750- : *************************
100949- : ***************
101147- : ************
101345- : ***
101544- : *
101742- : *
101941- : *
============== built-in HyperLogLog ==============
97890- : **
98086- : ***
98282- : *
98478- : **********
98674- : **************
98870- : *************************
99066- : ****************************
99263- : *****************************************
99459- : **************************************
99655- : *******************************************************
99851- : *************************************************************************
100047- : **********************************************
100243- : ******************************************
100439- : ***************************************
100636- : **************************
100832- : ************************
101028- : *************
101224- : **********
101420- : *****
101616- : ****
101813- : *
StephanSchmidt commented 4 years ago

Thank you, I'm new to this and my understanding isn't very deep.

May it be the case that my error gets worse because I insert SHA256 strings?

StephanSchmidt commented 4 years ago

I do think it might be a bug in my code, using as short Java program with LiveRamp HyperMinHash and 1M entries
adding the numbers from 1 to 1M results in:

SHA256 Bytes 995637 -0.43629999999999997

SHA256 String 997009 -0.29910000000000003

Int 1006880 0.688

So no degeneration of the error for 1M entries.

ocadaruma commented 4 years ago

Oh, I see. By the way, there's a slight difference in estimation algorithm LiveRamp's HyperMinHash uses. LiveRamp's one is based on LogLog beta, but redis-hyperminash's one is based on https://arxiv.org/abs/1702.01284 .

Though I don't expect this causes such high accuracy difference.

StephanSchmidt commented 4 years ago

Ok, I'll find the problem in my code :-) Thanks for writing this code!

I might want to use this in a SaaS startup for analytics, would you be confident in your code and me to use this?

ocadaruma commented 4 years ago

Oh, glad to hear that!

I think my code would work at a certain level as I performed benchmark and accuracy tests so far, but basically redis-hyperminhash is a hobby project which purposing to study how to write Redis-module in Rust. And please note that my code isn't tested on any production environment.

StephanSchmidt commented 4 years ago

Thanks, will take your points into account.