Open aalexandrov opened 8 years ago
I've written a small benchmark (how meta) to highlight the problem.
Here are the results:
##########################################################################################
# Distributions inverse CDF benchmark
##########################################################################################
Cardinality Uniform Binomial Normal Zipf Pareto
------------------------------------------------------------------------------------------
1000 2009.893ms 49049.008ms 1557.039ms 830303.17ms 14294.921ms
10000 1293.16ms 46005.583ms 1168.293ms 833500.448ms 14307.119ms
100000 1299.423ms 46028.881ms 1156.179ms 824108.238ms 14183.577ms
1000000 1276.722ms 45786.145ms 1127.982ms 831676.369ms 14477.197ms
10000000 1264.654ms 46654.514ms 1142.256ms 835947.764ms 14302.145ms
100000000 1320.517ms 47075.169ms 1165.2ms 835370.14ms 14216.341ms
Uniform
and Normal
distributions are fastest with roughly the same values, Pareto
is ~ factor 10 slower, followed Binomial
(~ 40x slower), and Zipf
(~ 650x slower).
I therefore propose to stick with the continuous probabilities and suitably discretize the inverse CDF values in order to approximate their discrete counterparts. The relationship between these pairs of distributions is explained in [1,2].
[1] Relationship between Binomial and Normal Distributions [2] Zipf, Power-laws, and Pareto - a ranking tutorial
The
commons-math3
distributions used in the reference data generator in the archetypes are really slow.During a local test of an experiment suite on which I am working with @ggevay I am observing the following numbers for generating
dataset.A
with key cardinality 100000:Uniform
key distribution, the job takes ~ 5 secondsBinomial
key distribution, the job takes ~ 25 secondsZipfian
key distribution, the datagen job exceeded the allowed limit of 600 seconds.The fix should be pushed to the peel-wordcount repository (see stratosphere/peel-wordcount#1).