peelframework / peel

Peel is a framework that helps you to define, execute, analyze, and share experiments for distributed systems and algorithms.
http://peel-framework.org
Apache License 2.0
27 stars 32 forks source link

Distribution sampling for Zipf and Binomial is too slow #82

Open aalexandrov opened 8 years ago

aalexandrov commented 8 years ago

The commons-math3 distributions used in the reference data generator in the archetypes are really slow.

During a local test of an experiment suite on which I am working with @ggevay I am observing the following numbers for generating dataset.A with key cardinality 100000:

The fix should be pushed to the peel-wordcount repository (see stratosphere/peel-wordcount#1).

aalexandrov commented 8 years ago

I've written a small benchmark (how meta) to highlight the problem.

Here are the results:

##########################################################################################
# Distributions inverse CDF benchmark
##########################################################################################

Cardinality    Uniform        Binomial       Normal         Zipf           Pareto         
------------------------------------------------------------------------------------------
1000           2009.893ms     49049.008ms    1557.039ms     830303.17ms    14294.921ms    
10000          1293.16ms      46005.583ms    1168.293ms     833500.448ms   14307.119ms    
100000         1299.423ms     46028.881ms    1156.179ms     824108.238ms   14183.577ms    
1000000        1276.722ms     45786.145ms    1127.982ms     831676.369ms   14477.197ms    
10000000       1264.654ms     46654.514ms    1142.256ms     835947.764ms   14302.145ms    
100000000      1320.517ms     47075.169ms    1165.2ms       835370.14ms    14216.341ms    

Uniform and Normal distributions are fastest with roughly the same values, Pareto is ~ factor 10 slower, followed Binomial (~ 40x slower), and Zipf (~ 650x slower).

I therefore propose to stick with the continuous probabilities and suitably discretize the inverse CDF values in order to approximate their discrete counterparts. The relationship between these pairs of distributions is explained in [1,2].

[1] Relationship between Binomial and Normal Distributions [2] Zipf, Power-laws, and Pareto - a ranking tutorial