Data size vs speed up experiments

Using the same cluster resources, we need a graph showing how the runtime of our algorithm changes as the size of the underlying dataset increases. In theory, runtime should grow roughly linearly with data size.
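A minimal sketch of how the timing side of this experiment could be scripted, assuming a hypothetical callable `run_algorithm(n)` that runs our job on a dataset of size `n` (that name, `time_run`, `fit_line`, and `scaling_experiment` are all placeholders, not existing code in this repo). It records the best wall-clock time per size and fits a straight line, so an R^2 close to 1 would support the linear-slowdown expectation.

```python
import time


def time_run(fn, *args, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` calls to fn(*args)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best


def scaling_experiment(run_algorithm, sizes, repeats=3):
    """Time the (hypothetical) run_algorithm at each size; return (size, seconds) pairs."""
    return [(n, time_run(run_algorithm, n, repeats=repeats)) for n in sizes]


def fit_line(xs, ys):
    """Least-squares fit y = slope * x + intercept; returns (slope, intercept, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    # If all timings are identical, treat the (flat) fit as perfect.
    r_squared = 1.0 - ss_res / ss_tot if ss_tot > 0 else 1.0
    return slope, intercept, r_squared
```

The `(size, seconds)` pairs can be fed straight into whatever plotting tool we use for the graph. Taking the best of several repeats is one choice for reducing noise from cluster warm-up; the median would be a reasonable alternative.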

We could also run a sub-experiment that varies the number of RDD partitions and measures performance while holding the cluster resources and the data size constant.
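The partition sub-experiment could be driven by a similar harness. This is only a sketch: `run_with_partitions(p)` is a hypothetical callable that runs the job on the fixed dataset with `p` partitions, and `sweep_partitions` and `fastest` are illustrative names.

```python
import time


def sweep_partitions(run_with_partitions, partition_counts, repeats=3):
    """Time the (hypothetical) run_with_partitions(p) for each p.

    Returns a dict mapping partition count to the median wall-clock time
    in seconds (the upper median when `repeats` is even).
    """
    results = {}
    for p in partition_counts:
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            run_with_partitions(p)
            timings.append(time.perf_counter() - start)
        timings.sort()
        results[p] = timings[len(timings) // 2]
    return results


def fastest(results):
    """Return the partition count with the lowest recorded time."""
    return min(results, key=results.get)
```

With PySpark, `run_with_partitions` could, for instance, call `rdd.repartition(p)` on the input RDD and then run the algorithm through to an action such as `count()`, so that the timing covers real work and not just the lazy transformation.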

quinngroup / dr1dl-pyspark, issue #57