We need plots showing the performance of our code as a function of the quantity of resources (nodes / CPUs) we throw at the problem.
In theory, the speed-up should be linear: as we add more nodes/CPUs, the same dataset should be processed proportionally faster (though make sure the number of partitions of the RDD increases accordingly, otherwise the extra cores have no tasks to run).
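
One way to turn raw timings into the numbers worth plotting is sketched below. It assumes we record a wall-clock time per worker count for the same dataset; the helper name `speedup_table` and the timings in `example` are hypothetical placeholders, not measurements from our code. Speed-up is taken relative to the smallest run, and efficiency is the fraction of the ideal linear speed-up actually achieved.

```python
def speedup_table(timings):
    """Return (workers, speedup, efficiency) rows relative to the smallest run.

    timings maps a worker count (nodes or CPUs) to wall-clock seconds
    for the same dataset.
    """
    if not timings:
        raise ValueError("timings must not be empty")
    base_workers = min(timings)
    base_time = timings[base_workers]
    rows = []
    for workers in sorted(timings):
        speedup = base_time / timings[workers]
        ideal = workers / base_workers  # linear speed-up would reach this value
        rows.append((workers, speedup, speedup / ideal))
    return rows


# Made-up placeholder timings in seconds, for illustration only.
example = {1: 100.0, 2: 50.0, 4: 40.0}
for workers, speedup, efficiency in speedup_table(example):
    print(f"{workers:>3} workers: speed-up {speedup:.2f}x, efficiency {efficiency:.0%}")
```

Plotting the speed-up column against the worker count, next to the ideal line, makes any departure from linear scaling easy to see.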