Initial considerations involve calculating the statistics as efficiently as possible. Some different approaches off the top of my head include:
- calculating statistics using UDAFs. For exact calculation of the histogram and standard deviation this would appear to require at least two passes over the data: for the histogram, one pass to find the min/max that fix the bin boundaries and one to count values per bin; for a naive exact standard deviation, one pass for the mean and one for the squared deviations. A sketch of the two-pass histogram follows this list.
- leveraging existing Hive/SQL functions (see the SQL example after this list)
- exploring/using approximate methods for histograms, std dev, etc. on large data (see the approxQuantile example after this list)
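As a minimal sketch of the two-pass idea, here is the exact histogram written against the DataFrame API rather than as a hand-rolled UDAF. The `exactHistogram` helper, the column name, and the default bin count are illustrative assumptions, not part of the original design:

```scala
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions._

// Two-pass exact histogram: pass 1 finds the value range, pass 2 bins and
// counts. Assumes a non-empty numeric column with min < max.
def exactHistogram(df: DataFrame, colName: String, numBins: Int = 10): Array[(Double, Long)] = {
  // Pass 1: min/max fix the bin boundaries.
  val Row(mn: Double, mx: Double) =
    df.agg(min(col(colName)).cast("double"), max(col(colName)).cast("double")).head()
  val width = (mx - mn) / numBins

  // Pass 2: map each value to a bin index and count per bin; `least` clamps
  // the maximum value into the last bin.
  df.select(least(floor((col(colName) - mn) / width), lit(numBins - 1)).cast("int").as("bin"))
    .groupBy("bin").count()
    .collect()
    .map(r => (mn + r.getInt(0) * width, r.getLong(1))) // (bin left edge, count)
    .sortBy(_._1)
}
```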
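For the Hive/SQL route, Spark SQL already ships the relevant aggregates (`count`, `avg`, `stddev`, `percentile_approx`), so much of the profile comes out of a single aggregation pass. A sketch, assuming a hypothetical `events` table with a numeric `value` column:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("column-stats").getOrCreate()

// One aggregation pass using built-in SQL functions; `events` and `value`
// are placeholder names.
spark.sql("""
  SELECT
    count(value)  AS n,
    avg(value)    AS mean,
    stddev(value) AS std_dev,
    min(value)    AS min_val,
    max(value)    AS max_val,
    percentile_approx(value, array(0.25, 0.5, 0.75)) AS quartiles
  FROM events
""").show()
```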
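For the approximate route, the DataFrame API exposes single-pass approximate quantiles (a variant of the Greenwald-Khanna algorithm) via `df.stat.approxQuantile`; a sketch, assuming a DataFrame `df` with a numeric `value` column:

```scala
// Single-pass approximate deciles; relativeError = 0.01 bounds the rank
// error at 1% of the row count, trading accuracy for speed and memory.
val deciles: Array[Double] = df.stat.approxQuantile(
  "value",                        // numeric column to profile (placeholder)
  (1 to 9).map(_ / 10.0).toArray, // probabilities: the nine deciles
  0.01                            // relative target error
)
```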
These options were discussed in our original Spark Summit presentation (see the 22-minute mark). Listening to myself is awful, btw.
This work is inspired by the nice visualizations provided by Facets Overview, while leveraging Spark to handle large, distributed data sets.