Initial considerations involve calculating the statistics as efficiently as possible. Some different approaches off the top of my head include:
- calculating statistics using UDAFs. For exact calculation of the histogram and standard deviation this would appear to require at least two passes over the data: for the histogram, one pass to find the min/max that fix the bin boundaries and one to count values per bin; for a naive exact standard deviation, one pass for the mean and one for the squared deviations. A sketch of the two-pass histogram follows this list.
- leveraging existing Hive/SQL functions (see the SQL example after this list)
- exploring/using approximate methods for histograms, std dev, etc. on large data (see the approxQuantile example after this list)
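As a minimal sketch of the two-pass idea, here is the exact histogram written against the DataFrame API rather than as a hand-rolled UDAF. The `exactHistogram` helper, the column name, and the default bin count are illustrative assumptions, not part of the original design:

```scala
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions._

// Two-pass exact histogram: pass 1 finds the value range, pass 2 bins and
// counts. Assumes a non-empty numeric column with min < max.
def exactHistogram(df: DataFrame, colName: String, numBins: Int = 10): Array[(Double, Long)] = {
  // Pass 1: min/max fix the bin boundaries.
  val Row(mn: Double, mx: Double) =
    df.agg(min(col(colName)).cast("double"), max(col(colName)).cast("double")).head()
  val width = (mx - mn) / numBins

  // Pass 2: map each value to a bin index and count per bin; `least` clamps
  // the maximum value into the last bin.
  df.select(least(floor((col(colName) - mn) / width), lit(numBins - 1)).cast("int").as("bin"))
    .groupBy("bin").count()
    .collect()
    .map(r => (mn + r.getInt(0) * width, r.getLong(1))) // (bin left edge, count)
    .sortBy(_._1)
}
```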
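For the Hive/SQL route, Spark SQL already ships the relevant aggregates (`count`, `avg`, `stddev`, `percentile_approx`), so much of the profile comes out of a single aggregation pass. A sketch, assuming a hypothetical `events` table with a numeric `value` column:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("column-stats").getOrCreate()

// One aggregation pass using built-in SQL functions; `events` and `value`
// are placeholder names.
spark.sql("""
  SELECT
    count(value)  AS n,
    avg(value)    AS mean,
    stddev(value) AS std_dev,
    min(value)    AS min_val,
    max(value)    AS max_val,
    percentile_approx(value, array(0.25, 0.5, 0.75)) AS quartiles
  FROM events
""").show()
```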
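For the approximate route, the DataFrame API exposes single-pass approximate quantiles (a variant of the Greenwald-Khanna algorithm) via `df.stat.approxQuantile`; a sketch, assuming a DataFrame `df` with a numeric `value` column:

```scala
// Single-pass approximate deciles; relativeError = 0.01 bounds the rank
// error at 1% of the row count, trading accuracy for speed and memory.
val deciles: Array[Double] = df.stat.approxQuantile(
  "value",                        // numeric column to profile (placeholder)
  (1 to 9).map(_ / 10.0).toArray, // probabilities: the nine deciles
  0.01                            // relative target error
)
```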
These options were discussed in our original Spark Summit presentation (see the 22-minute mark). Listening to myself is awful, btw.
This work is inspired by the nice visualizations provided by Facets Overview, while leveraging Spark to handle large, distributed data sets.