target / data-validator

A tool to validate data, built around Apache Spark.

Implement Column Statistics / Data Profiling for Numeric Columns #44

Open phpisciuneri opened 4 years ago

phpisciuneri commented 4 years ago

As discussed in our original Spark Summit presentation (see the 22-minute mark).

Listening to myself is awful btw.

Inspired by the nice visualization provided by Facets Overview, while leveraging Spark to handle large distributed data sets.

phpisciuneri commented 4 years ago

Initial considerations involve calculating the statistics as efficiently as possible. Some approaches off the top of my head include: