Introduce utility function to analyze min/max data layout

sezruby commented 3 years ago

What is the context for this pull request?

Tracking Issue: n/a
Parent Issue: n/a
Dependencies: n/a

What changes were proposed in this pull request?

Introduce utility function to analyze the data layout of the given source data.

Spark and parquet perform min/max pruning based on statistics, to improve query performance. This function helps to understand the physical layout of column values by showing some statistics based on min/max for each file.

The function returns html format or text format.

This is an example of HTML format result:

displayHTML(DataAnalysisUtils.minMaxAnalysis(df, Seq("colA"), format="html"))

quickstart-optimizezorderby-result-colA-0

An example for text result:

println(DataAnalysisUtils.minMaxAnalysis(df, Seq("colA"), format="text"))

Min/Max analysis on colA

                  < Number of files (%) >
     +--------------------------------------------------+
100% |                                                  |
     |                                                  |
     |                                                  |
     |                                                  |
 75% |                                                  |
     |                                                  |
     |                                                  |
     |                                                  |
     |                                                  |
 50% |                                                  |
     |                                                  |
     |                                                  |
     |                                                  |
     |                                                  |
 25% |                                                  |
     |   ********  *****     **************             |
     |  **********************************************  |
     |**************************************************|
     |**************************************************|
  0% |**************************************************|
     +--------------------------------------------------+
     Min <-----           colA value           -----> Max

min(colA): 0
max(colA): 9999999
Total num of files: 144
Total byte size of files: 416776899
Max. num of files for a point lookup: 36 (25.00%)
Estimated average num of files for a point lookup: 29.44 (20.44%)
Max. bytes to read for a point lookup: 104095313 (24.98%)

The utility function only supports NumericType.

Does this PR introduce any user-facing change?

Yes, provide an utility function for min/max analysis.

How was this patch tested?

via Notebook.

clee704 commented 3 years ago

In the textual form, Y-axis has labels "0%" and "100%", whereas there are actual values from 0 to 240 in the HTML form. Why the difference? It would be more consistent if both forms have the same content.

clee704 commented 3 years ago

The chart produced by the function looks very nice. But how about plotting min-max ranges for each file as we did in previous presentations (optionally sorted by min, max, or file path) in addition to the chart? That would explain the "why" of the chart, and some users like me might like to know more details.

sezruby commented 3 years ago

In the textual form, Y-axis has labels "0%" and "100%", whereas there are actual values from 0 to 240 in the HTML form. Why the difference? It would be more consistent if both forms have the same content.

The text form is not that flexible(?) so I wrote %, but in html form I can use 0 to max number of files. I agree about that consistency; will try to change html form to 0%-100%

The chart produced by the function looks very nice. But how about plotting min-max ranges for each file as we did in previous presentations (optionally sorted by min, max, or file path) in addition to the chart? That would explain the "why" of the chart, and some users like me might like to know more details.

I think we can extend the feature later :) I tried to show min-max range of each file first (there's some commented out code for file level distribution) but it's bit tricky to show it in text format in case there're millions of files.

clee704 commented 3 years ago

I think we can split the code into two parts, data/histogram generation and presentation, with the presentation part being specialized for two variants, html and text. It would be easier to evolve the code later if we need to. Until then, it seems just okay for now.

sezruby commented 3 years ago

@clee704 could you review and vote the PR? Thanks!

microsoft / hyperspace