Closed sezruby closed 2 years ago
In the textual form, Y-axis has labels "0%" and "100%", whereas there are actual values from 0 to 240 in the HTML form. Why the difference? It would be more consistent if both forms have the same content.
The chart produced by the function looks very nice. But how about plotting min-max ranges for each file as we did in previous presentations (optionally sorted by min, max, or file path) in addition to the chart? That would explain the "why" of the chart, and some users like me might like to know more details.
In the textual form, Y-axis has labels "0%" and "100%", whereas there are actual values from 0 to 240 in the HTML form. Why the difference? It would be more consistent if both forms have the same content.
The text form is not that flexible, so I used percentages there, whereas in the HTML form I can label the axis from 0 to the maximum number of files. I agree about the consistency; I will try to change the HTML form to 0%-100%.
The chart produced by the function looks very nice. But how about plotting min-max ranges for each file as we did in previous presentations (optionally sorted by min, max, or file path) in addition to the chart? That would explain the "why" of the chart, and some users like me might like to know more details.
I think we can extend the feature later :) I first tried to show the min-max range of each file (there's some commented-out code for the file-level distribution), but it's a bit tricky to show in text format when there are millions of files.
I think we can split the code into two parts, data/histogram generation and presentation, with the presentation part specialized into two variants, HTML and text. That would make the code easier to evolve later if we need to. For now, it seems okay as it is.
@clee704 could you review and vote on the PR? Thanks!
What is the context for this pull request?
What changes were proposed in this pull request?
Introduce a utility function to analyze the data layout of the given source data.
Spark and Parquet perform min/max pruning based on file statistics to improve query performance. This function helps users understand the physical layout of column values by showing statistics derived from the min/max values of each file.
The function returns its result in either HTML or text format.
This is an example of the HTML result:
This is an example of the text result:
The utility function only supports NumericType.
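To make the idea concrete, here is a minimal sketch of a min/max layout analysis, written in Python rather than the project's own code. It assumes the chart counts, for each value bucket, the files whose [min, max] range covers that bucket, and it reports the count as a percentage of all files. The names `overlap_histogram` and `render_text` are hypothetical and are not part of this PR's API.

```python
from typing import List, Tuple


def overlap_histogram(
    file_ranges: List[Tuple[float, float]], num_buckets: int = 10
) -> List[Tuple[float, float, int]]:
    """Return (bucket_lo, bucket_hi, file_count) for equal-width buckets.

    file_ranges holds one (min, max) pair per file (hypothetical input shape).
    A file is counted in every bucket that its [min, max] range touches.
    """
    lo = min(r[0] for r in file_ranges)
    hi = max(r[1] for r in file_ranges)
    if hi == lo:
        # All files contain a single identical value: one bucket is enough.
        return [(lo, hi, len(file_ranges))]

    width = (hi - lo) / num_buckets

    def bucket_index(value: float) -> int:
        # Clamp so that the global maximum falls into the last bucket.
        return min(int((value - lo) / width), num_buckets - 1)

    counts = [0] * num_buckets
    for f_min, f_max in file_ranges:
        for i in range(bucket_index(f_min), bucket_index(f_max) + 1):
            counts[i] += 1

    return [(lo + i * width, lo + (i + 1) * width, counts[i])
            for i in range(num_buckets)]


def render_text(
    hist: List[Tuple[float, float, int]], total_files: int, bar_width: int = 20
) -> str:
    """Render the histogram as text, one line per bucket, scaled to 0%-100%."""
    lines = []
    for b_lo, b_hi, count in hist:
        pct = 100.0 * count / total_files
        bar = "#" * round(bar_width * count / total_files)
        lines.append(f"{b_lo:10.1f} - {b_hi:10.1f} |{bar:<{bar_width}}| {pct:5.1f}%")
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up per-file (min, max) statistics, for illustration only.
    ranges = [(0, 10), (5, 15), (10, 20), (0, 20)]
    print(render_text(overlap_histogram(ranges, num_buckets=4), len(ranges)))
```

A bucket where close to 100% of the files overlap means a filter on that value range can skip almost no files, while a low percentage suggests the data is well clustered on that column. The sketch keeps histogram generation separate from rendering, so an HTML renderer could reuse the same histogram data.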
Does this PR introduce any user-facing change?
Yes, it provides a utility function for min/max analysis.
How was this patch tested?
Via a notebook.