hpcc-systems / DataPatterns

HPCC Systems ECL bundle that provides some basic data profiling and research tools to an ECL programmer

Break out detailed cardinality in a separate output/dataset #69

Closed gelliottrsg closed 2 years ago

gelliottrsg commented 3 years ago

Users would like to see the detailed value counts (cardinality) broken out all the way, which understandably would blow up the primary profiling dataset.

To speed up performance, can we have the ability to turn off cardinality in the Profile function, and then have an additional function that lets us choose which columns get a separate output of all their value counts (or choose all columns by default)? If possible, having all the columns broken out in this new output would allow users to join on the field name back to the profile dataset for additional analysis. I'll leave it to you to determine the performance limitations of such an approach.

dcamper commented 3 years ago

@gelliottrsg Cardinality can be disabled today by setting the 'features' parameter in Profile(). The default is 'all features' which is literally 'fill_rate,best_ecl_types,cardinality,cardinality_breakdown,modes,lengths,patterns,min_max,mean,std_dev,quartiles,correlations'. To omit cardinality in the results, pass in the explicit features you want, making sure to omit the 'cardinality' and 'cardinality_breakdown' items.
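A minimal sketch of such a call, assuming the bundle is installed and importable as DataPatterns. The inline dataset and its field names are invented purely for illustration:

```ecl
IMPORT DataPatterns;

// Hypothetical sample data; substitute your own dataset
ds := DATASET
    (
        [
            {1, 'FL', 'Miami'},
            {2, 'FL', 'Tampa'},
            {3, 'GA', 'Atlanta'}
        ],
        {UNSIGNED4 id, STRING2 state, STRING city}
    );

// The default feature list with 'cardinality' and 'cardinality_breakdown' removed
profileResults := DataPatterns.Profile
    (
        ds,
        features := 'fill_rate,best_ecl_types,modes,lengths,patterns,min_max,mean,std_dev,quartiles,correlations'
    );

OUTPUT(profileResults, ALL, NAMED('profileResults'));
```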

(From a Teams conversation on this topic: "A new function that OUTPUTs the complete breakdown for each attribute as separate workunit results. That would work, until one of them exceeds the 10MB limit on workunit results (but that can be lifted via an #OPTION).")

I'm not sure what you mean by joining the detailed/complete cardinality results back into the original. The limitation of 1000 cardinality breakdown values is due to a desire to limit the size of the child dataset. A detailed cardinality dataset could be considerably larger, which would break RAM limits the same way if it was joined with the original profile results. We'd be back to the original limitation, in other words. Is the idea of having separate workunit results not going to work in your use case?

One possible alternative is to build the detailed cardinality result as a single flat, three-field dataset (attribute name, cardinality value, cardinality count). In this case everything would be globbed together and you could conceivably join between the original profile result and this dataset, linked by attribute name. This also has the advantage of returning a dataset, to be dealt with by the caller, rather than returning a series of OUTPUT actions that could be difficult to work with. Thoughts on this method?
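To make the idea concrete, here is a rough sketch of what the flat dataset and the join could look like. Everything in it is an assumption for illustration: the record layouts, the field names and types, the hand-built rows, and the stub standing in for a real profile result.

```ecl
// Hypothetical layout for the flat result; field names and types are assumptions
CardinalityRec := RECORD
    STRING      attribute;  // name of the profiled field
    UTF8        value;      // one distinct value found in that field
    UNSIGNED8   rec_count;  // number of records containing that value
END;

// Hand-built example of what such a result could contain
cardinalityResults := DATASET
    (
        [
            {'state', U8'FL', 2},
            {'state', U8'GA', 1},
            {'city', U8'Miami', 1},
            {'city', U8'Tampa', 1},
            {'city', U8'Atlanta', 1}
        ],
        CardinalityRec
    );

// Stand-in for a profile result; only the 'attribute' field matters for the link
ProfileStubRec := RECORD
    STRING      attribute;
    REAL8       fill_rate;
END;

profileStub := DATASET([{'state', 100.0}, {'city', 100.0}], ProfileStubRec);

JoinedRec := RECORD
    CardinalityRec;
    REAL8       fill_rate;
END;

// Link each value count back to its per-attribute profile row
joined := JOIN
    (
        cardinalityResults,
        profileStub,
        LEFT.attribute = RIGHT.attribute,
        TRANSFORM
            (
                JoinedRec,
                SELF.fill_rate := RIGHT.fill_rate,
                SELF := LEFT
            )
    );

OUTPUT(joined, NAMED('joined'));
```

Because the join happens in the caller's code rather than inside the profile's child dataset, the caller decides how much of the breakdown to pull in (for example, by filtering on attribute first).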

gelliottrsg commented 3 years ago

On excluding cardinality and the breakdown: Awesome, completely overlooked that. No issue there.

"In this case everything would be globbed together and you could conceivably join between the original profile result and this dataset, linked by attribute name"

This is kinda what I'm thinking as well: an additional output/dataset function that could be called to return all the fields broken out in a three-column output (with the ability to select specific fields that need a full breakout). See below: [image]

Then the users can do further analysis either within ECL or after exporting files.

dcamper commented 3 years ago

I think the three-field result dataset is the way to go, then. As far as field selection goes, I would imagine that it would be handy to be able to explicitly list the fields for which you want a full breakdown, like Profile() does today with the 'fieldListStr' parameter (the default empty string means "all fields" but you can provide a comma-delimited list of names to constrain that).

dcamper commented 3 years ago

New function Cardinality() will be available in v1.9.0. Awaiting external test/review.
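A sketch of how a call might look, assuming Cardinality() takes the dataset plus the same comma-delimited 'fieldListStr' parameter that Profile() uses and returns the flat three-field dataset discussed above. The parameter name and the sample data are assumptions; check the v1.9.0 documentation for the actual signature.

```ecl
IMPORT DataPatterns;

// Hypothetical sample data; substitute your own dataset
ds := DATASET
    (
        [
            {1, 'FL', 'Miami'},
            {2, 'FL', 'Tampa'},
            {3, 'GA', 'Atlanta'}
        ],
        {UNSIGNED4 id, STRING2 state, STRING city}
    );

// Assumption: an empty fieldListStr would mean "all fields", as in Profile();
// here only two fields are requested for a full breakdown
cardinalityResults := DataPatterns.Cardinality(ds, fieldListStr := 'state,city');

OUTPUT(cardinalityResults, ALL, NAMED('cardinalityResults'));
```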