tsegall / fta

Metadata/data identification Java library. Identifies Semantic Type information (e.g. Gender, Age, Color, Country,...). Extensive country/language support. Extensible via user-defined plugins. Comprehensive Profiling support.
Apache License 2.0
24 stars 2 forks source link

Unique Count of the values for a column/field #82

Closed mehrotragaurav closed 4 months ago

mehrotragaurav commented 4 months ago

As on date FTA is able to return the unique count percentage for a field , is there a possibility to have the unique count as well in parallel to unique count percentage

tsegall commented 4 months ago

The unique count should be just the * both of which are returned?

So given the data set below:

FieldToAnalyze 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 21 21 22 22

The analysis returned is ...

Field 'FieldToAnalyze' (0) - { "fieldName" : "FieldToAnalyze", "totalCount" : 25, "sampleCount" : 25, "matchCount" : 25, "nullCount" : 0, "blankCount" : 0, "distinctCount" : 22, "regExp" : "\d{1,2}", "confidence" : 1.0, "type" : "Long", "isSemanticType" : false, "min" : "1", "max" : "22", "minLength" : 1, "maxLength" : 2, "mean" : 12.680000000000001, "standardDeviation" : 6.7570407724091766, "topK" : [ "22", "21", "20", "19", "18", "17", "16", "15", "14", "13" ], "bottomK" : [ "1", "2", "3", "4", "5", "6", "7", "8", "9", "10" ], "leadingZeroCount" : 0, "cardinality" : 22, "cardinalityDetail" : [ { "key" : "21", "count" : 3 }, { "key" : "22", "count" : 2 }, { "key" : "1", "count" : 1 }, { "key" : "10", "count" : 1 }, { "key" : "11", "count" : 1 }, { "key" : "12", "count" : 1 }, { "key" : "13", "count" : 1 }, { "key" : "14", "count" : 1 }, { "key" : "15", "count" : 1 }, { "key" : "16", "count" : 1 }, { "key" : "17", "count" : 1 }, { "key" : "18", "count" : 1 }, { "key" : "19", "count" : 1 }, { "key" : "2", "count" : 1 }, { "key" : "20", "count" : 1 }, { "key" : "3", "count" : 1 }, { "key" : "4", "count" : 1 }, { "key" : "5", "count" : 1 }, { "key" : "6", "count" : 1 }, { "key" : "7", "count" : 1 }, { "key" : "8", "count" : 1 }, { "key" : "9", "count" : 1 } ], "outlierCardinality" : 0, "invalidCardinality" : 0, "shapesCardinality" : 2, "shapesDetail" : [ { "key" : "99", "count" : 16 }, { "key" : "9", "count" : 9 } ], "percentiles" : [ "1", "1", "1", "1", "1", "1", "2", "2", "2", "2", "3", "3", "3", "3", "3", "4", "4", "4", "5", "5", "5", "5", "6", "6", "6", "6", "7", "7", "7", "7", "8", "8", "8", "8", "9", "9", "9", "9", "10", "10", "10", "10", "11", "11", "11", "11", "12", "12", "12", "12", "13", "13", "13", "13", "14", "14", "14", "14", "15", "15", "15", "15", "16", "16", "16", "16", "17", "17", "17", "17", "18", "18", "18", "18", "19", "19", "19", "19", "20", "20", "20", "20", "21", "21", "21", "21", "21", "21", "21", "21", "21", "21", "21", "21", "22", "22", "22", "22", "22", "22", "22" ], "histogram" : [ 3, 2, 2, 2, 2, 2, 2, 2, 2, 6 ], "leadingWhiteSpace" : false, "trailingWhiteSpace" : false, "multiline" : false, "keyConfidence" : 0.0, "uniqueness" : 0.9090909090909091, "detectionLocale" : "en-US", "ftaVersion" : "15.5.6", "structureSignature" : "B/cFTuoiDSMz57BbEz4yMKqNg60=", "dataSignature" : "xF1cdi2lYha79lGrz71TKT/n/gI=" }

And the uniqueness * cardinality gives you 20 which is the unique count?

mehrotragaurav commented 4 months ago

@tsegall : This may provide a correct result with a relative small size data , if we have to ascertain for a much larger dataset (multi million records or so ) this may still be an approx count and not the exact , thoughts ?

tsegall commented 4 months ago

I am not sure I quite understand your objective. Let me attempt to clarify. Currently FTA only counts uniqueness where the cardinality of the set is less than maxCardinality (12,000 by default). So assuming the number of distinct elements in the set is less than 12,000 which is typically highly likely then the uniqueness percentage (and hence the uniqueness count) should be perfect. Once the number of distinct elements exceeds the defined maximum then FTA simply reports -1 to indicate it does not know the answer.

The obvious solution to this problem is to use something like Hyperloglog (see https://en.wikipedia.org/wiki/HyperLogLog) - note: I need to use a variant that supports merging. This would then allow us to implement an approx_count_distinct and hence generate a uniqueness metric. This would be extremely close but not perfect. It is unclear from your response whether this would satisfy your requirement. In general, implementing perfect uniqueness is extremely computationally expensive and rarely required, hence the reason databases implement approx_count_distinct.

tsegall commented 4 months ago

Gaurav,

Any feedback?

tsegall commented 4 months ago

Closing this question as answered. See #92 for the corresponding enhancement.