hpcc-systems / DataPatterns

HPCC Systems ECL bundle that provides some basic data profiling and research tools to an ECL programmer

Store a table of common pattern value resolutions with Similarity Percentage (confidence) #70

Open gelliottrsg opened 3 years ago

gelliottrsg commented 3 years ago

Users love the pattern detection, but they would also like to apply those patterns against a dataset that stores the most common resolutions for each pattern, as a potential one-to-many name/value lookup. For instance, the patterns 9999999999, 999-999-9999, and +9 9999999999 would each have entries in this new dataset flagging them as a potential phone number. A sample of the output could look like the attached image.

image
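To make the request concrete, here is a minimal sketch of the one-to-many lookup described above. It is written in Python rather than ECL purely for illustration, and nothing in it is part of the DataPatterns bundle: `RESOLUTIONS`, `build_lookup`, and `resolve` are hypothetical names, and only the three phone patterns come from the request itself. The nine-digit rows are made-up entries that show one pattern resolving to more than one meaning.

```python
from collections import defaultdict

# Hypothetical resolution rows: (pattern, candidate meaning).
# The three phone patterns are from the request; the others are illustrative.
RESOLUTIONS = [
    ("9999999999", "phone number"),
    ("999-999-9999", "phone number"),
    ("+9 9999999999", "phone number"),
    ("999999999", "SSN"),
    ("999999999", "ZIP+4 code"),
]


def build_lookup(rows):
    """Group candidate meanings by pattern, keeping first-seen order."""
    lookup = defaultdict(list)
    for pattern, meaning in rows:
        if meaning not in lookup[pattern]:
            lookup[pattern].append(meaning)
    return dict(lookup)


LOOKUP = build_lookup(RESOLUTIONS)


def resolve(pattern):
    """Return every candidate meaning for a pattern, or an empty list."""
    return LOOKUP.get(pattern, [])


print(resolve("999-999-9999"))
print(resolve("999999999"))
```

A pattern with several candidate meanings returns all of them, which is why some ranking or confidence value would be needed to choose between them.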

dcamper commented 3 years ago

@gelliottrsg This is a good idea. A couple of questions:

  1. I feel that the meaning behind patterns is possibly specific to a use-case. There are very few patterns that would actually be globally true (latitude/longitude comes to mind as one example). Phone numbers are not global but there is a finite set of patterns for them, so they would be harder but doable. SSN patterns could be easily confused with other things. The point is, does it make sense for this functionality to have a dictionary of pattern->meaning pairs built in, or require the caller to supply the dictionary?
  2. How do you envision the 'similarity percent' and 'resolution ranking' values in your example being computed? The similarity value could be "number of records matching that pattern out of the total number of records", but that is not clear from the example.
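One possible reading of the two questions above, sketched in Python for illustration only: the caller supplies the pattern-to-meaning dictionary, the similarity value is the share of records matching each pattern, and the ranking is the order by that share. All names here (`to_pattern`, `rank_patterns`, `meanings`) are hypothetical, and the character mapping (digits to `9`, uppercase letters to `A`, lowercase letters to `a`, everything else unchanged) is an assumption about how patterns are formed, not the bundle's actual code.

```python
from collections import Counter


def to_pattern(value):
    """Map a value to a pattern string (assumed mapping: 9 / A / a / literal)."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A" if ch.isupper() else "a")
        else:
            out.append(ch)  # punctuation and spaces are kept as-is
    return "".join(out)


def rank_patterns(values, meanings):
    """Rank patterns by the percentage of records that match each one.

    `meanings` is a caller-supplied dict of pattern -> meaning (assumption:
    the dictionary is not built in).
    """
    counts = Counter(to_pattern(v) for v in values)
    total = sum(counts.values())
    rows = []
    for rank, (pattern, n) in enumerate(counts.most_common(), start=1):
        rows.append({
            "pattern": pattern,
            "meaning": meanings.get(pattern, "unknown"),
            "similarity_pct": 100.0 * n / total,
            "rank": rank,
        })
    return rows


sample = ["555-123-4567", "555-987-6543", "5551234567", "n/a"]
caller_meanings = {
    "999-999-9999": "phone number",
    "9999999999": "phone number",
}
for row in rank_patterns(sample, caller_meanings):
    print(row)
```

Under this interpretation the percentage describes how common a pattern is within the profiled field, not how likely the meaning is to be correct; a true confidence score would need some additional signal.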