tsegall / fta

Metadata/data identification Java library. Identifies Semantic Type information (e.g. Gender, Age, Color, Country,...). Extensive country/language support. Extensible via user-defined plugins. Comprehensive Profiling support.
Apache License 2.0

String length Frequency Analysis #83

Closed mehrotragaurav closed 4 months ago

mehrotragaurav commented 4 months ago

Is it possible to include a frequency analysis of the lengths of the strings in a given column, with output along these lines:

5 values with length 10
10 values with length 9
15 values with length 8
. . .
1 value with length 1

Sample output, or maybe something similar:
[ {length:"10",count:"5"}, {length:"9",count:"10"}, {length:"8",count:"15"}, . . . , {length:"1",count:"1"} ]

It should also be possible to limit the number of rows returned, e.g. the top 20 or top 30 string length frequencies.

tsegall commented 4 months ago

This seems like a relatively modest enhancement. Currently we distinguish between alphas, numerics, and others. What is the use case for just string length? Is there any distinction made between 'hello world' and 'introducing', or do these simply both count as strings of length 11?

mehrotragaurav commented 4 months ago

Both strings ('hello world' and 'introducing') will count as length 11.

tsegall commented 4 months ago

This has been addressed in 15.6.1, which should be available in ~24 hours in the usual repositories. Here is the JavaDoc ...

Get the trimmed string length frequencies. The first 127 elements (indices 0 to 126) hold the number of strings of that length, i.e. if array[5] = 8 then 8 elements of length 5 were observed. The last element (index 127) holds the number of elements observed with any length >= 127.
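
To make the array layout concrete, here is a minimal, self-contained sketch in plain Java. It does not call FTA: the class and method names (`LengthFrequencySketch`, `lengthFrequencies`, `topLengths`) are hypothetical, and it assumes that "trimmed" means `String.trim()` and that nulls are skipped, which may differ from what the library actually does. It builds a 128-slot array as the JavaDoc describes, then reduces it to a "top N" list of the kind requested at the start of this issue.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class LengthFrequencySketch {
    // Assumption: 128 slots, i.e. 127 exact-length slots (0..126) plus one overflow slot.
    static final int SLOTS = 128;

    /** Builds the histogram: index = trimmed length, last index = any length >= 127. */
    static long[] lengthFrequencies(List<String> values) {
        long[] freq = new long[SLOTS];
        for (String v : values) {
            if (v == null) {
                continue; // nulls are skipped in this sketch
            }
            int len = v.trim().length();
            freq[Math.min(len, SLOTS - 1)]++; // lengths >= 127 share the last slot
        }
        return freq;
    }

    /** Returns up to 'limit' {length, count} pairs, highest count first, skipping empty slots. */
    static List<long[]> topLengths(long[] freq, int limit) {
        List<long[]> pairs = new ArrayList<>();
        for (int len = 0; len < freq.length; len++) {
            if (freq[len] > 0) {
                pairs.add(new long[] {len, freq[len]});
            }
        }
        // Stable sort: equal counts stay in ascending length order.
        pairs.sort(Comparator.comparingLong((long[] p) -> p[1]).reversed());
        return pairs.subList(0, Math.min(limit, pairs.size()));
    }

    public static void main(String[] args) {
        List<String> column = List.of("hello world", "introducing", " abc ", "x");
        long[] freq = lengthFrequencies(column);
        for (long[] p : topLengths(freq, 20)) {
            System.out.println("length " + p[0] + ": " + p[1]);
        }
    }
}
```

Because the array is indexed by length, the "top 20 / top 30" limit can be applied by the caller after the fact, as `topLengths` does here, rather than being a setting on the analysis itself.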

tsegall commented 4 months ago

Gaurav, any feedback?

mehrotragaurav commented 4 months ago

Hi @tsegall: thanks for the follow-up. I have yet to check this piece. Give me a day or two.