tsegall / fta

Metadata/data identification Java library. Identifies Semantic Type information (e.g. Gender, Age, Color, Country,...). Extensive country/language support. Extensible via user-defined plugins. Comprehensive Profiling support.
Apache License 2.0

Unable to identify Semantic type when TextAnalyzer name is generic. #34

Closed v4mishra closed 1 year ago

v4mishra commented 1 year ago

Hello,

I am using fta 13.3.0 and need to detect semantic type based on most frequent items.

The code is similar to the following:

```java
@Test
public void testFTA() throws FTAPluginException, FTAUnsupportedLocaleException {
    // Map of value -> observed frequency; the counts sum to 310.
    LinkedHashMap<String, Long> freqMap = new LinkedHashMap<>();
    freqMap.put("Mumbai", 50L);
    freqMap.put("Delhi", 20L);
    freqMap.put("New York", 10L);
    freqMap.put("Ottawa", 70L);
    freqMap.put("Paris", 40L);
    freqMap.put("London", 30L);
    freqMap.put("Dallas", 90L);

    // The analyzer name acts as the field (column) name.
    final TextAnalyzer analysis = new TextAnalyzer("employee_location");
    analysis.setLocale(Locale.ENGLISH);
    analysis.trainBulk(freqMap);

    final TextAnalysisResult result = analysis.getResult();
    System.out.printf("Semantic Type: %s , Result Type :%s, Confidence : %f , keyConfidence: %f %n",
            result.getSemanticType(), result.getType(), result.getConfidence(), result.getKeyConfidence());
    System.out.println(result);
}
```

The output I get is below. Note that the proper semantic type (city/town) is not reported:

```
Semantic Type: null , Result Type :String, Confidence : 1.000000 , keyConfidence: 0.000000
{"fieldName":"employee_location","totalCount":-1,"sampleCount":310,"matchCount":310,"nullCount":0,"blankCount":0,"distinctCount":7,"regExp":"(?i)(DALLAS|DELHI|LONDON|MUMBAI|NEW YORK|OTTAWA|PARIS)","confidence":1.0,"type":"String","isSemanticType":false,"min":"Dallas","max":"Paris","minLength":5,"maxLength":8,"topK":["Paris","Ottawa","New York","Mumbai","London","Delhi","Dallas"],"bottomK":["Dallas","Delhi","London","Mumbai","New York","Ottawa","Paris"],"cardinality":7,"outlierCardinality":0,"invalidCardinality":0,"shapesCardinality":3,"leadingWhiteSpace":false,"trailingWhiteSpace":false,"multiline":false,"keyConfidence":0.0,"uniqueness":0.0,"detectionLocale":"en","ftaVersion":"13.3.0","structureSignature":"ATNzuqht5V2TYABRIdUoj3BkXGo=","dataSignature":"rsLg6BQcvVBIS0Hn4d6JRGUcgDY="}
```

If I change the TextAnalyzer name to employee_city, I get the proper semantic type:
`final TextAnalyzer analysis = new TextAnalyzer("employee_city");`

Output: `Semantic Type: CITY , Result Type :String, Confidence : 1.000000 , keyConfidence: 0.000000`

Could you please suggest how I can make semantic type detection less dependent on the TextAnalyzer name?

tsegall commented 1 year ago

Please see the prior discussion in issue #1.

Your example appears to be synthetic. There is no doubt that if we used a list of the 4000 cities with a population in excess of 100,000 (see for example https://fingolas.carto.com/tables/ergebnis/public), we could possibly detect your case. However, if you analyze a large set of real data (see for example https://github.com/tsegall/semantic-types), this synthetic case never occurs. The current performance of the CITY Semantic Type on this real-world data is:

SemanticType: CITY, Precision: 0.9991, Recall: 0.9889, F1 Score: 0.9940 (TP: 2221, FP: 2, FN: 25)
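For readers who want to check these figures, the reported scores follow from the raw counts using the standard definitions of precision, recall, and F1. The sketch below is illustrative only: the class name `CityMetrics` and its methods are hypothetical and are not part of FTA or of the semantic-types evaluation tooling.

```java
import java.util.Locale;

// Hypothetical helper (not part of FTA): recomputes the reported scores
// from the TP/FP/FN counts quoted above.
public class CityMetrics {
    // Precision = TP / (TP + FP)
    public static double precision(long tp, long fp) {
        return (double) tp / (tp + fp);
    }

    // Recall = TP / (TP + FN)
    public static double recall(long tp, long fn) {
        return (double) tp / (tp + fn);
    }

    // F1 = 2 * TP / (2 * TP + FP + FN), the harmonic mean of precision and recall
    public static double f1(long tp, long fp, long fn) {
        return 2.0 * tp / (2.0 * tp + fp + fn);
    }

    public static void main(String[] args) {
        long tp = 2221, fp = 2, fn = 25;
        // Prints: Precision: 0.9991, Recall: 0.9889, F1 Score: 0.9940
        System.out.printf(Locale.ROOT, "Precision: %.4f, Recall: %.4f, F1 Score: %.4f%n",
                precision(tp, fp), recall(tp, fn), f1(tp, fp, fn));
    }
}
```

With TP = 2221, FP = 2, and FN = 25, this gives 2221/2223, 2221/2246, and 4442/4469 respectively, which round to the values quoted.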

Hope this discussion is not too detailed

Executive summary

tsegall commented 1 year ago

Did this answer your question?