Unable to identify Semantic type when TextAnalyzer name is generic.

v4mishra commented 1 year ago

Hello,

I am using FTA 13.3.0 and need to detect the semantic type based on the most frequent items.

The code is similar to the following:

```java
@Test
public void testFTA() throws FTAPluginException, FTAUnsupportedLocaleException {
    // Observed values mapped to their frequencies
    LinkedHashMap<String, Long> freqMap = new LinkedHashMap<>();
    freqMap.put("Mumbai", 50L);
    freqMap.put("Delhi", 20L);
    freqMap.put("New York", 10L);
    freqMap.put("Ottawa", 70L);
    freqMap.put("Paris", 40L);
    freqMap.put("London", 30L);
    freqMap.put("Dallas", 90L);

    // The analyzer name acts as the field (header) name
    final TextAnalyzer analysis = new TextAnalyzer("employee_location");
    analysis.setLocale(Locale.ENGLISH);
    analysis.trainBulk(freqMap);

    final TextAnalysisResult result = analysis.getResult();
    System.out.printf("Semantic Type: %s , Result Type :%s, Confidence : %f , keyConfidence: %f %n",
            result.getSemanticType(), result.getType(), result.getConfidence(), result.getKeyConfidence());
    System.out.println(result);
}
```

The output I get is below. Note that the semantic type is null rather than the expected city/town type:

```
Semantic Type: null , Result Type :String, Confidence : 1.000000 , keyConfidence: 0.000000
{"fieldName":"employee_location","totalCount":-1,"sampleCount":310,"matchCount":310,"nullCount":0,"blankCount":0,"distinctCount":7,"regExp":"(?i)(DALLAS|DELHI|LONDON|MUMBAI|NEW YORK|OTTAWA|PARIS)","confidence":1.0,"type":"String","isSemanticType":false,"min":"Dallas","max":"Paris","minLength":5,"maxLength":8,"topK":["Paris","Ottawa","New York","Mumbai","London","Delhi","Dallas"],"bottomK":["Dallas","Delhi","London","Mumbai","New York","Ottawa","Paris"],"cardinality":7,"outlierCardinality":0,"invalidCardinality":0,"shapesCardinality":3,"leadingWhiteSpace":false,"trailingWhiteSpace":false,"multiline":false,"keyConfidence":0.0,"uniqueness":0.0,"detectionLocale":"en","ftaVersion":"13.3.0","structureSignature":"ATNzuqht5V2TYABRIdUoj3BkXGo=","dataSignature":"rsLg6BQcvVBIS0Hn4d6JRGUcgDY="}
```
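As a quick sanity check on the result above, the following sketch uses only the JDK (no FTA classes) to confirm two things that can be read off the JSON: the reported `sampleCount` of 310 equals the sum of the frequencies passed to `trainBulk`, and the reported `regExp` matches every input value. The class and method names are hypothetical, and the reading of `sampleCount` as the sum of the frequencies is an assumption based on this one output, not on the FTA documentation.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical helper: checks the reported result against the input frequency map. */
final class ResultSanityCheck {
    /** The same value/frequency pairs used in the test above. */
    static Map<String, Long> sampleFrequencies() {
        Map<String, Long> freqMap = new LinkedHashMap<>();
        freqMap.put("Mumbai", 50L);
        freqMap.put("Delhi", 20L);
        freqMap.put("New York", 10L);
        freqMap.put("Ottawa", 70L);
        freqMap.put("Paris", 40L);
        freqMap.put("London", 30L);
        freqMap.put("Dallas", 90L);
        return freqMap;
    }

    /** Sum of all frequencies (assumed here to correspond to sampleCount). */
    static long totalSamples(Map<String, Long> freqMap) {
        return freqMap.values().stream().mapToLong(Long::longValue).sum();
    }

    /** True if every key fully matches the given regular expression. */
    static boolean allMatch(Map<String, Long> freqMap, String regExp) {
        Pattern pattern = Pattern.compile(regExp);
        return freqMap.keySet().stream().allMatch(v -> pattern.matcher(v).matches());
    }

    public static void main(String[] args) {
        Map<String, Long> freqMap = sampleFrequencies();
        String regExp = "(?i)(DALLAS|DELHI|LONDON|MUMBAI|NEW YORK|OTTAWA|PARIS)";
        System.out.println("Total samples: " + totalSamples(freqMap));
        System.out.println("Distinct values: " + freqMap.size());
        System.out.println("All values match regExp: " + allMatch(freqMap, regExp));
    }
}
```

In other words, the analyzer has consumed all of the data and summarized it as a plain enumerated String; only the semantic label is missing.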

If I change the TextAnalyzer name to `employee_city`, I get the proper semantic type:

`final TextAnalyzer analysis = new TextAnalyzer("employee_city");`

Output: `Semantic Type: CITY , Result Type :String, Confidence : 1.000000 , keyConfidence: 0.000000`

Could you please suggest how I can make semantic type detection less dependent on the TextAnalyzer name?

tsegall commented 1 year ago

Please see the prior discussion associated with issue #1.

Your example appears to be synthetic. If we used a list of the 4000 cities with a population in excess of 100,000 (see, for example, https://fingolas.carto.com/tables/ergebnis/public), we could possibly detect your case. However, if you analyze a large set of real data (see, for example, https://github.com/tsegall/semantic-types), this synthetic case never occurs. The current performance of the CITY Semantic Type on this real-world data is:

SemanticType: CITY, Precision: 0.9991, Recall: 0.9889, F1 Score: 0.9940 (TP: 2221, FP: 2, FN: 25)
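For readers who want to see how these figures relate to the raw counts, here is a minimal sketch that recomputes precision, recall and F1 from TP, FP and FN using the standard definitions. The class and method names are hypothetical; this is not code from FTA or from the semantic-types repository.

```java
import java.util.Locale;

/** Hypothetical helper: standard precision/recall/F1 from confusion counts. */
final class ClassificationMetrics {
    /** Fraction of predicted positives that are correct: TP / (TP + FP). */
    static double precision(long tp, long fp) {
        return (double) tp / (tp + fp);
    }

    /** Fraction of actual positives that were found: TP / (TP + FN). */
    static double recall(long tp, long fn) {
        return (double) tp / (tp + fn);
    }

    /** Harmonic mean of precision and recall, written as 2TP / (2TP + FP + FN). */
    static double f1(long tp, long fp, long fn) {
        return 2.0 * tp / (2.0 * tp + fp + fn);
    }

    public static void main(String[] args) {
        // Counts taken from the CITY line above
        long tp = 2221, fp = 2, fn = 25;
        System.out.printf(Locale.ROOT, "Precision: %.4f, Recall: %.4f, F1 Score: %.4f%n",
                precision(tp, fp), recall(tp, fn), f1(tp, fp, fn));
    }
}
```

With TP = 2221, FP = 2 and FN = 25, this gives 2221/2223, 2221/2246 and 4442/4469 respectively, which round to the four-decimal values quoted above.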

I hope this discussion is not too detailed.

Executive summary

I believe the performance (F1 Score) on the CITY Semantic Type is approximately 99.5%.
Most of the Semantic Types are not as dependent on the header as CITY.
Please feel free to respond if this does not make sense or if you have any alternative thoughts.
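To make the city-list idea mentioned in this reply concrete, here is a minimal sketch of a header-independent check: it takes a value/frequency map like the one in the question and reports what fraction of the samples appear in a reference list. Everything here is an assumption for illustration: `CityListMatcher`, `KNOWN_CITIES` and `weightedMatchRate` are hypothetical names, the eight-entry set stands in for a real list of several thousand cities, and this is not how FTA implements the CITY Semantic Type.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: frequency-weighted lookup against a reference city list. */
final class CityListMatcher {
    // Tiny stand-in for a real reference list (assumption, for illustration only)
    static final Set<String> KNOWN_CITIES = Set.of(
            "MUMBAI", "DELHI", "NEW YORK", "OTTAWA", "PARIS", "LONDON", "DALLAS", "TOKYO");

    /** Fraction of all samples (weighted by frequency) whose value is in the list. */
    static double weightedMatchRate(Map<String, Long> freqMap) {
        long total = 0;
        long matched = 0;
        for (Map.Entry<String, Long> entry : freqMap.entrySet()) {
            total += entry.getValue();
            // Normalize case and surrounding whitespace before the lookup
            String normalized = entry.getKey().trim().toUpperCase(Locale.ROOT);
            if (KNOWN_CITIES.contains(normalized)) {
                matched += entry.getValue();
            }
        }
        return total == 0 ? 0.0 : (double) matched / total;
    }

    public static void main(String[] args) {
        Map<String, Long> freqMap = Map.of("Mumbai", 50L, "Dallas", 90L, "Acme Corp", 60L);
        System.out.println("Weighted match rate: " + weightedMatchRate(freqMap));
    }
}
```

A caller could treat a match rate above some chosen threshold as evidence for a city column regardless of the field name. The trade-off is that a list large enough to be useful also contains many strings that are common in non-city columns, which is one reason a header hint can be valuable.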

tsegall commented 1 year ago

Did this answer your question?

tsegall/fta issue #34