Closed v4mishra closed 1 year ago
Please see the prior discussion associated with issue #1.
Your example appears to be synthetic. There is no doubt that if we used a list of the 4000 cities with a population in excess of 100,000 (see, for example, https://fingolas.carto.com/tables/ergebnis/public), we could possibly detect your case. However, if you analyze a large set of real data (see, for example, https://github.com/tsegall/semantic-types), this synthetic case never occurs. The current performance of the CITY Semantic Type on this real-world data is:
SemanticType: CITY, Precision: 0.9991, Recall: 0.9889, F1 Score: 0.9940 (TP: 2221, FP: 2, FN: 25)
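For anyone who wants to see how these figures follow from the raw counts, here is a minimal sketch that recomputes them from the TP/FP/FN values quoted above. The class name `CityMetrics` and its helper methods are illustrative only and are not part of FTA.

```java
import java.util.Locale;

public class CityMetrics {
    // Precision: share of values reported as CITY that really are cities.
    static double precision(long tp, long fp) {
        return (double) tp / (tp + fp);
    }

    // Recall: share of real CITY values that were detected.
    static double recall(long tp, long fn) {
        return (double) tp / (tp + fn);
    }

    // F1: harmonic mean of precision and recall.
    static double f1(double p, double r) {
        return 2 * p * r / (p + r);
    }

    public static void main(String[] args) {
        // Counts taken from the result line above.
        long tp = 2221, fp = 2, fn = 25;
        double p = precision(tp, fp);
        double r = recall(tp, fn);
        // Locale.ROOT keeps the decimal separator as '.' on any machine.
        System.out.printf(Locale.ROOT,
                "Precision: %.4f, Recall: %.4f, F1 Score: %.4f%n", p, r, f1(p, r));
    }
}
```

Running it prints `Precision: 0.9991, Recall: 0.9889, F1 Score: 0.9940`, which matches the reported line.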
Hello,
I am using FTA 13.3.0 and need to detect the semantic type based on the most frequent items.
The code is similar to the one below:
```java
@Test
public void testFTA() throws FTAPluginException, FTAUnsupportedLocaleException {
    LinkedHashMap<String, Long> freqMap = new LinkedHashMap<>();
    freqMap.put("Mumbai", 50L);
    freqMap.put("Delhi", 20L);
    freqMap.put("New York", 10L);
    freqMap.put("Ottawa", 70L);
    freqMap.put("Paris", 40L);
    freqMap.put("London", 30L);
    freqMap.put("Dallas", 90L);

    final TextAnalyzer analysis = new TextAnalyzer("employee_location");
    analysis.setLocale(Locale.ENGLISH);
    analysis.trainBulk(freqMap);

    final TextAnalysisResult result = analysis.getResult();
    System.out.printf("Semantic Type: %s , Result Type :%s, Confidence : %f , keyConfidence: %f %n",
            result.getSemanticType(), result.getType(), result.getConfidence(), result.getKeyConfidence());
    System.out.println(result);
}
```
The output I get is below (notice that I do not get the proper semantic type, city/town):
```
Semantic Type: null , Result Type :String, Confidence : 1.000000 , keyConfidence: 0.000000
{"fieldName":"employee_location","totalCount":-1,"sampleCount":310,"matchCount":310,"nullCount":0,"blankCount":0,"distinctCount":7,"regExp":"(?i)(DALLAS|DELHI|LONDON|MUMBAI|NEW YORK|OTTAWA|PARIS)","confidence":1.0,"type":"String","isSemanticType":false,"min":"Dallas","max":"Paris","minLength":5,"maxLength":8,"topK":["Paris","Ottawa","New York","Mumbai","London","Delhi","Dallas"],"bottomK":["Dallas","Delhi","London","Mumbai","New York","Ottawa","Paris"],"cardinality":7,"outlierCardinality":0,"invalidCardinality":0,"shapesCardinality":3,"leadingWhiteSpace":false,"trailingWhiteSpace":false,"multiline":false,"keyConfidence":0.0,"uniqueness":0.0,"detectionLocale":"en","ftaVersion":"13.3.0","structureSignature":"ATNzuqht5V2TYABRIdUoj3BkXGo=","dataSignature":"rsLg6BQcvVBIS0Hn4d6JRGUcgDY="}
```
If I change the TextAnalyzer name to employee_city, I get the proper semantic type:
final TextAnalyzer analysis = new TextAnalyzer("employee_city");
Output:
Semantic Type: CITY , Result Type :String, Confidence : 1.000000 , keyConfidence: 0.000000
Could you please suggest how I can make semantic type detection less dependent on the TextAnalyzer name?