tsegall / fta

Metadata/data identification Java library. Identifies Semantic Type information (e.g. Gender, Age, Color, Country,...). Extensive country/language support. Extensible via user-defined plugins. Comprehensive Profiling support.
Apache License 2.0

String type detect result seems incorrect #81

Closed ppr1011 closed 4 months ago

ppr1011 commented 4 months ago

fta version 15.5.5

I added a String value to doubleList:

```java
@Test
public void detect() {
    // 18 valid doubles plus one non-numeric value ("test")
    List<String> doubleList = Arrays.asList(
            "1.0", "1.1", "1.2", "1.3", "1.4", "1.5",
            "1.1", "1.2", "1.3", "1.4", "1.5", "1.6",
            "1.1", "1.2", "1.3", "1.4", "1.5", "1.6", "test"
    );
    assertEquals("STRING", DataTypeDetector.detect(null, doubleList));
}
```

```java
public static String detect(String physicalType, List<String> columnDataList) {
    if (CollectionUtils.isEmpty(columnDataList)) {
        return null;
    }
    TextAnalyzer textAnalyzer = new TextAnalyzer("*");
    textAnalyzer.setLocale(Locale.ENGLISH);
    // Feed every sample to the analyzer before asking for a result
    for (String input : columnDataList) {
        textAnalyzer.train(input);
    }
    TextAnalysisResult result = textAnalyzer.getResult();
    return result.getType().name();
}
```

Expected: STRING. Actual: DOUBLE.

tsegall commented 4 months ago

In general, a certain level of confidence is required before a set of elements is recognized as a particular type. The default detection threshold is 95%, although this can be modified; see the documentation for setThreshold() on TextAnalyzer.

From the updated documentation: The threshold (0-100) used to determine if a data stream is of a particular base type. For example, if the data stream has 100 samples and we see 97 valid doubles and 3 malformed values like '3.456e', 'e05', and '-' then provided the threshold is below 97 this stream will be detected as base type 'DOUBLE'.

You can set this to 100 (i.e. strict mode); in this case, any value that is not a valid double will cause the stream to be reported as a STRING type. This is typically not what you want, since bad values are common in user-generated data streams. On the other hand, if the data is sourced from a database and hence the format is known to be good, you may wish to set this to 100.

There was an issue with the threshold for Doubles not being honored in some cases, which was addressed in 15.7.0.

The default threshold as shipped is 95%. So, for example, if you have 20 values, one of which is in error (19/20 = 95% valid), this will still be detected as a Double; however, if you have two in error (18/20 = 90% valid), it will be detected as a String.
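To make the arithmetic concrete, here is a minimal plain-Java sketch of the threshold rule as described above. It is not FTA's implementation: the class `ThresholdSketch` and its helpers are hypothetical, validity is approximated with `Double.parseDouble`, and the "at least the threshold" comparison is an assumption taken from the 20-value example. Applied to the list in the original report (18 valid values out of 19, roughly 94.7%), it falls just short of 95%, which is consistent with STRING being the expected result once the threshold is honored.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative stand-in for the threshold rule; NOT FTA's actual code.
class ThresholdSketch {

    // Hypothetical validity check: anything Double.parseDouble accepts counts as a double.
    static boolean isDouble(String s) {
        try {
            Double.parseDouble(s.trim());
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    // Assumed rule: the share of valid samples must be at least `threshold` percent.
    // Integer arithmetic avoids floating-point rounding right at the boundary.
    static boolean meetsThreshold(int valid, int total, int threshold) {
        return total > 0 && valid * 100 >= threshold * total;
    }

    // Counts valid doubles and reports the base type the rule would choose.
    static String baseType(List<String> samples, int threshold) {
        int valid = 0;
        for (String s : samples) {
            if (isDouble(s)) {
                valid++;
            }
        }
        return meetsThreshold(valid, samples.size(), threshold) ? "DOUBLE" : "STRING";
    }

    public static void main(String[] args) {
        // The 19 values from the original report: 18 doubles plus "test".
        List<String> issueData = Arrays.asList(
                "1.0", "1.1", "1.2", "1.3", "1.4", "1.5",
                "1.1", "1.2", "1.3", "1.4", "1.5", "1.6",
                "1.1", "1.2", "1.3", "1.4", "1.5", "1.6", "test");

        System.out.println(meetsThreshold(19, 20, 95));  // true:  one error in 20
        System.out.println(meetsThreshold(18, 20, 95));  // false: two errors in 20
        System.out.println(meetsThreshold(19, 20, 100)); // false: strict mode rejects any error
        System.out.println(baseType(issueData, 95));     // STRING: 18 of 19 is below 95%
    }
}
```

With the real library, the equivalent knob is setThreshold() on TextAnalyzer; the sketch only mirrors the percentage comparison, not the rest of the detection logic.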