tsegall / fta

Metadata/data identification Java library. Identifies Semantic Type information (e.g. Gender, Age, Color, Country,...). Extensive country/language support. Extensible via user-defined plugins. Comprehensive Profiling support.
Apache License 2.0

Double values with German locale don't work as expected #71

Closed andreainfogix closed 6 months ago

andreainfogix commented 6 months ago

After upgrading from version 8.0.22 to 14.6.1, values that represent doubles are no longer detected correctly when the Locale is set to GERMANY. With the previous version the detected type was double, as expected; with the new version it is string.


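    // Imports assumed from FTA's published examples; verify the packages against your FTA version.
    import java.util.Locale;

    import com.cobber.fta.AnalyzerContext;
    import com.cobber.fta.RecordAnalysisResult;
    import com.cobber.fta.RecordAnalyzer;
    import com.cobber.fta.TextAnalysisResult;
    import com.cobber.fta.TextAnalyzer;
    import com.cobber.fta.core.FTAPluginException;
    import com.cobber.fta.core.FTAUnsupportedLocaleException;
    import com.cobber.fta.dates.DateTimeParser.DateResolutionMode;

    public class Issue71 {
        public static void main(String[] args) throws FTAPluginException, FTAUnsupportedLocaleException {
            String[] headers = { "First", "Last", "MI" };
            // The third column holds the numeric samples under test.
            String[][] names = { { "Anaïs", "Nin", "9,876.54" }, { "Gertrude", "Stein", "3,876" },
                    { "Paul", "Campbell", "76.54" }, { "Pablo", "Picasso", "123.45" } };

            AnalyzerContext context = new AnalyzerContext(null, DateResolutionMode.Auto, "customer", headers);
            TextAnalyzer template = new TextAnalyzer(context);

            // Analyze using German number conventions.
            template.setLocale(Locale.GERMANY);

            RecordAnalyzer analysis = new RecordAnalyzer(template);

            for (String[] name : names) {
                analysis.train(name);

                // As in the original report, the result is fetched and printed after each record.
                RecordAnalysisResult recordResult = analysis.getResult();

                for (TextAnalysisResult result : recordResult.getStreamResults()) {
                    System.err.printf("Semantic Type: %s (%s)%n", result.getSemanticType(), result.getType());
                }
            }
        }
    }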
tsegall commented 6 months ago

Localized number support has been completely reworked since 8.0.22. This test will not work because 3,876 is a valid German number (the comma is the German decimal separator), so the sample contains one valid localized number and three invalid ones. If you change 3,876 to 3,876.01, it will work. Note: there is now a typeModifier that indicates non-localized numbers. See issue71() in TestIssues.java for more details.
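
To illustrate the "one valid and three invalid" point, here is a minimal sketch that checks the four sample values against a strict number pattern built from the JDK's `DecimalFormatSymbols`. This is not FTA's detection logic; the class `LocalizedNumberCheck` and its methods are hypothetical names introduced only for this illustration, and the strict-grouping rule (groups of exactly three digits) is an assumption of the sketch.

```java
import java.text.DecimalFormatSymbols;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper, NOT part of FTA: a strict check of whether a string
// is a well-formed number under a locale's grouping and decimal separators.
class LocalizedNumberCheck {

    // Optional sign, then an integer part that is either ungrouped digits or
    // digits grouped in threes, then an optional fraction.
    static Pattern strictPattern(Locale locale) {
        DecimalFormatSymbols symbols = DecimalFormatSymbols.getInstance(locale);
        String group = Pattern.quote(String.valueOf(symbols.getGroupingSeparator()));
        String decimal = Pattern.quote(String.valueOf(symbols.getDecimalSeparator()));
        return Pattern.compile(
                "[+-]?(\\d{1,3}(" + group + "\\d{3})+|\\d+)(" + decimal + "\\d+)?");
    }

    static boolean isValid(String value, Locale locale) {
        return strictPattern(locale).matcher(value).matches();
    }

    static long countValid(String[] values, Locale locale) {
        long count = 0;
        for (String value : values) {
            if (isValid(value, locale)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Third-column values from the report, and the variant suggested in the reply.
        String[] original = { "9,876.54", "3,876", "76.54", "123.45" };
        String[] modified = { "9,876.54", "3,876.01", "76.54", "123.45" };

        System.out.println("original, GERMANY: " + countValid(original, Locale.GERMANY));
        System.out.println("original, US:      " + countValid(original, Locale.US));
        System.out.println("modified, GERMANY: " + countValid(modified, Locale.GERMANY));
        System.out.println("modified, US:      " + countValid(modified, Locale.US));
    }
}
```

Under these rules only 3,876 passes the German check in the original data, while all four values pass the US-style check. After changing 3,876 to 3,876.01, none of the values is a valid German number and all four remain valid in US style, so the column is consistently non-localized.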