tsegall / fta

Metadata/data identification Java library. Identifies Semantic Type information (e.g. Gender, Age, Color, Country,...). Extensive country/language support. Extensible via user-defined plugins. Comprehensive Profiling support.
Apache License 2.0

DateTimeParser with DateResolutionMode.Auto works as MonthFirst for all locales #10

Closed simonenkoi closed 2 years ago

simonenkoi commented 2 years ago

Code example to reproduce:

Arrays
        .stream(Locale.getAvailableLocales())
        .forEach(locale -> {
            var value = "02.02.2020";
            var dateTimeParser = new DateTimeParser()
                .withDateResolutionMode(DateTimeParser.DateResolutionMode.Auto)
                .withLocale(locale);
            dateTimeParser.train(value);
            var result = dateTimeParser.getResult();
            System.out.printf("Locale: %s, format: %s%n", locale, result.getFormatString());
        });

Expected behavior: The format is either "MM.dd.yyyy" or "dd.MM.yyyy" depending on locale

Actual behavior: All locales return the "MM.dd.yyyy" format

FTA version: 9.0.17

simonenkoi commented 2 years ago

I think you can validate correctness with the following code:

Arrays
    .stream(Locale.getAvailableLocales())
    .forEach(locale -> {
        var value = "02.02.2020";
        var dateTimeParser = new DateTimeParser()
            .withDateResolutionMode(DateTimeParser.DateResolutionMode.Auto)
            .withLocale(locale);
        dateTimeParser.train(value);
        var result = dateTimeParser.getResult();

        var fmt = ((SimpleDateFormat) DateFormat.getDateInstance(DateFormat.LONG, locale));

        System.out.printf(
            "Locale: %s, format: %s, should be %s%n",
            locale,
            result.getFormatString(),
            getFormatBasedOnFirstCharacter(fmt.toPattern())
        );
    }); 

public static String getFormatBasedOnFirstCharacter(String str) {
    for (char ch : str.toCharArray()) {
        //day first
        if (ch == 'd') {
            return "dd.MM.yyyy";
        }
        //month first
        if (ch == 'M') {
            return "MM.dd.yyyy";
        }
    }
    return null;
}

simonenkoi commented 2 years ago

Also, some locales use the YMD pattern, so the "02.02.02" date should be identified as "yy.MM.dd" for the following code example:

List
    .of(
        new Locale("nds"),
        new Locale("bo"),
        new Locale("lv"),
        new Locale("zh"),
        new Locale("vo"),
        new Locale("dz"),
        new Locale("sah"),
        new Locale("ml"),
        new Locale("mn"),
        new Locale("ja"),
        new Locale("my")
    )
    .forEach(locale -> {
        var value = "02.02.02";
        var dateTimeParser = new DateTimeParser()
            .withDateResolutionMode(DateTimeParser.DateResolutionMode.Auto)
            .withLocale(locale);
        dateTimeParser.train(value);
        var result = dateTimeParser.getResult();

        System.out.printf(
            "Locale: %s, format: %s%n",
            locale,
            result.getFormatString());
    });

https://en.wikipedia.org/wiki/Date_format_by_country#:~:text=DMY%20and%20MDY%20are%20used,spelled%20out%20to%20avoid%20confusion.

I can create a separate issue for it; it doesn't affect me as an end-user because I currently don't support these locales.
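
The field order a locale prefers can also be read from the JDK's `java.time` API, without casting to `SimpleDateFormat`. The sketch below is only an illustration and not part of FTA: the class `FieldOrder` and its method `of` are hypothetical names, and it assumes the locale's SHORT date pattern reflects the order expected for purely numeric input such as "02.02.02". The result depends on the locale data shipped with the JDK in use.

```java
import java.time.chrono.IsoChronology;
import java.time.format.DateTimeFormatterBuilder;
import java.time.format.FormatStyle;
import java.util.Locale;

// Hypothetical helper, not part of FTA.
final class FieldOrder {

    // Returns the order of day (D), month (M) and year (Y) fields
    // in the locale's SHORT date pattern, e.g. "DMY" or "YMD".
    public static String of(Locale locale) {
        String pattern = DateTimeFormatterBuilder.getLocalizedDateTimePattern(
            FormatStyle.SHORT, null, IsoChronology.INSTANCE, locale);

        StringBuilder order = new StringBuilder();
        boolean quoted = false;
        for (char ch : pattern.toCharArray()) {
            // Skip literal text enclosed in single quotes.
            if (ch == '\'') {
                quoted = !quoted;
                continue;
            }
            if (quoted) {
                continue;
            }
            char field;
            if (ch == 'd') {
                field = 'D';
            } else if (ch == 'M' || ch == 'L') {
                field = 'M';
            } else if (ch == 'y' || ch == 'u') {
                field = 'Y';
            } else {
                continue;
            }
            // Record each field only the first time it appears.
            if (order.indexOf(String.valueOf(field)) < 0) {
                order.append(field);
            }
        }
        return order.toString();
    }

    public static void main(String[] args) {
        for (Locale locale : new Locale[] {Locale.US, Locale.UK, Locale.JAPAN}) {
            System.out.printf("Locale: %s, field order: %s%n", locale, of(locale));
        }
    }
}
```

Comparing this order with the format string reported by `getFormatString()` would cover the YMD case as well as the DMY/MDY case.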

tsegall commented 2 years ago

The core issue is addressed in 9.0.18. It arose because fta-core (the date processing) is mostly invoked from fta (the semantic tagging and profiling module), which sets the mode to day-first or month-first based on the locale. What version of Java were you using to see the YMD pattern?

simonenkoi commented 2 years ago

List
    .of(
        new Locale("nds"),
        new Locale("bo"),
        new Locale("lv"),
        new Locale("zh"),
        new Locale("vo"),
        new Locale("dz"),
        new Locale("sah"),
        new Locale("ml"),
        new Locale("mn"),
        new Locale("ja"),
        new Locale("my")
    )
    .forEach(locale -> {
        var fmt = ((SimpleDateFormat) DateFormat.getDateInstance(DateFormat.LONG, locale));

        System.out.printf("Locale: %s, format: %s%n", locale, fmt.toPattern());
    });

Java 11.0.14.1 OpenJDK
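
Because these patterns come from the JDK's locale data, the output can differ between Java releases: JDK 9 and later use CLDR data by default, and the `java.locale.providers` system property changes the lookup order. A minimal sketch for recording that context next to a pattern follows; the class name `LocaleDataInfo` and the method `describe` are hypothetical.

```java
import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.Locale;

// Hypothetical helper for reporting which runtime produced a pattern.
final class LocaleDataInfo {

    // Describes the runtime and the LONG date pattern it reports for a locale.
    public static String describe(Locale locale) {
        DateFormat fmt = DateFormat.getDateInstance(DateFormat.LONG, locale);
        // A DateFormat is not guaranteed to be a SimpleDateFormat, so check before casting.
        String pattern = (fmt instanceof SimpleDateFormat)
            ? ((SimpleDateFormat) fmt).toPattern()
            : "(unavailable)";
        return String.format(
            "java.version=%s, java.locale.providers=%s, locale=%s, pattern=%s",
            System.getProperty("java.version"),
            System.getProperty("java.locale.providers", "(default)"),
            locale.toLanguageTag(),
            pattern);
    }

    public static void main(String[] args) {
        System.out.println(describe(Locale.JAPAN));
    }
}
```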

simonenkoi commented 2 years ago

Validated the core issue; works as expected in 9.0.18. Thank you!