laminas / laminas-form

Validate and display simple and complex forms, casting forms to business objects and vice versa
https://docs.laminas.dev/laminas-form/
BSD 3-Clause "New" or "Revised" License

Date pattern parsing fails with "exotic" locales #228

Closed by pine3ree 8 months ago

pine3ree commented 1 year ago

The AbstractFormDateSelect::parsePattern() method (https://github.com/laminas/laminas-form/blob/a41bb38a759590141e14fe907c107f80c7c3569b/src/View/Helper/AbstractFormDateSelect.php#L74) seems unable to handle less common locales. I added a failing test with the zh_Hans_HK locale (https://github.com/laminas/laminas-form/pull/229). In this case the month part is not extracted.

Slamdunk commented 1 year ago

Thank you for bringing this up.

So the IntlDateFormatter::LONG pattern for zh_Hans_HK seems to be y年M月d日 z ah:mm:ss.

Then the FormDateTimeSelect tries to split that pattern here:

https://github.com/laminas/laminas-form/blob/abc710a89cf57b187f9e71e45912ebba2afdff63/src/View/Helper/FormDateTimeSelect.php#L197-L202

The last ([ \-,.:\/]+) is what we need to focus on: it treats only a few basic characters as the split sequence, not the Chinese characters inside y年M月d日.

We could try to fix this by replacing any non-ASCII character with a space, so that preg_split behaves as expected again. What do you think?
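To make the failure concrete, here is a minimal sketch that uses only the separator class quoted above, not the helper's full regular expression. The variable names and the PREG_SPLIT_NO_EMPTY flag are assumptions made for illustration. The second half tries the idea of replacing non-ASCII characters with a space.

```php
<?php
// Pattern reported in this thread for zh_Hans_HK with IntlDateFormatter::LONG.
$pattern = 'y年M月d日 z ah:mm:ss';

// Simplified separator class taken from the helper's regex (assumption: the
// real expression contains more than this single character class).
$separators = '/[ \-,.:\/]+/';

// The CJK characters are not separators, so the date fields stay glued together.
$naive = preg_split($separators, $pattern, -1, PREG_SPLIT_NO_EMPTY);
// $naive is ['y年M月d日', 'z', 'ah', 'mm', 'ss']

// Proposed workaround: turn every run of non-ASCII characters into a space.
$ascii = preg_replace('/[^\x00-\x7F]+/u', ' ', $pattern);
$fixed = preg_split($separators, $ascii, -1, PREG_SPLIT_NO_EMPTY);
// $fixed is ['y', 'M', 'd', 'z', 'ah', 'mm', 'ss']
```

Note that this sketch discards 年, 月 and 日 entirely, so they could not be rendered next to the select elements afterwards.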

pine3ree commented 1 year ago

Hello @Slamdunk ,

I believe that those non-ASCII characters (mostly present in Asian languages; I remember Japanese uses them too, as kanji) are meaningful delimiters and should be captured as such (like the 'at' delimiter for en_US), so that they are displayed later on before/after the corresponding "select" element. They usually mean "day", "month", "year", and so on. I am not sure that simply surrounding them with single quotes before parsing will work. Different locales also use them differently.

Anyway, we should either (1) add tests and make the helpers work for all supported locales, or (2) limit the supported locales and add a generic simple alternative for those we do not (won't or can't) support.

kind regards

PS I guess that after JavaScript selectors appeared many years ago, very few developers are nowadays using "select" element groups for "datetime" related inputs.

Slamdunk commented 1 year ago

(2) limit the supported locales and add a generic simple alternative for those we do not (won't or can't) support.

That sounds fair enough to me: would you like to propose such a change?

pine3ree commented 1 year ago

PS As a quick fix (what I added in my plates functions for laminas-form)

        if (!isset($result['month'])) {
            $result['month'] = 'M';
        }

and similar for other missed captures
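One way to cover the other missed captures in a single step is sketched below. It assumes the parsed result is an array keyed by 'day', 'month' and 'year'. Only the 'month' key and its 'M' default come from the snippet above; the other keys and symbols are guesses.

```php
<?php
// Hypothetical parse result in which the month capture was missed.
$result = ['year' => 'y', 'day' => 'd'];

// Assumed fallback symbols for each date part.
$defaults = ['day' => 'd', 'month' => 'M', 'year' => 'y'];

// The array union operator keeps existing keys and adds only the missing ones.
$result += $defaults;
```

Unlike the isset() check, the union does not overwrite a key that exists with a null value, so this is only equivalent when missed captures are absent rather than null.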

(edit) not related to your answer, I saw it after posting

pine3ree commented 1 year ago

btw, this string, with the ideograms wrapped in single quotes, is parsed correctly: y'年'M'月'd'日' z ah:mm:ss
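A quoted string like this could be produced mechanically. The following is only a sketch of that transformation, assuming the input pattern contains no quoted literals yet; it is not something the helper currently does.

```php
<?php
// Unquoted pattern reported earlier in the thread.
$pattern = 'y年M月d日 z ah:mm:ss';

// Wrap each run of non-ASCII characters in single quotes so that it reads as
// a quoted literal. Assumption: nothing in the input is quoted already.
$quoted = preg_replace('/[^\x00-\x7F]+/u', "'\$0'", $pattern);
// $quoted is "y'年'M'月'd'日' z ah:mm:ss"
```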

pine3ree commented 1 year ago

Premise: I deleted all previous comments, since I believe I have found a simpler, generic regular expression for splitting the intl date-time pattern. In expanded format:

const SPLIT_REGEX = <<<EOR
    /
        (
            [^a-z']*
            (?:
                \('[^']+'\)
                |
                '[^']+'
                |
                [^a-z']+
            )+
            [^a-z']*
        )+
    /xiu
    EOR;

together with the modified method:

function getPattern(string $locale, ?IntlDateFormatter $intl = null): string
{
    // Reuse the formatter passed in, otherwise build one for the requested locale
    $intl ??= new IntlDateFormatter($locale, $this->dateType, $this->timeType);

    $pattern = $intl->getPattern();
    // Remove time zone format character present in various forms
    $pattern = str_replace(['(z)', '[z]', 'z ', ' z ' , ' z'], ' ', $pattern);
    // Remove time meridiem character present in various forms
    $pattern = str_replace(['(a)', '[a]', ' a ' , 'a ', ' a'], ' ', $pattern);
    // Cleanup extra inner spaces
    $pattern = preg_replace('/\s+/', ' ', $pattern);
    // Remove trailing commas from previous operations
    $pattern = trim($pattern, ", \t\n\r\0\x0B");

    return $pattern;
}

The regex works like this:

ref: https://www.unicode.org/reports/tr35/tr35-dates.html#Date_Field_Symbol_Table

ref: https://unicode-org.github.io/icu/userguide/format_parse/datetime/#date-field-symbol-table

the splitting string may have:

characters other than ASCII letters: these include standard date-time separators such as /, - and :, as well as all Unicode symbols for year, month, day, etc.

result: https://onlinephp.io/c/97ff2
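For readers who cannot open the link, here is a small self-contained sketch of how the expression might be applied. The preg_split flags are an assumption, the regex is the same expression written on one line, and the first input is the zh_Hans_HK pattern from this thread with z and a removed by hand. The second input is an illustrative en_US-style pattern, not output taken from IntlDateFormatter.

```php
<?php
// Compact, single-line form of the SPLIT_REGEX above (nowdoc avoids escaping).
$splitRegex = <<<'EOR'
/([^a-z']*(?:\('[^']+'\)|'[^']+'|[^a-z']+)+[^a-z']*)+/iu
EOR;

// Assumed flags: keep the delimiters in the result and drop empty pieces.
$flags = PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY;

// zh_Hans_HK pattern with the time zone and meridiem symbols removed.
$zh = preg_split($splitRegex, 'y年M月d日 h:mm:ss', -1, $flags);
// $zh is ['y', '年', 'M', '月', 'd', '日 ', 'h', ':', 'mm', ':', 'ss']

// Illustrative pattern with a quoted literal used as a delimiter.
$en = preg_split($splitRegex, "MMMM d, y 'at' h:mm:ss", -1, $flags);
// $en is ['MMMM', ' ', 'd', ', ', 'y', " 'at' ", 'h', ':', 'mm', ':', 'ss']
```

In both cases the delimiters survive as separate array entries, so they remain available for display around the corresponding select elements.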