picocms / Pico

Pico is a stupidly simple, blazing fast, flat file CMS.
http://picocms.org/
MIT License
3.81k stars 616 forks source link

Broken meta.date_formatted output #560

Closed HuidaeCho closed 3 years ago

HuidaeCho commented 3 years ago

Hello,

I set my date_format in config/config.yml to %Y년 %m월 %d일 in ko_KR.UTF-8 (Korean). When I print {{ meta.date_formatted }}, its output is broken (2019ë 12ì 07ì¼ instead of 2019년 12월 07일) because of line 1,522 in vendor/picocms/pico/lib/Pico.php:

$meta['date_formatted'] = $meta['time'] ? utf8_encode(strftime($dateFormat, $meta['time'])) : '';

I was able to fix this issue by removing utf8_encode():

$meta['date_formatted'] = $meta['time'] ? strftime($dateFormat, $meta['time']) : '';

This issue is due to double UTF-8 encodings. It can be easily replicated:

<?php
$a = "2019년 12월 07일";
echo $a."\n";              # OK
echo utf8_encode($a)."\n"; # BAD; $a is already in UTF-8; don't utf8_encode it again!

outputs

2019년 12월 07일
2019ë 12ì 07ì¼

Please remove utf8_encode() from the above line.

Thanks!

PhrozenByte commented 3 years ago

We really should implement i18n and not rely on strftime() :unamused:

Anyway, since strftime() relies on the system's encoding (which might get changed by setlocale()) we must encode the value on systems using a non-UTF-8 default charset. I'd assume basically all Linux systems to use UTF-8, but there are OS vendors out there which are a few decades behind (Windows, obviously :unamused:), so... Since you use a UTF-8 encoded locale you shouldn't experience any problems with the solution below. However, on a side note for other users, this won't work for people using e.g. ko_KR.EUC-KR. That's because the static parts of your date string are encoded in UTF-8, while the strftime() placeholders would use EUC-KR, resulting in mixed encodings. If you use non-ASCII static chars in your date format you must use a UTF-8 encoded locale.

The funny thing is that PHP includes special charset detection lists (available via mb_detect_order()) for Japanese, Korean, Traditional and Simplified Chinese, Russian, Armenian, Turkish and Ukrainian, so that these locales should work out of the box with the solution below. However, the fallback charset detection list (called "neutral", which I assume is a bit more common…) doesn't include Windows-1252 (which is the most common non-cryllic charset after ISO-8859-1). So we need yet another hack just for PHP... :unamused: So, on yet another side note for other users, if you experience encoding issues with the formatted date string, set mbstring.detect_order in your php.ini to ASCII, UTF-8, <your system's charset>.

Can you please check whether the following works for you? (it should, just wanna be sure)

if ($meta['time']) {
    $encodingList = (mb_detect_order() === [ 'ASCII', 'UTF-8' ]) ? [ 'ASCII', 'UTF-8', 'CP1252' ] : mb_detect_order();
    $meta['date_formatted'] = mb_convert_encoding(strftime($dateFormat, $meta['time']), 'UTF-8', $encodingList);
} else {
    $meta['date_formatted'] = '';
}

Still a hack... As I said, relying on strftime() is bad... But hopefully a better working hack, even though it still has some issues.

HuidaeCho commented 3 years ago

I see. I replaced the original if (empty($meta['date_formatted'])) block with

if (empty($meta['date_formatted']) && $meta['time']) {
    $encodingList = (mb_detect_order() === [ 'ASCII', 'UTF-8' ]) ? [ 'ASCII', 'UTF-8', 'CP1252' ] : mb_detect_order();
    $dateFormat = $this->getConfig('date_format');
    $meta['date_formatted'] = mb_convert_encoding(strftime($dateFormat, $meta['time']), 'UTF-8', $encodingList);
}

and it works! Thanks!

HuidaeCho commented 3 years ago

However, on a side note for other users, this won't work for people using e.g. ko_KR.EUC-KR. That's because the static parts of your date string are encoded in UTF-8, while the strftime() placeholders would use EUC-KR, resulting in mixed encodings.

Here, do you mean that the encoding of config/config.yml can be different from that of the system so that the date_format setting may be in UTF-8 while the system locale may be EUC-KR? In this case, strftime() is passed a UTF-8 string and outputs a EUC-KR string? It may not even work properly, I guess.

Or do you mean that both the encodings of config/config.yml and the system may be in EUC-KR, but the Pico site setting is set to produce UTF-8 and NOT converting EUC-KR strftime() output to UTF-8 can be problematic?

Just curious. Thanks!

HuidaeCho commented 3 years ago

And is this hack going to be part of Pico or should I keep patching it myself?

PhrozenByte commented 3 years ago

Exactly. If you save config/config.yml in UTF-8 (what is default) and add static chars to date_format, the static chars are encoded in UTF-8. They are then passed to strftime() which replaces placeholders with chars encoded in the system's default charset. If the placeholders are replaced by ASCII chars only anyway (like %Y, the four digit representation for the year) it wouldn't make any difference. But for placeholders which yield non-ASCII chars (like %B, the full month name, based on the locale) the replacement chars are encoded in the system's charset, resulting in a string with some chars being encoded in UTF-8 and some in the system's charset. Pico always uses UTF-8.

And is this hack going to be part of Pico or should I keep patching it myself?

Will be released with v2.1.4 :+1: