php / php-src

The PHP Interpreter
https://www.php.net

mb_convert_encoding "\" (backslash) and "~" (tilde) convert failed to Shift_JIS #8281

Closed youkidearitai closed 2 years ago

youkidearitai commented 2 years ago

Description

When a backslash (\) or tilde (~) is converted to Shift_JIS (SJIS) with mb_convert_encoding, the converted character is wrong. Please see the code below and on 3v4l: https://3v4l.org/nSVPB. This reproduces only on PHP 8.1.

The following code:

<?php
var_dump(mb_convert_encoding(mb_convert_encoding("\\", "SJIS", "UTF-8"), "UTF-8", "SJIS"));
var_dump(mb_convert_encoding(mb_convert_encoding("~", "SJIS", "UTF-8"), "UTF-8", "SJIS"));

Resulted in this output:

string(3) "＼"
string(3) "〜"

But I expected this output instead:

string(1) "\"
string(1) "~"

PHP Version

PHP 8.1.4

Operating System

No response

nikic commented 2 years ago

cc @alexdowad

alexdowad commented 2 years ago

Dear @youkidearitai, thanks very much for this report. I will explain the behavior that you are observing below. However, if you have other information or references which may help to make these matters more clear, that is welcome.

The Wikipedia article on Shift-JIS is a good reference to start with. As it explains, single-byte characters in Shift-JIS are not in the ASCII character set, but rather in the JIS X 0201 character set. In JIS X 0201, 0x5C represents a Yen sign (¥, U+00A5). Also, in JIS X 0201, 0x7E represents an overline or overbar (‾, U+203E).

This is different from ASCII, where 0x5C represents a backslash (\) and 0x7E represents a tilde (~).

This means that when converting UTF-8 to Shift-JIS, we cannot correctly convert ASCII 0x5C to a Shift-JIS 0x5C byte (or ASCII 0x7E to a Shift-JIS 0x7E byte), as that would change its meaning. However, Shift-JIS can also represent characters in the JIS X 0208 character set, using 2 bytes per character. Fortunately, JIS X 0208 code 0x2140 (kuten 1-32) is a backslash, and code 0x2141 (kuten 1-33) is a "wave dash", which is similar to a tilde. So we can convert U+005C to the JIS X 0208 backslash, and U+007E to the JIS X 0208 wave dash.
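To make the two readings of these bytes concrete, here is a minimal Python sketch (Python because it is used for comparison elsewhere in this thread). The table `JISX0201_OVERRIDES` and the function `decode_single_byte` are hypothetical names written only for illustration; this is not how mbstring is implemented.

```python
# Illustrative only: the two competing readings of bytes 0x5C and 0x7E.
# The JIS X 0201 Roman set differs from ASCII at exactly these two positions.
JISX0201_OVERRIDES = {
    0x5C: "\u00a5",  # YEN SIGN
    0x7E: "\u203e",  # OVERLINE
}


def decode_single_byte(data: bytes, strict: bool) -> str:
    """Decode bytes below 0x80 as JIS X 0201 (strict) or as plain ASCII."""
    out = []
    for byte in data:
        if byte >= 0x80:
            raise ValueError("this sketch only handles bytes below 0x80")
        if strict and byte in JISX0201_OVERRIDES:
            out.append(JISX0201_OVERRIDES[byte])
        else:
            out.append(chr(byte))
    return "".join(out)


print(decode_single_byte(b"C:\\tmp~", strict=True))   # C:¥tmp‾
print(decode_single_byte(b"C:\\tmp~", strict=False))  # C:\tmp~
```

The same seven bytes yield either a Windows-style path or a string containing a yen sign and an overline, depending only on which table the decoder follows.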

The problem comes when converting in the reverse direction. ASCII has (halfwidth) backslash and tilde. Unicode has both halfwidth and fullwidth backslashes and tildes. However, JIS X 0208 has only one backslash character, which is generally treated as fullwidth, and one "wave dash" character, which is also treated as fullwidth. (JIS X 0201 has neither.)

The upshot of all this is that converting from Unicode → JISX 0201/0208 → Unicode is not a lossless conversion. If we convert JIS X 0208 0x2141 to the halfwidth tilde, then you may be happy, but others who were expecting to get a fullwidth tilde will not be. Likewise if we convert JIS X 0208 backslash to the halfwidth backslash.
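The loss can be seen in a toy model. The Python sketch below uses hand-written tables, not mbstring: it assumes U+005C and U+007E are encoded as the two-byte JIS X 0208 characters (stored in Shift-JIS as 0x815F and 0x8160), and that the decoder returns U+FF3C and U+301C, consistent with the three-byte results in the original report. The names `ENCODE`, `DECODE`, and `round_trip` are hypothetical.

```python
# Toy model of Unicode -> Shift-JIS -> Unicode for four characters.
# Two Unicode characters share each Shift-JIS byte sequence, so the
# decoder has to pick one of them and the other cannot survive a round trip.
ENCODE = {
    "\u005c": b"\x81\x5f",  # REVERSE SOLIDUS (halfwidth backslash)
    "\uff3c": b"\x81\x5f",  # FULLWIDTH REVERSE SOLIDUS
    "\u007e": b"\x81\x60",  # TILDE
    "\u301c": b"\x81\x60",  # WAVE DASH
}

# The decoder's choice (an assumption modelled on the reported output).
DECODE = {
    b"\x81\x5f": "\uff3c",
    b"\x81\x60": "\u301c",
}


def round_trip(ch: str) -> str:
    """Encode one character and decode it again using the toy tables."""
    return DECODE[ENCODE[ch]]


for ch in ENCODE:
    status = "preserved" if round_trip(ch) == ch else "changed"
    print(f"U+{ord(ch):04X} -> U+{ord(round_trip(ch)):04X} ({status})")
```

Whichever entries are placed in `DECODE`, exactly one character of each pair is preserved and the other is changed; swapping the decoder's choice only moves the loss to the other group of users.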

You might wonder why you are only seeing this behavior on PHP 8.1. In short, it is because mbstring was buggy before that; it would convert U+005C to JIS X 0201 0x5C, which, as mentioned, is a different character. Likewise for U+007E and JIS X 0201 0x7E.

Generally, converting text back and forth between different legacy text encodings and expecting to get back what you started with is problematic. Since Unicode was designed to be a superset of all previous text encodings, it is generally best to do all processing in Unicode if possible. If that's not possible, then the best thing to do depends on the situation. If you must receive text in a legacy encoding and output it in the same legacy encoding, it may be best to do all the processing in the same encoding rather than converting to and from Unicode. On the other hand, if you receive text in legacy encodings but do not need to output it, then it would be better to convert to Unicode immediately when ingesting the text, and never convert it back.

(Note that what you are doing in the sample code is something which is almost never necessary or advisable: taking nice Unicode text, converting it to a legacy format, and then back to Unicode again. Converting from legacy → Unicode → legacy might sometimes make sense; converting from Unicode → legacy → Unicode almost never does.)

Please feel free to share any clarifying remarks, and thanks again.

youkidearitai commented 2 years ago

Dear @alexdowad, thank you very much for your reply.

I know that 0x5C and 0x7E in JIS X 0201 are different from the ASCII backslash and tilde. But most Japanese users do not expect a strict conversion of 0x5C to U+00A5 and 0x7E to U+203E.

At the very least, there are very few cases where backslash and tilde should be converted to multibyte characters.

The "現実的解決" (Realistic solution) section of the Japanese Wikipedia article on the yen sign problem describes the practical approach:

この問題に対する現実的解決として、ほとんどの環境では日本の円記号はUnicodeのバックスラッシュ (U+005C) に変換される。 (As a practical solution to this problem, Japanese yen symbols are converted to Unicode backslashes (U+005C) in most environments.)

Some Japanese users see a yen sign on screen, but the underlying code is almost always 0x5C.

Because of this, Japanese software generally passes 0x5C and 0x7E through unchanged, whether the text is treated as ASCII or as JIS X 0201.

Reference: プログラマのための文字コード入門 (ISBN978-4-7741-4164-0) Page 316 - 318

alexdowad commented 2 years ago

@youkidearitai, Thanks for those references. This was particularly interesting:

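多くの日本語JISキーボードでは円記号とバックスラッシュのキーが別々に存在しているが、どちらを入力しても005Cが入力されるようになっている。 (On many Japanese JIS keyboards the yen sign and the backslash have separate keys, but pressing either one enters 005C.)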

That is completely insane. Whoever came up with that idea deserves an award for bad design.

Thanks also for the reference to the book プログラマのための文字コード入門, though it doesn't seem there is any way I can access a copy right now. I will be in Japan in a few months (COVID-19 situation allowing) and could look it up then, but it would be nice to conclude this issue faster than that. 😆

As we consider this matter further, I think it would also help if you could explain: What is your use case here? How would treating SJIS 0x5C as a halfwidth backslash (U+005C) rather than a Yen sign (U+00A5) affect the software which you are developing or maintaining?

What is the typical source of SJIS text data for PHP-based software that you are working on, or are aware of? Text files uploaded by users? Direct text inputs in entry fields on a web site, from users whose OS supports SJIS input? Are we talking about newly created data, or legacy data which has been around for a long time, but you still need to process?

youkidearitai commented 2 years ago

@alexdowad, thanks for the reply. I'll answer your questions.

Shift_JIS is still in active use, and a common use case is importing CSV into Excel. Since Excel could only import Shift_JIS-encoded CSV for a long time, there may still be cases where text is converted to SJIS.

In other words, mb_convert_encoding is called when downloading and uploading CSV. Also, Windows uses \ (0x5C) as the directory separator. If this byte is converted, it will no longer function as a directory separator.

This conversion of 0x5C and 0x7E can be confusing. In most Japanese character code implementations, it seems customary to pass 0x5C and 0x7E through untouched.

Japanese PHP users may have other opportunities to use Shift_JIS, so I was very interested in talking about this problem.

alexdowad commented 2 years ago

Shift_JIS is still in active use, and a common use case is importing CSV into Excel. Since Excel could only import Shift_JIS-encoded CSV for a long time, there may still be cases where text is converted to SJIS.

In other words, mb_convert_encoding is called when downloading and uploading CSV.

OK. So you have users uploading CSVs which are Shift-JIS encoded, and you also export Shift-JIS-encoded CSVs for download? Are there reasons why you typically need to convert the text in those CSVs to Unicode?

Is it correct to say that when your users upload Shift-JIS-encoded CSVs, those may include 0x5C bytes, and the users expect those to be treated as backslashes? How about when you export Shift-JIS-encoded CSVs for download? Is it typical that you need to include halfwidth backslashes in them?

Do you offer options for the encoding of these files? Or you export in Shift-JIS encoding by default, because that is what works for the greatest number of your users?

Also, Windows uses \ (0x5C) as the directory separator. If this byte is converted, it will no longer function as a directory separator.

That's true; but remembering the context here, we are looking at use cases for conversion between Shift-JIS and Unicode, and trying to determine whether there are more cases where interpreting 0x5C/0x7E according to spec is preferable, or more cases where interpreting them as ASCII is preferable. The discussion is not general, but specific to mb_convert_encoding and the needs of its users.

Does Shift-JIS text that PHP-based applications receive from users typically include Windows pathnames? After converting to ASCII/UTF-8/etc., would your PHP code typically take those pathnames and interpolate them into a Windows shell command, or pass them to fopen? If so, is there a reason to convert to Unicode, or could the 'raw' Shift-JIS text be used?

This conversion of 0x5C and 0x7E can be confusing. In most Japanese character code implementations, it seems customary to pass 0x5C and 0x7E through untouched.

It is certainly confusing. My preference is to follow published specifications when possible; I believe this tends to reduce confusion in the long term. However, if there are strong practical reasons to deviate from specifications, that can certainly be done.

However, we do not want to flip-flop back and forth. To avoid flip-flopping, we need to thoroughly understand all the implications of either following the spec or deviating from it. After all factors are considered, and as many interested parties as possible are consulted, if the final decision is to change, then we should document the reason for the decision and stick to it.

One of the great challenges involved in working on open-source, and especially popular projects like PHP, is that users only speak up when something is not working well for them. With proprietary, in-house software, you generally know who all the users are and can survey them to see what they think about proposed changes. With open source, you often only discover later how your users were impacted by some change. This is a good reason to carefully think changes through and gather as much information as possible before deciding.

You mentioned that it seems customary to treat SJIS 0x5C and 0x7E as ASCII; if you can share as many specific examples as possible of existing software which does or doesn't do this, that would be appreciated. What is 'customary' is definitely one important factor to consider, since it shapes people's expectations.

youkidearitai commented 2 years ago

There are tons of cases where Unicode and Shift_JIS conversions are involved in CSV uploads and downloads, and it can be difficult to find out how many there are. Similarly, it's hard to find out how often Windows paths stop working.

However, many users will have serious trouble when 0x5C and 0x7E suddenly get "strict" conversions the moment they upgrade to PHP 8.1. This is reason enough for Japanese users to hesitate to upgrade to PHP 8.1.

This is because Japanese users perceive these characters as being converted to different characters.

It is certainly confusing. My preference is to follow published specifications when possible; I believe this tends to reduce confusion in the long term. However, if there are strong practical reasons to deviate from specifications, that can certainly be done. However, we do not want to flip-flop back and forth. To avoid flip-flopping, we need to thoroughly understand all the implications of either following the spec or deviating from it. After all factors are considered, and as many interested parties as possible are consulted, if the final decision is to change, then we should document the reason for the decision and stick to it.

As you said, I think it is correct to follow the published specifications. However, this is a change that breaks backwards compatibility, and if so, I feel that this change should be discussed in PHP RFCs and so on.

As for what is customary: at least in Python 3, when 0x5C or 0x7E is converted to Shift_JIS, it is passed through as is.

$ python3
Python 3.8.10 (default, Mar 15 2022, 12:22:08)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> '\\'.encode("SJIS")
b'\\'
>>> '\\~'.encode("SJIS")
b'\\~'
>>> 'あ'.encode("SJIS")
b'\x82\xa0'
>>>

I also tried it with Ruby 2.7. It, too, converted them as is.

$ ruby --version
ruby 2.7.0p0 (2019-12-25 revision 647ee6f091) [x86_64-linux-gnu]
$ ruby -e 'puts "あ".encode("SJIS").encode("UTF-8")'
あ
$ ruby -e 'puts "\\".encode("SJIS")'
\
$ ruby -e 'puts "~".encode("SJIS")'
~

cmb69 commented 2 years ago

Apparently, there are several variations of Shift JIS in use, of which some are already supported by MBString. In this case @youkidearitai is likely looking for Windows CP 932 ("CP932"), which behaves as expected wrt. ~ and \, see https://3v4l.org/ZjiBS.
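The same split between variants exists outside PHP. As a hedged comparison, Python ships separate `shift_jis` and `cp932` codecs; the sketch below exercises only those two standard codecs and says nothing about mbstring's own tables.

```python
# CP932 keeps ASCII backslash and tilde as single bytes in both directions.
text = "C:\\data~1"
encoded = text.encode("cp932")
print(encoded)                          # b'C:\\data~1'
print(encoded.decode("cp932") == text)  # True

# The two codecs disagree on the two-byte sequence 0x81 0x60.
wave = b"\x81\x60"
sjis_char = wave.decode("shift_jis")
cp932_char = wave.decode("cp932")
print(f"U+{ord(sjis_char):04X}")   # U+301C (WAVE DASH)
print(f"U+{ord(cp932_char):04X}")  # U+FF5E (FULLWIDTH TILDE)
```

So choosing "CP932" instead of "SJIS" fixes the backslash and tilde round trip, but it also selects a different table for some two-byte characters.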

I think this is just something that we should document better.

alexdowad commented 2 years ago

@cmb69, excellent point! Indeed, CP932 is Microsoft's version of Shift-JIS.

youkidearitai commented 2 years ago

As a Japanese user, it's sad that this wasn't communicated properly. How should Japanese SJIS users upgrade to PHP 8.1?

alexdowad commented 2 years ago

As a Japanese user, it's sad that this wasn't communicated properly.

Indeed. Probably that was because the change was considered to be a "bug fix".

Although @cmb69 has made a good point (that the text encoding which you are interested in does actually still exist under a different name), I don't think that is necessarily the 'last word' on this issue. We are still open for suggestions.

alexdowad commented 2 years ago

Just did a brief search for Composer packages which might be affected by this issue.

@SUKOHI's FluentCsv library uses SJIS-win, which is another name for CP932. (Good!) Likewise, @gh640's sjis-stream-filter library uses SJIS-win.

However, @gh640 has another library called sjis-zip which uses SJIS (not SJIS-win). Also, @quyhoa has a library called Commoncsv which also uses SJIS.

Comments from any of these developers on this GH issue would be much appreciated.

Does anyone have a good way to do a text search across all Composer packages? I seem to remember that @nikic has, on occasion, mentioned that "the top 2000 Composer packages don't use such-and-such"; maybe he has something along those lines?

cmb69 commented 2 years ago

@alexdowad, see https://github.com/nikic/popular-package-analysis

zonuexe commented 2 years ago

Hi @alexdowad. First of all, I would like to express my gratitude and respect to you and the original developers of mbstring, as I know your refactoring achievements through this article.

Conversion between multiple character sets is always difficult, and this problem has plagued the Japanese for 30 years with Unicode, and the conversion map between JIS and Unicode in the original mbstring is not just a bug in the spec. As the behavior of Ruby shows, it is based on the customs and use cases of many Japanese users from that time to the present.

unicode.org provides JIS and Unicode conversion maps at https://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/, but please note in particular that these are not part of the standard and are just reference data.

Replacing \ with ¥ and preserving it as a backslash are both "correct" implementations.

The difference between these ideas is also expressed in iconv and nkf. The Japanese-implemented nkf is not a standard command, but it is still used by old school Japanese UNIX users.

% echo '\~abc' | iconv -f sjis -t utf8
¥‾abc
% echo '\~abc' | nkf -Sw8
\~abc

In addition to Wikipedia, several companies and Japanese developers explain how to implement JIS and Unicode mapping.

As mentioned in the discussion so far, Microsoft has shown an extraordinary obsession with displaying backslash as ¥ in a number of Japanese locales.

Microsoft still describes CP932 to users as Shift_JIS or "シフトJIS", so due to their efforts many Japanese are unaware of these encodings and character sets. The same is true for many Japanese PHP programmers.

This is a post by a forum user, but it contains an Excel screen that says "シフトJIS" (Shift_JIS).

[Image: Excel screen showing "シフトJIS"]

Although Microsoft has increased Unicode support for Excel in recent years, Japanese users still believe that converting to Shift_JIS(CP932) for importing and exporting data between the Web and Excel is a safe and secure method.

These are just a few, and some of these articles show code that outputs broken CSV, but many Japanese users are more concerned about "文字化け" (mojibake).

It is believed that many Japanese companies using Windows still use that method. Unfortunately, many of them are not interested in disseminating information to the tech community. (Moreover, many programmers hired by such companies may not be aware of Composer's existence...)


I think changing the character encoding and conversion map should have been a careful debate, but I agree that "flip-flops" can cause further confusion.

Converting all SJIS to CP932 is not a good option for backwards compatibility as Shift_JIS and CP932 have different conversion maps for Chinese characters.

The improvement I suggest is to specify in PHP: mb_convert_encoding - Manual that the conversion map has changed in PHP 8.1 and provide a backwards compatible and secure workaround.

Since the only characters converted from ASCII in PHP 8.1 are ~ and \ (https://3v4l.org/eLeHE), it is possible to maintain compatibility of the conversion results by converting these with strtr().

$str = strtr(mb_convert_encoding($str, 'UTF-8', 'SJIS'), ['¥' => '\\', '‾' => '~']);
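For readers applying the same post-processing outside PHP, the idea can be sketched in Python with `str.translate`. The names `FIXUP` and `normalize_sjis_ascii` are hypothetical. Note the trade-off, which applies equally to the `strtr()` line above: a genuine yen sign or overline in the decoded text is rewritten as well.

```python
# Map the two JIS X 0201 characters back to their ASCII look-alikes.
FIXUP = str.maketrans({"\u00a5": "\\", "\u203e": "~"})


def normalize_sjis_ascii(decoded: str) -> str:
    """Replace YEN SIGN with backslash and OVERLINE with tilde."""
    return decoded.translate(FIXUP)


print(normalize_sjis_ascii("C:\u00a5Users\u00a5me\u203e1"))  # C:\Users\me~1
```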

I am grateful to all of you for your efforts on these issues.

alexdowad commented 2 years ago

@zonuexe Wow!!! I am floored by the quality, detail, and lucidity of your comment here. (Picks jaw up from floor.) Thanks for your contribution to this discussion.

It will take me some time to digest all the references in your comment. However, if I can ask a couple of questions first...

zonuexe commented 2 years ago

@alexdowad Thank you for your response.

Do you think these changed mappings should be rolled back at least temporarily, so there is more time to work on documentation, etc?

No. It is not desirable for minor versions to change the behavior. The behavior in PHP 8.1 surprises some Japanese people, but it can be dealt with if it is clearly stated in the specification. I now think it would be better to roll back to a conversion map prior to PHP 8.0.

Do you think there is any value in providing both 'strict' and... uh... 'not strict' SJIS conversion modes which could be selected via an .INI setting or the like? Or would that just be adding complexity with little real benefit to PHP users? (Of course, if we did add a config setting, 99% of users would probably never read the documentation, and would never know about it or use it.)

This is a difficult decision, but I think introducing strict mode with ini or other parameters will increase uncertainty and make it difficult to predict and convert results. Another option is to add another encoding name like "SJIS-strict" or "SJIS-compat", but it seems a bit obscure.

Overall, I think it is realistic to document the current implementation.

sj-i commented 2 years ago

As a Japanese user, I think rolling back for now is also an OK choice.

According to the stats from Packagist (https://packagist.org/php-statistics), PHP 8.1 adoption is about to reach 20%, but it is still only 20%. Only a subset of those users are affected by this BC break, and they should already have applied a workaround for it. An even smaller subset will have a workaround that would cause problems if this BC break were "fixed" in a point release. The number of people affected by such a fix will be much smaller than the number who upgrade to 8.1 in the future and are affected by the current incompatibility.

zonuexe commented 2 years ago

There are good reasons and pains for both rolling back and not rolling back, so take my opinion as one of the judgments.

As @sj-i said on Twitter, PHP's mbstring has a 20-year history, and I found it important to point out that its behavior is a not-so-small part of the SJIS conversion convention.

yandod commented 2 years ago

I can easily imagine that there are a lot of websites and systems with features that depend on the existing mbstring behavior around the handling of these encodings. If they notice a BC break in the latest PHP, they may stop upgrading to newer versions.

Since the real world is chaotic when it comes to Japanese character encoding, introducing "clean" solutions is very difficult. Staying with the old behavior looks like the safest option to me.

sj-i commented 2 years ago

BTW, on the current implementation, SJIS-win is an alias of CP932, thus the return value of mb_list_encodings() never contains 'SJIS-win'. I have created a separate issue for this. #8308

@alexdowad I am no representative of the Japanese community, but I apologize for the lack of feedback from the Japanese community prior to the release of PHP 8.1 regarding the mbstring compatibility. Also, thank you for caring so well about the implementation of mbstring, which the Japanese don't touch much these days...!

alexdowad commented 2 years ago

BTW, on the current implementation, SJIS-win is an alias of CP932, thus the return value of mb_list_encodings() never contains 'SJIS-win'. I have created a separate issue for this. #8308

Thanks for opening that issue. I have added some comments there.

alexdowad commented 2 years ago

Just read a few of the articles linked to above by @zonuexe, still remaining with a few more to read.

zonuexe commented 2 years ago

@alexdowad I have presented these references as a separate source to show that there is more than one type of character set conversion map like Japanese Shift-JIS Character Mapping - IBM Documentation. It seems that you have already implemented it based on the necessary knowledge. There may be little information available from these haystacks.

alexdowad commented 2 years ago

The discussion here seems to have fallen quiet.

I think the discussion in #8308 has convinced me that adjusting mappings for 'SJIS' to comply with JIS X 0201 was a mistake, and there are good reasons not to follow that specification.

I would like to submit a PR to revert that change, but first, are there any other mappings for SJIS or SJIS variants which are a concern for @youkidearitai or other interested parties who are following this thread? Or is it just SJIS 0x5C → Unicode and SJIS 0x7E → Unicode?

youkidearitai commented 2 years ago

Sorry, I closed this issue by mistake. As far as I can see, it's just a matter of 0x7E and 0x5C. There may be other problems, but I'd also like to see this massive overhaul of mbstring go ahead, so if I find something, I'd like to open a new issue. How does that sound?

I also want to hear what Japanese people think.

youkidearitai commented 2 years ago

I've gathered various opinions on this issue. Most Japanese programmers already see SJIS as a legacy character encoding, and the general view seems to be that backward compatibility should be maintained. There seem to be no backward compatibility issues other than 0x5C and 0x7E, so I'd appreciate it if you could fix it.

alexdowad commented 2 years ago

@youkidearitai Absolutely, that will be done!

Thank you very much for reporting this issue and for following up on it.

youkidearitai commented 2 years ago

Thank you, everyone!