php / php-src

The PHP Interpreter
https://www.php.net
Other
38.19k stars 7.75k forks source link

fopen creates files with utf characters under certain conditions #13640

Open ks74 opened 8 months ago

ks74 commented 8 months ago

Description

The following code:

if (@$_GET["charset"]) echo("set charset from ".ini_set("default_charset", "Windows-1251")."<BR>\n");
//ini_set("default_charset", "Windows-1251");
$FN="C:/Temp/АБВГДЕЖЗ.txt"; //no UTF here!
echo("filename: ".$FN."<BR>\n");
$hFile=fopen($FN, 'w'); fwrite($hFile, $FN); fclose($hFile); 

Resulted in this output: the created file was C:\TEMP\АБВГДЕЖЗ.txt (filename got converted to Utf-8) But I expected this output instead: the file should be C:\TEMP\АБВГДЕЖЗ.txt

Explanation: "АБВГДЕЖЗ" here are the first letters of Russian abc. They could be any other Russian letters or non-latin letters of another non-English language / charset. The initial setting of default_charset for me is UTF-8 (as returned by ini_set()).

This is not a trivial bug and it's not easy to reproduce it, yet it is possible. It only happens if a script calls ini_set("default_charset", "Windows-1251"); It is related to Apache's keep-alive connections reused for later requests.

Steps to reproduce:

  1. open 2 tabs in your browser, one with the script as is (for ex. dev.php) and another with charset parameter (dev.php?charset=1). If you run both tabs now they both will work correct.
  2. restart Apache
  3. refresh the 1st tab first several times (at least 2, the more the better). This creates Apache threads with a context of Utf-8. The created file is Ok.
  4. refresh the 2nd tab once. It creates a filename with Utf-8 characters. The reason is it is processed by a thread that just served the same script without changing default_charset.

For some time, no matter how many times you refresh 2nd tab, the file created will have Utf chars. But after some time the phenomenon will stop. Also, the bug may show even without restarting Apache, if you first call 1st tab many times, then wait for some time, like 2-3 minutes. But it takes longer.

This bug affects other files-related functions like file_exists() etc. Both Php and Apache have been upgraded to their last versions: Apache/2.4.58 (Win64) PHP/8.3.3

PHP Version

8.3.3

Operating System

Windows 8.1

ranvis commented 8 months ago

Can be reproduced on CLI too. On Windows, path encoding depends on default_charset, internal_encoding and context. Usually those ini settings aren't changed during runtime, but the current implementation looks confusing and problematic:

ini_set('default_charset', 'UTF-8');
ini_set('input_encoding', 'UTF-8');
ini_set('output_encoding', 'UTF-8');
ini_set('internal_encoding', 'UTF-8');

$path = 'АБВГДЕЖЗ.txt'; // Note this is in UTF-8.
$mbEncoding = 'Windows-1251';
$mbPath = mb_convert_encoding($path, $mbEncoding, 'UTF-8');
var_dump(touch($path)); // true

function test(string $path, string $mbPath): void {
    if (file_exists($path)) { echo "UTF-8\n"; }
    else if (file_exists($mbPath)) { echo "Windows-1251\n"; }
    else { echo "Missing\n"; }
}

test($path, $mbPath); // UTF-8

ini_set('default_charset', $mbEncoding);
test($path, $mbPath); // Windows-1251
ini_set('default_charset', 'UTF-8');

ini_set('internal_encoding', $mbEncoding);
test($path, $mbPath); // Windows-1251
ini_set('internal_encoding', 'UTF-8');

ini_set('internal_encoding', $mbEncoding);
ini_set('default_charset', 'UTF-8');
test($path, $mbPath); // UTF-8
ini_set('input_encoding', 'UTF-8');
test($path, $mbPath); // Windows-1251 ⚠

https://github.com/php/php-src/blob/3a832a2aad405466c24a5e8e5798cf9de14fda14/main/main.c#L545-L617

Edit: I'm sorry if I read OP incorrectly. I don't think "dev.php?charset=1" tab will work correctly.

ranvis commented 8 months ago

sapi_windows_cp_set() also affects to path, not just console codepage.

test($path, $mbPath); // UTF-8
sapi_windows_cp_set(1251);
test($path, $mbPath); // Windows-1251
ks74 commented 8 months ago

One more point. In order for the bug to show, a file with exactly the same name has to be saved in Utf-8 more (that's tab 1 with dev.php in my case). As I use one script for the test, the condition is met. But if I intentionally use different filenames in each mode, the bug doesn't happen. One way to do it is this:

$b1251=@$_GET["charset"]; 
$FN="C:/Temp/АБВГДЕЖЗ".($b1251 ? "-cs" : "").".txt"; 

That way tab 1 saves АБВГДЕЖЗ.txt while tab 2 saves АБВГДЕЖЗ-cs.txt no matter what I do (I can't reproduce the bug). That hints that the reason of the bug also includes some filename caching between http-requests in reused connections. (I tried clearstatcache(), it didn't help)

I don't think "dev.php?charset=1" tab will work correctly

Even if I change default_charset 3 times the original script (with ?charset=1) works correctly.

ks74 commented 8 months ago

Regarding CLI mode. The test() function is a little incorrect in that it checks Utf and if the file exists it doesn't check mb. But if it did check mb the file might exist, too. A better version could be this: (sorry I removed all those param types)

function test($path, $mbPath){

$CS="";
if (file_exists($path)) $CS.="UTF-8 ";
if (file_exists($mbPath)) $CS.="Windows-1251 ";
if (!$CS) $CS="Missing";

echo $CS."\n";
}

and when I run it first it displays "UTF-8 Windows-1251". Cause when the charset wasn't changed halfway Php works properly and both charsets can be used for filenames. Similarly, if you create the file with any of $path / $mbPath the result is the same.

In my case your script produced Missing in most cases except the 1st and last one. This seems to make more sense cause after you change charset on the fly Php can't properly handle strings no matter what charset they have. At the same time it finds both names when the charset doesn't change (all or nothing).

I'm not sure this is exactly the bug I described cause on CLI there is no such thing as reused http-connections with a context. Anyway, thanks for the help!

ranvis commented 8 months ago

Right, sorry for the noise. It looks like I was looking at the completely different thing.

ks74 commented 8 months ago

Right, sorry for the noise.

Well, at least you made me make more tests and find out what I wrote above, so this can't be noise at all : )

ranvis commented 8 months ago

OK. I have reproduced what you had mentioned. What I have been missing is that you have written the repro code in Windows-1251, not UTF-8. I've tried with CP932 (because of my Windows language, that covers a part of Cyrilic characters BTW), and confirmed that after accessing to dev.php, the subsequent request to dev.php?charset=1 produces garbled file name "ミ籍岱漬寅頒片孟・txt".

The problem could be circumvented with clearstatcache(true);. Maybe you had tried without true.

When the encoding PHP expects is UTF-8, Windows PHP first tries to decode the path "...АБВГДЕЖЗ..." from UTF-8 to UTF-16LE (Windows internal encoding). But since it is invalid as a UTF-8 sequence (in Windows-1251, it is C0 C1 C2 C3 C4 C5 C6 C7), the conversion fails. Then PHP tries with the current so-called ANSI code page, ACP, 1251 for your environment and 932 for my one, instead of UTF-8. https://github.com/php/php-src/blob/01b3fc03c30c6cb85038250bb5640be3a09c6a32/win32/codepage.h#L95-L104

That is why both conditions (charset=0/1) run correctly if executed alone.

It is problematic that the realpath cache is not considering the final encoding. But the real problem I think is, like written in the comment of the code I quoted above, that you are passing Windows-1251 encoded path when PHP expects path to be in UTF-8. If you make sure that you always pass correctly encoded path to IO functions, the problem should not occur even without clearing realpath cache, like:

$FN = mb_convert_encoding($FN, ini_get("default_charset"), 'Windows-1251');

Note that what encoding IO functions expects is somewhat strange (that is what I mentioned in the first comment), and it is not necessarily be the value of default_charset.

ks74 commented 8 months ago

What I have been missing is that you have written the repro code in Windows-1251, not UTF-8.

Yes, sorry I didn't mention it. I added a comment in the description.

If you make sure that you always pass correctly encoded path to IO functions, the problem should not occur

Yep, I confirm that converting filename to Utf in Utf mode prevents the problem to show.

The problem could be circumvented with clearstatcache(true);. Maybe you had tried without true

I confirm that clearstatcache(true) fixes the problem too. I tried without true and with true AND the 2nd parameter which was $FN and that was my mistake cause the cache entry could not be found for the same reason of different charsets.

It is problematic that the realpath cache is not considering the final encoding

Yep, I guess it uses original strings as keys (instead of final ones).

But the real problem is that you are passing Windows-1251 encoded path when PHP expects UTF-8.

That makes sense, and another way to look upon it is that one should not mix different charsets in one project. My project had Utf-8 as the default charset, but most of my scripts start with ini_set("default_charset", "Windows-1251"); Yet sometimes I debug things in dev.php which doesn't have the line. That created the mix and hence - eventually - the problem.

Or course, my real fix was to put php_value default_charset "Windows-1251" in httpd-vhosts.conf so that I don't have to think about changing charset or encoding paths ever again.

Thank you!

cmb69 commented 3 months ago

It is problematic that the realpath cache is not considering the final encoding.

Yes, if scripts switch the charset while the PHP process keeps running. If they don't, there shouldn't be any issue. See https://www.php.net/manual/en/migration71.windows-support.php for some general information about the UTF-8 path support introduced in PHP 7.1.

Now, I'm not sure what to do with this ticket. Is there different behavior on other platforms? Or are these not as … "flexible" as Windows in this regard?

ks74 commented 3 months ago

Sorry, I can't test other platforms.

Nor can I say whether Php should learn to consider any possible charset change done on the fly (during execution) for realpath cache. Perhaps it should if it doesn't take much effort. I realize this situation is rare, so the answer should be a tradeoff of how much efforts the solution would take and how much it would probably slow down realpath cache operation.