php / php-src

The PHP Interpreter
https://www.php.net
Other
38.31k stars 7.76k forks source link

Multi-Byte character breaks tempnam functionality #16845

Open bezin opened 1 week ago

bezin commented 1 week ago

Description

Hi all,

first of all thanks for all the development on PHP. Love the language more than ever. In a recent project I encountered this bug (?):

tempnam() shortens the given prefix after character 63. When passing a multi-byte character in this position, the shortened prefix contains a malformed character. MacOS does not seem to be able to handle that, hence no file is created and a notice is triggered. The PHP docs do not state, that multi-byte strings are not allowed with tempnam().

To reproduce

  1. Create a new folder
  2. Create script test.php in this folder and paste the following code into it:
    <?php
    // String concatenation only for highlighting the malicious character
    $prefix = 'this_string_pushes_the_multi_byte_character_in_position_______' . 'ä' . 'foobar';
    var_dump(tempnam(__DIR__, $prefix));
  3. Run php test.php in that directory.

Desired behaviour

It should create a temporary file in this folder and tempnam() should return the path to the file:

string(108) "/Users/benedict/dev/multibyte/this_string_pushes_the_multi_byte_character_in_position_______atWEta7"

Actual behaviour

On MacOS (tested with Sonoma 14.6.1 and Sequoia 15.1) no temporary file is created, a notice is triggered and tempnam() returns false:

PHP Notice:  tempnam(): file created in the system's temporary directory in /Users/benedict/dev/multibyte/test.php on line 16

Notice: tempnam(): file created in the system's temporary directory in /Users/benedict/dev/multibyte/test.php on line 16
bool(false)

When you change the ä in position 63 for anything but a multi-byte character, everything works as expected

If I can provide any further information, please let me know.

PHP Version

PHP 8.3.13

Operating System

MacOS Sonoma 14.6.1

devnexen commented 1 week ago

cc @youkidearitai

fritzmg commented 1 week ago

I can confirm the issue under Linux:

php -r "echo tempnam(__DIR__, 'this_string_pushes_the_multi_byte_character_in_position_______ä_foobar');"
…/this_string_pushes_the_multi_byte_character_in_position_______�0e3XkX

Windows does not seem to be affected (note: under Windows it only uses the first 3 characters of the prefix):

php -r "echo tempnam(__DIR__, 'abä');"
…\abä8B7D.tmp
youkidearitai commented 1 week ago

Hi, @bezin !

Thank you for report. This is filesystem problem. macOS filesystem is HFS+. HFS+ is normalized in NFD. For example, ä is \uc2a4 in UTF-8. However, NKD normalized ä is a\ucc88.

Based on this, below code is success and not displays error.

<?php
$prefix = 'this_string_pushes_the_multi_byte_character_in_position_______' . 'a\ucc88' . 'foobar';
var_dump(tempnam(__DIR__, $prefix));

If you want to normalize in PHP, please use Normalizer::normalize and normalizer_normalize.

<?php
$prefix = 'this_string_pushes_the_multi_byte_character_in_position_______' . normalizer_normalize("ä", Normalizer::NFD) . 'foobar';
var_dump(tempnam(__DIR__, $prefix));

reference:

fritzmg commented 1 week ago

@youkidearitai the issue is not the filesystem itself. The issue is that tempnam() cuts off multi-byte characters, resulting in an invalid character in the prefix (see examples above). The issue with the filesystem is just an error after the fact.

youkidearitai commented 1 week ago

@fritzmg I see. I tried reading the source code, but maybe it doesn't support multibyte. I feel like I need more understanding.

bezin commented 1 week ago

Dear @youkidearitai, thanks for taking the time to look into this!

I tested you proposed solution and it does indeed work with a normalized ä, however it does not work even with a normalized ß, leading to the same unexpected behaviour as stated above:

> php -r "var_dump(tempnam(__DIR__, 'this_string_pushes_th
e_multi_byte_character_in_position_______' . normalizer_normalize('ß', Normalizer::NFD) . 'foobar'));"

PHP Notice:  tempnam(): file created in the system's temporary directory in Command line code on line 1

Notice: tempnam(): file created in the system's temporary directory in Command line code on line 1
bool(false)

I second @fritzmg here, as I would expect the function to correctly handle multi-byte strings internally.

EDIT:

Adding insights by @gmpf:

When you use a normalized ä but remove one of the underscores, moving the ä to position 62, you run into the same issue:

> php -r "var_dump(tempnam(__DIR__, 'this_string_pushes_th
e_multi_byte_character_in_position______' . normalizer_normalize('ä', Normalizer::NFD) . 'foobar'));"

PHP Notice:  tempnam(): file created in the system's temporary directory in Command line code on line 1

Notice: tempnam(): file created in the system's temporary directory in Command line code on line 1
bool(false)

This is probably due to the fact that the normalization converts ä into a and the diaresis, which itself is a multi-byte character. Hence, character 63 in the string is again a multi-byte character:

php -r "var_dump(str_split(normalizer_normalize('ä', Nor
malizer::NFD)));"
array(3) {
  [0]=>
  string(1) "a"
  [1]=>
  string(1) "�"
  [2]=>
  string(1) "�"
}
fritzmg commented 1 week ago

@fritzmg I see. I tried reading the source code, but maybe it doesn't support multibyte. I feel like I need more understanding.

$prefix of tempnam() intentionally not supporting multi-byte strings would still be fine - but it would need to be documented at least here.

However, I would argue that tempnam() should take multi-byte strings into account.

cmb69 commented 1 week ago

https://github.com/php/php-src/blob/bd5939d57e6dda2b7ccfd2eb7d11656a55e4d63e/ext/standard/file.c#L694-L697

While php_basename() caters to multibyte characters, the following code does not. Replacing that code with

p = php_basename(prefix, MIN(prefix_len, 63), NULL, 0); 

might work, if there wasn't

https://github.com/php/php-src/blob/bd5939d57e6dda2b7ccfd2eb7d11656a55e4d63e/ext/standard/string.c#L1457-L1461