escapeshellarg() strips off bytes which are not valid characters (leading to file not found errors)

ohyeaah commented 8 months ago

Description

It seems escapeshellarg() strips off bytes which are >= 128 probably because they are not valid UTF-8 characters.

Because it's not possible to provide an encoding to escapeshellarg() and this behaviour is not documented I think it's a bug.

Also if this is expected behaviour then which function could be used instead? There is no other function which could do this.

Please note that the code example runs on Linux only because Windows (probably) doesn't have the cat command. Also I tested this on Linux only. Also note it's not necessary to close the php tag.

<?php
$fname = "asdf.txt".chr(128);
file_put_contents($fname, "hey!");
$cmd = "cat ".escapeshellarg($fname);
echo "cmd: $cmd\n";
passthru($cmd);

Resulted in this output:

cmd: cat 'asdf.txt'
cat: asdf.txt: No such file or directory

But I expected this output instead:

cmd: cat 'asdf.txt'$'\200'
hey!

PHP Version

PHP 8.2.7 (cli) (built: Jun 9 2023 19:37:27) (NTS)

Operating System

Debian GNU/Linux 12 (bookworm)

damianwadley commented 8 months ago

Using the top comment in the documentation for inspiration,

What's your current locale settings and have you tried changing them to something UTF-8-capable?

damianwadley commented 8 months ago

Actually, I'm just going to skip ahead and move this.

escapeshellarg and escapeshellcmd are both subject to the current locale, which makes sense given their intentions, so to pass a 0x80 (or UTF-8 string, or anything non-ASCII) as-is you'll need to be in a locale that supports it. This should apply equally to Windows. Under the hood, the man page to look up is for mbrlen(3) if available or else mblen(3).

Naturally, the docs should mention this.

ohyeaah commented 8 months ago

@damianwadley My locale was already set to UTF-8 and also calling setlocale() explicitly doesn't change anything.

damianwadley commented 8 months ago

@ohyeaah 0x80 isn't valid UTF-8 so it was getting stripped off.

I'm not sure there's a proper way of handling 0x80 in the same way your example demonstrates. I'd expect that in a real-world situation you'd have an actual character there, not literally just that byte - after all, escapeshellarg is for unknown inputs, and I can't imagine you'd want your unknown input to be dealing with file names consisting of unknown byte ranges.

After some testing, I think your solution should be to (install and) use an appropriate locale for your byte sequence - one of the ISO 8859s, for instance. Or in other words, to use a locale with the same character encoding you used when constructing the filename.

ohyeaah commented 8 months ago

@damianwadley Well I now understand how escapeshellarg() works and that it depends on the current locale.

But the point of my question was another: WHY does escapeshellarg() strip off bytes that are not part of the current locale at all?

Shouldn't escapeshellarg() also include characters that are NOT part of the current encoding?

For example ls on Linux also includes characters that are not part of the current encoding.

Many people use escapeshellarg() on files that come from the internet (like me). They don't know their encoding nor are they aware of that escapeshellarg() strips off characters which are not part of UTF-8 (which is the default encoding for most users).

damianwadley commented 8 months ago

I can't tell you exactly why it does that. I can tell you that it was added some 15 years ago back in the PHP 5.2 era with a note of "Properly address incomplete multibyte chars inside escapeshellcmd()".

So it was done deliberately; by the sound of it, to avoid accidentally passing invalid multibyte sequences.

But the point of my question was another: WHY does escapeshellarg() strip off bytes that are not part of the current locale at all?

Because that was the decision at the time? I don't know, and I don't know if Ilia A. or Stefan E. are active here on GitHub (let alone remember the motivation behind the changes) so I don't think we'll be able to ask them.

Had it been implemented today, I suspect an invalid string would produce an E_WARNING, but I don't see preserving the invalid bytes as being a viable solution given that:

Shouldn't escapeshellarg() also include characters that are NOT part of the current encoding?

Not if the requirement is to make sure it produces a valid and safe shell command/argument. For all I know, a byte sequence that violates the LC_CTYPE locale could create a vulnerability. And you know, I wouldn't even be that surprised if it did.

For example ls on Linux also includes characters that are not part of the current encoding.

Sure, but all it has to worry about is outputting a filename. That's totally different from having to worry about escaping arbitrary strings for execution in shell commands.

Many people use escapeshellarg() on files that come from the internet (like me). They don't know their encoding nor are they aware of that escapeshellarg() strips off characters which are not part of UTF-8 (which is the default encoding for most users).

Which is why this behavior needs to be documented, as it does with any other function that depends on the locale.

Also, while the default character encoding is often UTF-8, the default server locale may or may not be. Locales are a very different beast.

Also also, side note: allowing the user to dictate a filename on a server is highly unsafe. Always generate filenames - and especially extensions - yourself, and if you need the know the original name then store it safely in a database. Because really, there's no reason why the file's name on the server should be controlled by the user, so don't even give them the power to do that at all.

php / doc-en