oracle / graal

GraalVM compiles Java applications into native executables that start instantly, scale fast, and use fewer compute resources 🚀
https://www.graalvm.org

[GR-52826] Non-ASCII characters in command line arguments are replaced by U+FFFD in Windows (native-image) #8593

Open ackasaber opened 7 months ago

ackasaber commented 7 months ago

Describe the issue

It seems that the command line arguments aren't properly decoded when building via native-image in Windows. A simple one-liner that dumps the arguments into a text file demonstrates this.

I observed this when trying to pass a Russian-localized "My Pictures" path to a command-line utility.

The issue doesn't happen when building the classic way.

Steps to reproduce the issue

  1. Create the following LogArguments.java source file.
import java.io.IOException;
import static java.nio.charset.StandardCharsets.UTF_16;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class LogArguments {
    public static void main(String[] args) throws IOException {
        Files.write(Path.of("arguments.txt"), List.of(args), UTF_16);
    }
}

It dumps the command line arguments to a file arguments.txt in the current directory, one per line and as is (in UTF-16).

  2. javac LogArguments.java
  3. native-image LogArguments
  4. Run the created logarguments.exe as follows
logarguments.exe Свободу политзаключённым!
  5. Examine the created arguments.txt file.
�������
����������������!

The hex dump verifies that all non-ASCII characters got replaced by the Unicode U+FFFD character:

           00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F

00000000   FE FF FF FD FF FD FF FD FF FD FF FD FF FD FF FD  þ..ý.ý.ý.ý.ý.ý.ý
00000010   00 0D 00 0A FF FD FF FD FF FD FF FD FF FD FF FD  .....ý.ý.ý.ý.ý.ý
00000020   FF FD FF FD FF FD FF FD FF FD FF FD FF FD FF FD  .ý.ý.ý.ý.ý.ý.ý.ý
00000030   FF FD FF FD 00 21 00 0D 00 0A                    .ý.ý.!....
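For readers who want to check the dump by hand, here is a minimal sketch of how the bytes come about. It assumes only what the reproducer already uses: Java's UTF_16 charset, which writes big-endian code units after an FE FF byte order mark. The class and method names are made up for illustration.

```java
import static java.nio.charset.StandardCharsets.UTF_16;

public class ReplacementCharDump {
    // Encodes text with the same charset the reproducer passes to Files.write
    // and formats the resulting bytes as space-separated hex pairs.
    static String hexOfUtf16(String text) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(UTF_16)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Two replacement characters followed by '!'
        System.out.println(hexOfUtf16("\uFFFD\uFFFD!")); // FE FF FF FD FF FD 00 21
    }
}
```

Every FF FD pair in the dump is therefore one U+FFFD character, and only the trailing 00 21 ("!") and the 00 0D 00 0A line breaks survived as real text.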

Describe GraalVM and your environment:

I reproduced this with the graalvm-community-openjdk-23+12.1 build.

>java --version
openjdk 23 2024-09-17
OpenJDK Runtime Environment GraalVM CE 23-dev+12.1 (build 23+12-jvmci-b01)
OpenJDK 64-Bit Server VM GraalVM CE 23-dev+12.1 (build 23+12-jvmci-b01, mixed mode, sharing)

More details

I glanced over the GraalVM sources and didn't find any mention of GetCommandLineW, so there is a good chance that wide-character argument handling was never implemented on Windows and the narrow main arguments are used as is.

fniephaus commented 7 months ago

Thanks a lot for bringing this to our attention, @ackasaber! Let me try to reproduce, debug, and fix this...

fniephaus commented 7 months ago

Ok, so I'm having trouble reproducing this. I tried changing my system language but that did not seem to work. In PowerShell, I can run this:

> python -c "import sys; print(sys.argv)" Привет, мир
['-c', 'Привет', 'мир']
> java "-Dfile.encoding=UTF8" LogArguments Привет, мир
> cat .\arguments.txt
??????
???

With and without -Dfile.encoding=UTF8, I only get ?s in the arguments file.

Do you have any idea what is going on?

fniephaus commented 7 months ago

Ok, so I got this working after enabling the beta support for UTF-8, but also, the native executable seems to do the right thing (I added a simple System.out.println(List.of(args)); to your reproducer): [image]

Could you please provide us with more details how to reproduce the inconsistent behavior you're observing?

ackasaber commented 7 months ago

Very nice! You've found THE dialog. I confirm that if UTF-8 is activated there, both classic launching and native-image build work consistently and correctly.

However, the UTF-8 option is not the default and the majority of users don't know about this dialog. This dialog is a legacy of pre-Unicode days. As you see, it's intended for "non-Unicode programs". Java is Unicode, right? Therefore it shouldn't depend on this option! The default value in the dialog is set during the Windows installation according to the Windows localization. For Russian-localized Windows that I've got it's "Russian (Russia)". It means that legacy A-versions of Windows API functions that take or return strings encode them in Windows-1251 encoding.

I intentionally didn't use anything that depends on file.encoding in the sample code. I also don't trust System.out for non-ASCII output; in my experience it has always been hit-or-miss depending on where the output is displayed. Not only does the JCL itself play games with its encoding, but IDEs and various shells also have conflicting opinions on the standard output encoding. So dumping the arguments into a file with a known encoding isolates the problem to the values of the main arguments.

I also tested running the sample with "English (USA)" for "Language for non-Unicode programs" and got ?s both for classic and native runs. PowerShell was positively crazy in the process, I couldn't even type Russian in there.
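The two failure modes seen in this thread (?s versus U+FFFD) can be imitated in plain Java. The sketch below is not what the JDK launcher or native-image does internally. It only assumes that the arguments pass through a narrow code page somewhere along the way, and the class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CodePageRoundTrip {
    // Imitates text being squeezed through a narrow (-A style) code page and back.
    // String.getBytes replaces characters the code page cannot represent with '?'.
    static String throughCodePage(String text, Charset codePage) {
        byte[] narrow = text.getBytes(codePage);
        return new String(narrow, codePage);
    }

    // Imitates bytes produced in one code page being decoded as another.
    // The String constructor replaces malformed input with U+FFFD.
    static String misdecode(String text, Charset producedIn, Charset decodedAs) {
        return new String(text.getBytes(producedIn), decodedAs);
    }

    public static void main(String[] args) {
        Charset cp1251 = Charset.forName("windows-1251"); // "Russian (Russia)"
        Charset cp1252 = Charset.forName("windows-1252"); // "English (USA)"
        String arg = "\u041F\u0440\u0438\u0432\u0435\u0442"; // the word from the PowerShell test

        // Cyrillic fits into windows-1251, so the text survives.
        System.out.println(throughCodePage(arg, cp1251).equals(arg)); // true
        // Cyrillic does not exist in windows-1252, so every letter becomes '?'.
        System.out.println(throughCodePage(arg, cp1252)); // ??????
        // windows-1251 bytes are not valid UTF-8, so decoding yields U+FFFD.
        System.out.println(misdecode(arg, cp1251, StandardCharsets.UTF_8).contains("\uFFFD")); // true
    }
}
```

This matches the observation that an "English (USA)" setting produces ?s, and it suggests (without proving) that the U+FFFD output comes from code-page bytes being decoded with the wrong charset.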

ackasaber commented 7 months ago

Your Python piece by the way works well independent of the "Language for non-Unicode programs", at least in cmd.exe. PowerShell has troubles displaying Cyrillic characters when this option doesn't match the argument language, but Python output is still okay somehow.

fniephaus commented 7 months ago

Thanks for the info, @ackasaber. I can finally reproduce the issue after changing the system locale (there are just too many ways to change languages on Windows these days). Now looking into a fix.

fniephaus commented 7 months ago

Ok, so I have discussed this internally: The JDK seems to convert arguments in their app launchers: https://github.com/openjdk/jdk/blob/700d2b91defd421a2818f53830c24f70d11ba4f6/src/jdk.jpackage/windows/native/common/WinSysInfo.cpp#L137

Instead of doing this, we can avoid the additional overhead (and potential for errors) by switching to wmain on Windows. This will also allow us to provide other features on Windows, such as a javaw.exe-like entry point that allows running an app without a command prompt.

We currently have no ETA for this but we will update this ticket when we do.

CJ-Chen commented 6 months ago

Hi there. I encountered a similar problem today. I tried many parameters and failed. However, I managed to solve the problem by providing a parameter file that contains all the non-ASCII text instead of passing all parameters directly on the command line. [image]

Thus, maybe we could just write a simple wrapper script to invoke the native image.
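On the application side, the parameter-file workaround could look roughly like the sketch below. The "@file" convention, the class name, and the choice of UTF-8 for the file are assumptions made for illustration. None of them come from GraalVM or from the screenshot above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class ArgFileExpander {
    // Replaces every "@file" argument with the non-blank lines of that file,
    // read as UTF-8. All other arguments are passed through unchanged.
    static List<String> expand(String[] args) throws IOException {
        List<String> expanded = new ArrayList<>();
        for (String arg : args) {
            if (arg.length() > 1 && arg.startsWith("@")) {
                Path file = Path.of(arg.substring(1));
                for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
                    // Drop a leading byte order mark that some Windows editors add.
                    if (line.startsWith("\uFEFF")) {
                        line = line.substring(1);
                    }
                    if (!line.isBlank()) {
                        expanded.add(line.strip());
                    }
                }
            } else {
                expanded.add(arg);
            }
        }
        return expanded;
    }

    public static void main(String[] args) throws IOException {
        // Example: myapp.exe @params.txt
        System.out.println(expand(args).size() + " argument(s) after expansion");
    }
}
```

The name of the parameter file still travels through the command line, so it has to stay ASCII for this to help.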

ackasaber commented 6 months ago

@CJ-Chen A nice workaround! But not a good general solution for GraalVM.

I would be surprised if this bug gets a fix this year. It just can't be a priority: it's only in Windows while the majority of Java apps run on Linux + it's in command line parsing and the majority of Java apps don't do much of it.

ackasaber commented 4 months ago

Apparently Microsoft is doing a U-turn on their encodings zoo and now promotes using UTF-8 for new applications via a dedicated manifest property. The activeCodePage manifest property was introduced in Windows 10 Version 1903.

Until recently, Windows has emphasized "Unicode" -W variants over -A APIs. However, recent releases have used the ANSI code page and -A APIs as a means to introduce UTF-8 support to apps. If the ANSI code page is configured for UTF-8, then -A APIs typically operate in UTF-8. This model has the benefit of supporting existing code built with -A APIs without any code changes.

The article goes so far as to call "Win32 API [that] might only understand WCHAR" legacy.

It's also possible to slap this manifest onto an existing exe. I'll try it in the meantime.

See also a blog post from Raymond Chen about this feature.