Closed fengzyf closed 6 months ago
I also asked on stackoverflow see (https://stackoverflow.com/questions/77880391/parpacker-packaged-scripts-lose-the-ability-to-parse-utf-8-arguments-from-the), but it looks like the problem was caused by a PAR-Packer.
You didn't read carefully: ikegami wrote "Works for me".
I can't test this anyway (no Windows here), but I'm curious: what is the actual problem you're trying to solve?
Why do you need GetCommandLineW, what's wrong with @ARGV?
The actual problem: without changing the basic settings listed under "Info", I want to pass the UTF-8 string "äö.txt" as an argument to my Perl script.
I found ikegami's answer in https://stackoverflow.com/a/63868721:
Every Windows system call that deals with strings comes in two varieties: An "A"NSI version that uses the Active Code Page (aka ANSI Code Page), and a "W"ide version that uses UTF-16le. Perl uses the A version of all system calls. That includes the call to get the command line.
The ACP is hard-coded. chcp changes the console's CP, but not the encoding used by the A system calls.
The choice of console's CP (as set by chcp) has no effect on how Perl receives the command line. Because Perl uses the A version of the system calls, the command line will be encoded using the ACP regardless of the console's CP and the OEM CP.
But what if you wanted to support arbitrary Unicode characters instead of being limited to those found in your system's ACP? As mentioned above, you could [change](https://learn.microsoft.com/en-us/windows/uwp/design/globalizing/use-utf8-code-page) perl's ACP. Changing it to 65001 (UTF-8) would give you access to the entire Unicode character set.
Short of doing that, you would need to get the command line from the OS using the W version of the system call and parse it.
While Perl uses the A version of system calls, this doesn't limit modules from doing the same. They may use W system calls.[3] So maybe there's a module that does what you need. If not, I've previously written [code](https://stackoverflow.com/a/44489228/589924) that does just that.
The ACP on my OS is 936, which supports the gb2312 encoding. "ä" and "ö" are not within the range supported by gb2312, so "äö" is replaced by "??" when passed to @ARGV.
Therefore, I chose to use the script provided by ikegami in "Handling wide char values returned by Win32::API" to obtain the command line parameters through GetCommandLineW. This time "äö" was passed to my script successfully. But when I packaged the script with PAR::Packer, "äö" turned back into "??".
If I change the value of ACP in HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage from 936 to 65001, "äö" is passed correctly when I run CmdArgs.exe äö.txt. This explains why ikegami wrote "Works for me."
I guess the reason for the occurrence is: When I run CmdArgs.exe äö.txt, CmdArgs.exe automatically decompresses to a temporary directory, recording the new execution file path as $path_new and the command line parameter @args_new (which also only supports characters in ACP (936), "äö" has been replaced by "??"), and then runs a new command based on $path_new and @args_new. Therefore, the command line parameter obtained by GetCommandLineW is C:\Users\lenovo\AppData\Local\Temp\par-5875\cache-622ad9aefc40f2ce41f40707644eb592d3c73c6a/CmdArgs.exe ??.txt.
Again, I don't want to change the ACP from 936 to 65001. See https://stackoverflow.com/a/68066008.
Do you have any opinion on my guess? Where can I modify the so-called @args_new if something similar exists?
I guess the reason for the occurrence is: When I run CmdArgs.exe äö.txt, CmdArgs.exe automatically decompresses to a temporary directory, recording the new execution file path as $path_new and the command line parameter @args_new (which also only supports characters in ACP (936), "äö" has been replaced by "??"), and then runs a new command based on $path_new and @args_new. Therefore, the command line parameter obtained by GetCommandLineW is C:\Users\lenovo\AppData\Local\Temp\par-5875\cache-622ad9aefc40f2ce41f40707644eb592d3c73c6a/CmdArgs.exe ??.txt.
That's basically correct. A few more details:
The entry point is in myldr/boot.c: int main ( int argc, char **argv, char **env )
argv is filled in via GetCommandLineA ☹️
It creates the cache directory C:\Users\lenovo\AppData\Local\Temp\par-5875\cache-622ad9aefc40f2ce41f40707644eb592d3c73c6a if necessary and extracts (amongst others) a file called CmdArgs.exe into it which is a special purpose perl interpreter; the path to this perl interpreter is called my_perl in myldr/boot.c.
It then runs this interpreter (argv is passed unchanged except argv[0] which is changed to my_perl):
rc = spawnvp(P_WAIT, my_perl, (char* const*)argv);
In myldr/main.c, argv is extended with a few additional arguments resulting in fakeargv, which is passed to perl_parse:
exitstatus = perl_parse(my_perl, par_xs_init, fakeargc, fakeargv, NULL);
Note: The problem with command line handling for Perl on Windows is known, e.g. @ARGV, -CA and Win32, but no solution so far.
Thank you very much for your help! I am not a professional programmer and I am not familiar with the C language, but with the help of ChatGPT I have solved this problem. If anything in the patch should be changed, please let me know.
I made changes only to boot.c:
Embed the following code before par_init_env();
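/* Fetch the original UTF-16 command line and split it into arguments.
 * CommandLineToArgvW is declared in <shellapi.h>. */
LPWSTR lpCmdLine = GetCommandLineW();
LPWSTR* argv_w = CommandLineToArgvW(lpCmdLine, &argc);
if (argv_w == NULL) {
    printf("Failed to parse command line\n");
    return 1;
}
/* One extra slot: spawnvp expects a NULL-terminated argument vector. */
argv = (char**)malloc((argc + 1) * sizeof(char*));
if (argv == NULL) {
    printf("Failed to allocate memory\n");
    LocalFree(argv_w);
    return 1;
}
for (int i = 0; i < argc; i++) {
    /* First call only measures the UTF-8 size, including the trailing NUL. */
    int len = WideCharToMultiByte(CP_UTF8, 0, argv_w[i], -1, NULL, 0, NULL, NULL);
    if (len == 0) {
        printf("Failed to convert argument %d\n", i);
        LocalFree(argv_w);
        for (int j = 0; j < i; j++) {
            free(argv[j]);
        }
        free(argv);
        return 1;
    }
    argv[i] = (char*)malloc(len * sizeof(char));
    if (argv[i] == NULL) {
        printf("Failed to allocate memory\n");
        LocalFree(argv_w);
        for (int j = 0; j < i; j++) {
            free(argv[j]);
        }
        free(argv);
        return 1;
    }
    /* Second call performs the actual UTF-16 to UTF-8 conversion. */
    WideCharToMultiByte(CP_UTF8, 0, argv_w[i], -1, argv[i], len, NULL, NULL);
}
argv[argc] = NULL;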
Append the following after rc = spawnvp(P_WAIT, my_perl, (char* const*)argv);
LocalFree(argv_w);
for (int i = 0; i < argc; i++) {
    free(argv[i]);
}
free(argv);
The installation steps are as follows:
1. Install strawberry-perl-5.38.0.1-64bit.msi.
2. Run cpanm -f PAR::Packer.
3. Modify C:\Users\lenovo\.cpanm\work\1706663183.21600\PAR-Packer-1.061\myldr\boot.c according to the description provided.
4. Run the following commands:
cd C:\Users\lenovo\.cpanm\work\1706663183.21600\PAR-Packer-1.061
perl Makefile.PL
make
make test
make install
The content of myperl.pl is as follows:
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw/decode/;
use utf8;
binmode(STDOUT, ":utf8");
@ARGV = map { decode("utf-8", $_) } @ARGV;
print join(", ", @ARGV);
Next, run the following commands in cmd to create an executable file from the script and test it.
pp -o myperl.exe myperl.pl
chcp 65001
myperl.exe äö.txt äö.txt
Finally, the cmd window prints äö.txt, äö.txt.
I committed a slightly edited version of your patch to branch "ci". Can you test with that?
I have tested the branch "ci" and it works well on my Windows system.
I have conducted new tests and found some issues.
When I run myperl.exe 一 二 三 四 五 六 七 八 九 零 äö in the cmd window, it displays:
一, �?�?�?�?�?�?�?�?�?äö
Here, 一 二 三 四 五 六 七 八 九 零 are Simplified Chinese characters, which translate to "one two three four five six seven eight nine zero" in English.
If I run perl CmdArgs.pl 一 二 三 四 五 六 七 八 九 零 äö in the cmd window, it displays:
<perl>
<CmdArgs.pl>
<一>
<二>
<三>
<四>
<五>
<六>
<七>
<八>
<九>
<零>
<äö>
I then made the following modification in boot.c to dump argv to a file.
Embed the following code after argv[wargc] = NULL;
FILE *file = fopen("E:\\argv.txt", "w");
if (file != NULL) {
    for (int i = 0; i < argc; i++) {
        fprintf(file, "argv[%d]: %s\n", i, argv[i]);
    }
    fclose(file);
}
The content of E:\argv.txt is as follows:
argv[0]: myperl.exe
argv[1]: 一
argv[2]: 二
argv[3]: 三
argv[4]: 四
argv[5]: 五
argv[6]: 六
argv[7]: 七
argv[8]: 八
argv[9]: 九
argv[10]: 零
argv[11]: äö
I cannot solve this problem and hope to receive your help.
The content of E:\argv.txt is as follows:
OK, as expected. Now let's see what the custom perl interpreter sees (before perl gets to work).
Please apply the following patch to myldr/main.c:
diff --git a/myldr/main.c b/myldr/main.c
index ffb8790..d0cb1b5 100644
--- a/myldr/main.c
+++ b/myldr/main.c
@@ -99,6 +99,18 @@ int main ( int argc, char **argv, char **env )
PL_exit_flags |= PERL_EXIT_EXPECTED;
#endif /* PERL_EXIT_EXPECTED */
+ {
+ int i;
+ unsigned char *p;
+ printf("main.c argv:\n");
+ for (i = 0; i < argc; i++) {
+ printf("[%i]", i);
+ for (p = (unsigned char*)argv[i]; *p; p++)
+ printf(" %02X", *p);
+ printf("\n");
+ }
+ }
+
fakeargc = argc + 3; /* allow for "-e", my_par_pl, "--" arguments */
#ifdef PERL_PROFILING
fakeargc++; /* "-d:DProf" */
Then rebuild and install PAR::Packer and repack myperl.exe. What's the output when you run myperl.exe 一 二 三 四 五 六 七 八 九 零 äö? And for comparison, what's the output for (in the PAR::Packer build directory) myldr\par.exe 一 二 三 四 五 六 七 八 九 零 äö?
When I run myperl.exe 一 二 三 四 五 六 七 八 九 零 äö, cmd displays
main.c argv:
[0] 43 3A 5C 55 73 65 72 73 5C 6C 65 6E 6F 76 6F 5C 41 70 70 44 61 74 61 5C 4C 6F 63 61 6C 5C 54 65 6D 70 5C 70 61 72 2D 36 63 36 35 36 65 36 66 37 36 36 66 5C 63 61 63 68 65 2D 62 37 32 62 34 61 33 31 65 38 66 66 32 64 63 61 37 31 34 37 37 37 31 32 33 61 35 32 63 33 65 65 35 33 35 30 61 30 32 32 2F 6D 79 70 65 72 6C 2E 65 78 65
[1] E4 B8 80
[2] E4 BA 3F E4 B8 3F E5 9B 3F E4 BA 3F E5 85 3F E4 B8 3F E5 85 3F E4 B9 3F E9 9B 3F C3 A4 C3 B6
一, �?�?�?�?�?�?�?�?�?äö
When I run myldr\par.exe 一 二 三 四 五 六 七 八 九 零 äö, cmd displays
main.c argv:
[0] 6D 79 6C 64 72 5C 70 61 72 2E 65 78 65
[1] D2 BB
[2] B6 FE
[3] C8 FD
[4] CB C4
[5] CE E5
[6] C1 F9
[7] C6 DF
[8] B0 CB
[9] BE C5
[10] C1 E3
[11] 3F 3F
par.pl: Can't open perl script "一": No such file or directory
I found the content below from https://learn.microsoft.com/en-us/cpp/c-runtime-library/spawn-wspawn-functions?view=msvc-170
The _spawn functions each create and execute a new process. They automatically handle multibyte-character
string arguments as appropriate, recognizing multibyte-character sequences according to the multibyte
code page currently in use. The _wspawn functions are wide-character versions of the _spawn functions;
they don't handle multibyte-character strings. Otherwise, the _wspawn functions behave identically to
their _spawn counterparts.
I made the following modifications in boot.c. I added
#include <wchar.h>
#include <stdlib.h>
#include <process.h>
after #include <windows.h>, and replaced
rc = spawnvp(P_WAIT, my_perl, (char* const*)argv);
with
/* Convert the interpreter path from UTF-8 to UTF-16. */
size_t my_perl_len = strlen(my_perl) + 1;
wchar_t *w_my_perl = malloc(my_perl_len * sizeof(wchar_t));
MultiByteToWideChar(CP_UTF8, 0, my_perl, -1, w_my_perl, my_perl_len);
/* Convert every argument from UTF-8 to UTF-16. */
wchar_t **w_argv = malloc((argc + 1) * sizeof(wchar_t*));
for (int i = 0; i < argc; ++i) {
    size_t arg_len = strlen(argv[i]) + 1;
    w_argv[i] = malloc(arg_len * sizeof(wchar_t));
    MultiByteToWideChar(CP_UTF8, 0, argv[i], -1, w_argv[i], arg_len);
}
w_argv[argc] = NULL;
/* Spawn with the wide-character variant so no code page conversion occurs. */
rc = _wspawnvp(_P_WAIT, w_my_perl, (const wchar_t* const*)w_argv);
for (int i = 0; i < argc; ++i) {
    free(w_argv[i]);
}
free(w_argv);
free(w_my_perl);
Then I rebuilt and installed PAR::Packer and repacked myperl.exe and CmdArgs.exe.
When I run myperl.exe 一 二 三 四 五 六 七 八 九 零 äö, cmd displays
main.c argv:
[0] 43 3A 5C 55 73 65 72 73 5C 6C 65 6E 6F 76 6F 5C 41 70 70 44 61 74 61 5C 4C 6F 63 61 6C 5C 54 65 6D 70 5C 70 61 72 2D 36 63 36 35 36 65 36 66 37 36 36 66 5C 63 61 63 68 65 2D 35 64 61 36 32 64 31 62 62 32 61 65 66 33 33 64 34 39 62 35 31 66 37 33 66 66 30 36 36 34 35 34 32 65 31 38 36 62 30 39 2F 6D 79 70 65 72 6C 2E 65 78 65
[1] D2 BB
[2] B6 FE
[3] C8 FD
[4] CB C4
[5] CE E5
[6] C1 F9
[7] C6 DF
[8] B0 CB
[9] BE C5
[10] C1 E3
[11] 3F 3F
һ, ��, ��, ��, ��, ��, ��, ��, ��, ��, ??
When I run CmdArgs.exe 一 二 三 四 五 六 七 八 九 零 äö, cmd displays
main.c argv:
[0] 43 3A 5C 55 73 65 72 73 5C 6C 65 6E 6F 76 6F 5C 41 70 70 44 61 74 61 5C 4C 6F 63 61 6C 5C 54 65 6D 70 5C 70 61 72 2D 36 63 36 35 36 65 36 66 37 36 36 66 5C 63 61 63 68 65 2D 38 37 32 30 30 38 63 32 64 31 31 66 38 36 34 34 65 65 66 63 33 36 37 32 35 32 39 66 38 62 38 61 33 37 62 62 65 33 34 64 2F 43 6D 64 41 72 67 73 2E 65 78 65
[1] D2 BB
[2] B6 FE
[3] C8 FD
[4] CB C4
[5] CE E5
[6] C1 F9
[7] C6 DF
[8] B0 CB
[9] BE C5
[10] C1 E3
[11] 3F 3F
<C:\Users\lenovo\AppData\Local\Temp\par-6c656e6f766f\cache-872008c2d11f8644eefc3672529f8b8a37bbe34d/CmdArgs.exe>
<一>
<二>
<三>
<四>
<五>
<六>
<七>
<八>
<九>
<零>
<äö>
I made two changes:
In boot.c (except for the "quote argv strings if necessary" part, temporarily commented out), use CommandLineToArgvW(GetCommandLineW(), &argc) to get the original command line.
my_perl, but I think CP_UTF8 is wrong here, it's encoded in whatever the "current" codepage is set to.
I pushed this to the "ci" branch. It passes all tests except for the ones testing Windows command line quoting rules. I'll have to reimplement them using wchar instead of char.
Please test.
Thanks again! I tested the latest "ci" branch, and when I ran CmdArgs.exe 一 二 三 四 五 六 七 八 九 零 äö or CmdArgs.exe "一 二" 三 四 五 六 七 八 九 零 äö, cmd displayed
<C:\Users\Xu\AppData\Local\Temp\par-5875\cache-f398e2715e4659c6ca2ef0ccdec734fe1265a7a5/CmdArgs.exe>
<一>
<二>
<三>
<四>
<五>
<六>
<七>
<八>
<九>
<零>
<äö>
I made the following modifications in boot.c. I added this function
wchar_t* shell_quote_wide(const wchar_t *src)
{
    /* some characters from src may be replaced with two chars,
     * add enclosing quotes and trailing \0 */
    wchar_t *dst = malloc((2 * wcslen(src) + 3) * sizeof(wchar_t));
    const wchar_t *p = src;
    wchar_t *q = dst;
    wchar_t c;
    *q++ = L'"'; /* opening quote */
    while ((c = *p))
    {
        if (c == L'\\')
        {
            int n = wcsspn(p, L"\\"); /* span of backslashes starting at p */
            wmemcpy(q, p, n);
            q += n;
            if (p[n] == L'\0' || p[n] == L'"') /* span ends in quote or NUL */
            {
                wmemcpy(q, p, n); /* double the whole span */
                q += n;
            }
            p += n; /* advance over the span */
            continue;
        }
        if (c == L'"')
            *q++ = L'\\'; /* escape the following quote */
        *q++ = c;
        p++;
    }
    *q++ = L'"'; /* closing quote */
    *q++ = L'\0';
    return dst;
}
after the existing lines
return dst;
}
and this loop
for (i = 0; i < argc; i++)
{
    len = wcslen(w_argv[i]);
    if (len == 0
        || w_argv[i][len-1] == L'\\'
        || wcspbrk(w_argv[i], L" \t\n\r\v\""))
    {
        w_argv[i] = shell_quote_wide(w_argv[i]);
    }
}
before
rc = _wspawnvp(P_WAIT, w_my_perl, (char* const*)w_argv);
After rebuilding and installing PAR::Packer and repacking CmdArgs.exe, when I ran CmdArgs.exe "一 二" 三 四 五 六 七 八 九 零 äö, cmd displayed
<C:\Users\Xu\AppData\Local\Temp\par-5875\cache-2d760f124f44a0df6bd8ac7b201d43c9f8b3a245/CmdArgs.exe>
<一 二>
<三>
<四>
<五>
<六>
<七>
<八>
<九>
<零>
<äö>
Thank you so much for your help! The latest version of "ci" branch solves my problem.
The answer provided in "Handling wide char values returned by Win32::API" can parse UTF-8 command line arguments on Windows.
But with PAR::Packer packaging, the parsing fails.
If I save this code in CmdArgs.pl and run perl CmdArgs.pl äö.txt, the cmd window displays
I then packaged CmdArgs.pl with the PAR::Packer command pp -o CmdArgs.exe CmdArgs.pl. When I run CmdArgs.exe äö.txt, the cmd window shows
What I want it to show is
Info:
ACP, OEMCP and MACCP in the registry HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage are 936, 936, and 10008.
I ran chcp 65001 in cmd to change the console's CP.
Debugging:
Win32::GetConsoleOutputCP() returns 65001 (UTF-8) as expected.
$args->[ $#$args ] is 3F.3F.2E.74.78.74 (as returned by sprintf "%vX").
$cmd_line is ….2F.43.6D.64.41.72.67.73.2E.65.78.65.20.3F.3F.2E.74.78.74.
GetCommandLineW is ….2F.0.43.0.6D.0.64.0.41.0.72.0.67.0.73.0.2E.0.65.0.78.0.65.0.20.0.3F.0.3F.0.2E.0.74.0.78.0.74.0.
I also asked on stackoverflow see (https://stackoverflow.com/questions/77880391/parpacker-packaged-scripts-lose-the-ability-to-parse-utf-8-arguments-from-the), but it looks like the problem was caused by a PAR-Packer.