rschupp / PAR-Packer

(perl) Generate stand-alone executables, perl scripts and PAR files https://metacpan.org/pod/PAR::Packer

PAR::Packer packaged scripts lose the ability to parse UTF-8 arguments from the command line #84

Closed fengzyf closed 6 months ago

fengzyf commented 7 months ago

The answer provided in "Handling wide char values returned by Win32::API" can parse UTF-8 command line arguments on Windows.

But once the script is packaged with PAR::Packer, the parsing fails.

If I save this code

use strict;
use warnings;
use feature qw( say state );

use open ':std', ':encoding('.do { require Win32; "cp".Win32::GetConsoleOutputCP() }.')';

use Config     qw( %Config );
use Encode     qw( decode encode );
use Win32::API qw( ReadMemory );

use constant PTR_SIZE => $Config{ptrsize};

use constant PTR_PACK_FORMAT =>
     PTR_SIZE == 8 ? 'Q'
   : PTR_SIZE == 4 ? 'L'
   : die("Unrecognized ptrsize\n");

use constant PTR_WIN32API_TYPE =>
     PTR_SIZE == 8 ? 'Q'
   : PTR_SIZE == 4 ? 'N'
   : die("Unrecognized ptrsize\n");

sub lstrlenW {
   my ($ptr) = @_;

   state $lstrlenW = Win32::API->new('kernel32', 'lstrlenW', PTR_WIN32API_TYPE, 'i')
      or die($^E);

   return $lstrlenW->Call($ptr);
}

sub decode_LPCWSTR {
   my ($ptr) = @_;
   return undef if !$ptr;

   my $num_chars = lstrlenW($ptr)
      or return '';

   return decode('UTF-16le', ReadMemory($ptr, $num_chars * 2));
}

# Returns true on success. Returns false and sets $^E on error.
sub LocalFree {
   my ($ptr) = @_;

   state $LocalFree = Win32::API->new('kernel32', 'LocalFree', PTR_WIN32API_TYPE, PTR_WIN32API_TYPE)
      or die($^E);

   return $LocalFree->Call($ptr) == 0;
}

sub GetCommandLine {
   state $GetCommandLine = Win32::API->new('kernel32', 'GetCommandLineW', '', PTR_WIN32API_TYPE)
      or die($^E);

   return decode_LPCWSTR($GetCommandLine->Call());
}

# Returns a reference to an array on success. Returns undef and sets $^E on error.
sub CommandLineToArgv {
   my ($cmd_line) = @_;

   state $CommandLineToArgv = Win32::API->new('shell32', 'CommandLineToArgvW', 'PP', PTR_WIN32API_TYPE)
      or die($^E);

   my $cmd_line_encoded = encode('UTF-16le', $cmd_line."\0");
   my $num_args_buf = pack('i', 0);  # Allocate space for an "int".

   my $arg_ptrs_ptr = $CommandLineToArgv->Call($cmd_line_encoded, $num_args_buf)
      or return undef;

   my $num_args = unpack('i', $num_args_buf);
   my @args =
      map { decode_LPCWSTR($_) }
         unpack PTR_PACK_FORMAT.'*',
            ReadMemory($arg_ptrs_ptr, PTR_SIZE * $num_args);

   LocalFree($arg_ptrs_ptr);
   return \@args;
}

{
   my $cmd_line = GetCommandLine();

   say $cmd_line;

   my $args = CommandLineToArgv($cmd_line)
      or die("CommandLineToArgv: $^E\n");

   for my $arg (@$args) {
      say "<$arg>";
   }
}

in CmdArgs.pl and run perl CmdArgs.pl äö.txt, the cmd window displays

perl  CmdArgs.pl äö.txt
<perl>
<CmdArgs.pl>
<äö.txt>

I then packaged CmdArgs.pl with the PAR::Packer command pp -o CmdArgs.exe CmdArgs.pl. When I run CmdArgs.exe äö.txt, the cmd window shows

C:\Users\lenovo\AppData\Local\Temp\par-5875\cache-622ad9aefc40f2ce41f40707644eb592d3c73c6a/CmdArgs.exe ??.txt
<C:\Users\lenovo\AppData\Local\Temp\par-5875\cache-622ad9aefc40f2ce41f40707644eb592d3c73c6a/CmdArgs.exe>
<??.txt>

What I want it to show is

C:\Users\lenovo\AppData\Local\Temp\par-5875\cache-622ad9aefc40f2ce41f40707644eb592d3c73c6a/CmdArgs.exe äö.txt
<C:\Users\lenovo\AppData\Local\Temp\par-5875\cache-622ad9aefc40f2ce41f40707644eb592d3c73c6a/CmdArgs.exe>
<äö.txt>

Info:

Debugging:

I also asked on Stack Overflow (see https://stackoverflow.com/questions/77880391/parpacker-packaged-scripts-lose-the-ability-to-parse-utf-8-arguments-from-the), but it looks like the problem is caused by PAR-Packer.

rschupp commented 7 months ago

I also asked on Stack Overflow (see https://stackoverflow.com/questions/77880391/parpacker-packaged-scripts-lose-the-ability-to-parse-utf-8-arguments-from-the), but it looks like the problem is caused by PAR-Packer.

You didn't read carefully: ikegami wrote "Works for me".

I can't test this anyway (no Windows here), but I'm curious: what is the actual problem you're trying to solve? Why do you need GetCommandLineW, what's wrong with @ARGV?

fengzyf commented 7 months ago

The actual problem: without changing the basic settings listed under "Info", I want to pass the UTF-8 string "äö.txt" as an argument to my Perl script.

I found ikegami's answer in https://stackoverflow.com/a/63868721:

Every Windows system call that deals with strings comes in two varieties: An "A"NSI version that uses the Active Code Page (aka ANSI Code Page), and a "W"ide version that uses UTF-16le. Perl uses the A version of all system calls. That includes the call to get the command line.

The ACP is hard-coded. chcp changes the console's CP, but not the encoding used by the A system calls. 

The choice of console's CP (as set by chcp) has no effect on how Perl receives the command line. Because Perl uses the A version of the system calls, the command line will be encoded using the ACP regardless of the console's CP and the OEM CP.

But what if you wanted to support arbitrary Unicode characters instead of being limited to those found in your system's ACP? As mentioned above, you could [change](https://learn.microsoft.com/en-us/windows/uwp/design/globalizing/use-utf8-code-page) perl's ACP. Changing it to 65001 (UTF-8) would give you access to the entire Unicode character set.

Short of doing that, you would need to get the command line from the OS using the W version of the system call and parse it.

While Perl uses the A version of system calls, this doesn't limit modules to doing the same. They may use W system calls. So maybe there's a module that does what you need. If not, I've previously written [code](https://stackoverflow.com/a/44489228/589924) that does just that.

Because the ACP on my OS is 936, which supports the GB2312 encoding, and "ä" and "ö" are outside the range GB2312 covers, "äö" is replaced by "??" when passed to @ARGV.
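The "??" substitution described above can be reproduced with a toy model. The sketch below is not the real code page 936 conversion: it hard-codes only the ten byte pairs that show up in the hex dumps later in this thread, and `to_acp_toy` and `toy_table` are made-up names for illustration. ASCII passes through, table hits become two bytes, and everything else, including "ä" (U+00E4) and "ö" (U+00F6), collapses to "?".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Tiny excerpt of a Unicode -> CP936 table. The byte pairs are copied from
 * the hex dumps later in this thread; the real code page is far larger. */
struct map_entry { unsigned int cp; unsigned char hi, lo; };

static const struct map_entry toy_table[] = {
    { 0x4E00, 0xD2, 0xBB }, { 0x4E8C, 0xB6, 0xFE }, { 0x4E09, 0xC8, 0xFD },
    { 0x56DB, 0xCB, 0xC4 }, { 0x4E94, 0xCE, 0xE5 }, { 0x516D, 0xC1, 0xF9 },
    { 0x4E03, 0xC6, 0xDF }, { 0x516B, 0xB0, 0xCB }, { 0x4E5D, 0xBE, 0xC5 },
    { 0x96F6, 0xC1, 0xE3 },
};

/* Encode n code points into out (NUL-terminated).
 * out must hold at least 2 * n + 1 bytes.
 * Returns the number of bytes written, not counting the NUL. */
size_t to_acp_toy(const unsigned int *cps, size_t n, char *out)
{
    const size_t count = sizeof toy_table / sizeof toy_table[0];
    size_t len = 0;

    assert(cps != NULL && out != NULL);

    for (size_t i = 0; i < n; i++) {
        unsigned int cp = cps[i];
        size_t k;

        if (cp < 0x80) {                 /* ASCII is identical in CP936 */
            out[len++] = (char)cp;
            continue;
        }
        for (k = 0; k < count; k++)      /* linear search is fine for a toy */
            if (toy_table[k].cp == cp)
                break;
        if (k < count) {
            out[len++] = (char)toy_table[k].hi;
            out[len++] = (char)toy_table[k].lo;
        } else {
            out[len++] = '?';            /* unmappable: information is lost */
        }
    }
    out[len] = '\0';
    return len;
}
```

Feeding it the code points of äö.txt yields ??.txt, the same string the packed executable reported.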

Therefore, I chose to use the script provided by ikegami in "Handling wide char values returned by Win32::API" to obtain the command line arguments through GetCommandLineW. This time "äö" was passed to my script successfully. But once I packaged the script with PAR::Packer, "äö" was again replaced by "??".

If I change the value of ACP in HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage from 936 to 65001 and then run CmdArgs.exe äö.txt, "äö" is passed correctly. This explains why ikegami wrote "Works for me."

I guess the reason for the occurrence is: When I run CmdArgs.exe äö.txt, CmdArgs.exe automatically decompresses to a temporary directory, recording the new execution file path as $path_new and the command line parameter @args_new (which also only supports characters in ACP (936), "äö" has been replaced by "??"), and then runs a new command based on $path_new and @args_new. Therefore, the command line parameter obtained by GetCommandLineW is C:\Users\lenovo\AppData\Local\Temp\par-5875\cache-622ad9aefc40f2ce41f40707644eb592d3c73c6a/CmdArgs.exe ??.txt.

Again, I don't want to change the ACP from 936 to 65001. See https://stackoverflow.com/a/68066008.

fengzyf commented 6 months ago

Do you have any opinion on my guess? Where can I modify the so-called @args_new if something similar exists?

rschupp commented 6 months ago

I guess the reason for the occurrence is: When I run CmdArgs.exe äö.txt, CmdArgs.exe automatically decompresses to a temporary directory, recording the new execution file path as $path_new and the command line parameter @args_new (which also only supports characters in ACP (936), "äö" has been replaced by "??"), and then runs a new command based on $path_new and @args_new. Therefore, the command line parameter obtained by GetCommandLineW is C:\Users\lenovo\AppData\Local\Temp\par-5875\cache-622ad9aefc40f2ce41f40707644eb592d3c73c6a/CmdArgs.exe ??.txt.

That's basically correct. A few more details:

Note: The problem with command line handling for Perl on Windows is known, e.g. @ARGV, -CA and Win32, but there is no solution so far.

fengzyf commented 6 months ago

Thank you very much for your help! I am not a professional programmer and am not familiar with C. With the help of ChatGPT, I have solved this problem. If anything should be changed, please let me know.

I made changes only to boot.c. Insert the following code before par_init_env();:

    LPWSTR lpCmdLine = GetCommandLineW();

    LPWSTR* argv_w = CommandLineToArgvW(lpCmdLine, &argc);
    if (argv_w == NULL) {
        printf("Failed to parse command line\n");
        return 1;
    }

    /* Assumes argc/argv are main()'s parameters. Reserve one extra slot
     * for the NULL terminator that spawnvp() expects. */
    argv = (char**)malloc((argc + 1) * sizeof(char*));
    if (argv == NULL) {
        printf("Failed to allocate memory\n");
        LocalFree(argv_w);
        return 1;
    }

    for (int i = 0; i < argc; i++) {
        /* First call: ask how many bytes the UTF-8 form needs (NUL included). */
        int len = WideCharToMultiByte(CP_UTF8, 0, argv_w[i], -1, NULL, 0, NULL, NULL);
        if (len == 0) {
            printf("Failed to convert argument %d\n", i);
            LocalFree(argv_w);
            free(argv);
            return 1;
        }

        argv[i] = (char*)malloc(len * sizeof(char));
        if (argv[i] == NULL) {
            printf("Failed to allocate memory\n");
            LocalFree(argv_w);
            for (int j = 0; j < i; j++) {
                free(argv[j]);
            }
            free(argv);
            return 1;
        }

        /* Second call: perform the UTF-16 -> UTF-8 conversion. */
        WideCharToMultiByte(CP_UTF8, 0, argv_w[i], -1, argv[i], len, NULL, NULL);
    }
    argv[argc] = NULL;

Append the following after rc = spawnvp(P_WAIT, my_perl, (char* const*)argv);

    LocalFree(argv_w);
    for (int i = 0; i < argc; i++) {
        free(argv[i]);
    }
    free(argv);
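For readers without a Windows toolchain, the conversion that WideCharToMultiByte(CP_UTF8, ...) performs in the patch above can be approximated in portable C. This is a minimal sketch, not the Windows implementation: `utf16_to_utf8` is a hypothetical helper, and reporting an error on unpaired surrogates is a simplification chosen for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Convert a NUL-terminated UTF-16 string (host byte order) to UTF-8.
 * out must have room for 3 * (number of UTF-16 units) + 1 bytes.
 * Returns the number of bytes written (not counting the NUL),
 * or (size_t)-1 if an unpaired surrogate is found. */
size_t utf16_to_utf8(const unsigned short *src, char *out)
{
    size_t len = 0;

    assert(src != NULL && out != NULL);

    while (*src) {
        unsigned long cp = *src++;

        if (cp >= 0xD800 && cp <= 0xDBFF) {          /* high surrogate */
            if (*src < 0xDC00 || *src > 0xDFFF)
                return (size_t)-1;                   /* no low surrogate follows */
            cp = 0x10000 + ((cp - 0xD800) << 10) + (*src++ - 0xDC00);
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            return (size_t)-1;                       /* stray low surrogate */
        }

        if (cp < 0x80) {                             /* 1 byte */
            out[len++] = (char)cp;
        } else if (cp < 0x800) {                     /* 2 bytes */
            out[len++] = (char)(0xC0 | (cp >> 6));
            out[len++] = (char)(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {                   /* 3 bytes */
            out[len++] = (char)(0xE0 | (cp >> 12));
            out[len++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            out[len++] = (char)(0x80 | (cp & 0x3F));
        } else {                                     /* 4 bytes */
            out[len++] = (char)(0xF0 | (cp >> 18));
            out[len++] = (char)(0x80 | ((cp >> 12) & 0x3F));
            out[len++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            out[len++] = (char)(0x80 | (cp & 0x3F));
        }
    }
    out[len] = '\0';
    return len;
}
```

With this helper, the UTF-16 units 00E4 00F6 ("äö") come out as C3 A4 C3 B6 and 4E00 ("一") as E4 B8 80, the same bytes that appear in the main.c dump further down the thread.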

The installation steps are as follows:

The content of myperl.pl is as follows:

#!/usr/bin/perl

use strict;
use warnings;
use Encode   qw/decode/;
use utf8;
binmode(STDOUT, ":utf8");

@ARGV = map { decode("utf-8", $_) } @ARGV;
print join(", ", @ARGV);

Next, run the following command in cmd to create an executable file from the script and test it.

pp -o myperl.exe myperl.pl
chcp 65001
myperl.exe äö.txt äö.txt

Finally, the cmd window prints äö.txt, äö.txt.

rschupp commented 6 months ago

I committed a slightly edited version of your patch to branch "ci". Can you test with that?

fengzyf commented 6 months ago

I have tested the branch "ci" and it works well on my Windows system.

fengzyf commented 6 months ago

I have conducted new tests and found some issues. When I run myperl.exe 一 二 三 四 五 六 七 八 九 零 äö in a cmd window, it displays: 一, �?�?�?�?�?�?�?�?�?äö

Here, 一 二 三 四 五 六 七 八 九 零 are the Simplified Chinese characters for one, two, three, four, five, six, seven, eight, nine and zero.

If I run perl CmdArgs.pl 一 二 三 四 五 六 七 八 九 零 äö in cmd window, it displays:

<perl>
<CmdArgs.pl>
<一>
<二>
<三>
<四>
<五>
<六>
<七>
<八>
<九>
<零>
<äö>

I then made the following modification in boot.c, inserting this code after argv[wargc] = NULL;:

    FILE *file = fopen("E:\\argv.txt", "w");
    if (file != NULL) {
        for (int i = 0; i < argc; i++) {
            fprintf(file, "argv[%d]: %s\n", i, argv[i]);
        }
        fclose(file);
    }

The content of E:\argv.txt is as follows:

argv[0]: myperl.exe
argv[1]: 一
argv[2]: 二
argv[3]: 三
argv[4]: 四
argv[5]: 五
argv[6]: 六
argv[7]: 七
argv[8]: 八
argv[9]: 九
argv[10]: 零
argv[11]: äö

I cannot solve this problem and hope to receive your help.

rschupp commented 6 months ago

The content of E:\argv.txt is as follows:

OK, as expected. Now let's see what the custom perl interpreter sees (before perl gets to work). Please apply the following patch to myldr/main.c:

diff --git a/myldr/main.c b/myldr/main.c
index ffb8790..d0cb1b5 100644
--- a/myldr/main.c
+++ b/myldr/main.c
@@ -99,6 +99,18 @@ int main ( int argc, char **argv, char **env )
     PL_exit_flags |= PERL_EXIT_EXPECTED;
 #endif /* PERL_EXIT_EXPECTED */

+    {
+        int i;
+        unsigned char *p;
+        printf("main.c argv:\n");
+        for (i = 0; i < argc; i++) {
+            printf("[%i]", i);
+            for (p = (unsigned char*)argv[i]; *p; p++)
+                printf(" %02X", *p);
+            printf("\n");
+        }
+    }
+
     fakeargc = argc + 3;        /* allow for "-e", my_par_pl, "--" arguments */
 #ifdef PERL_PROFILING
     fakeargc++;                 /* "-d:DProf" */

Then rebuild and install PAR::Packer and repack myperl.exe. What's the output when you run myperl.exe 一 二 三 四 五 六 七 八 九 零 äö? And for comparison, what's the output for (in the PAR::Packer build directory) myldr\par.exe 一 二 三 四 五 六 七 八 九 零 äö?

fengzyf commented 6 months ago

When I run myperl.exe 一 二 三 四 五 六 七 八 九 零 äö, cmd displays

main.c argv:
[0] 43 3A 5C 55 73 65 72 73 5C 6C 65 6E 6F 76 6F 5C 41 70 70 44 61 74 61 5C 4C 6F 63 61 6C 5C 54 65 6D 70 5C 70 61 72 2D 36 63 36 35 36 65 36 66 37 36 36 66 5C 63 61 63 68 65 2D 62 37 32 62 34 61 33 31 65 38 66 66 32 64 63 61 37 31 34 37 37 37 31 32 33 61 35 32 63 33 65 65 35 33 35 30 61 30 32 32 2F 6D 79 70 65 72 6C 2E 65 78 65
[1] E4 B8 80
[2] E4 BA 3F E4 B8 3F E5 9B 3F E4 BA 3F E5 85 3F E4 B8 3F E5 85 3F E4 B9 3F E9 9B 3F C3 A4 C3 B6
一, �?�?�?�?�?�?�?�?�?äö

When I run myldr\par.exe 一 二 三 四 五 六 七 八 九 零 äö, cmd displays

main.c argv:
[0] 6D 79 6C 64 72 5C 70 61 72 2E 65 78 65
[1] D2 BB
[2] B6 FE
[3] C8 FD
[4] CB C4
[5] CE E5
[6] C1 F9
[7] C6 DF
[8] B0 CB
[9] BE C5
[10] C1 E3
[11] 3F 3F
par.pl: Can't open perl script "一": No such file or directory

fengzyf commented 6 months ago

I found the content below from https://learn.microsoft.com/en-us/cpp/c-runtime-library/spawn-wspawn-functions?view=msvc-170

The _spawn functions each create and execute a new process. They automatically handle multibyte-character 
string arguments as appropriate, recognizing multibyte-character sequences according to the multibyte 
code page currently in use. The _wspawn functions are wide-character versions of the _spawn functions; 
they don't handle multibyte-character strings. Otherwise, the _wspawn functions behave identically to 
their _spawn counterparts.
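The thread never pins down exactly where the bytes get mangled. One reading consistent with the documentation quoted above is that the UTF-8 bytes are scanned as if they were code page 936 text. The sketch below is a toy model of that idea, not the C runtime's actual code: `dbcs_scan_toy` is a made-up name, and the lead-byte range (0x81..0xFE) and trail-byte range (0x40..0xFE except 0x7F) are assumptions about CP936. A lead byte consumes the next byte, and an implausible pair is replaced by a single "?".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy double-byte scanner (an assumption, not the CRT implementation).
 * Copies n bytes from in to out, treating 0x81..0xFE as lead bytes.
 * out must hold at least n bytes. Returns the number of bytes written. */
size_t dbcs_scan_toy(const unsigned char *in, size_t n, unsigned char *out)
{
    size_t i = 0, len = 0;

    assert(in != NULL && out != NULL);

    while (i < n) {
        unsigned char b = in[i];

        if (b >= 0x81 && b <= 0xFE && i + 1 < n) {
            unsigned char t = in[i + 1];

            if (t >= 0x40 && t <= 0xFE && t != 0x7F) {
                out[len++] = b;          /* plausible pair: keep both bytes */
                out[len++] = t;
            } else {
                out[len++] = '?';        /* bad pair: both bytes become one '?' */
            }
            i += 2;                      /* the trail byte is consumed either way */
        } else {
            out[len++] = b;              /* single byte (ASCII, 0x80, or last byte) */
            i++;
        }
    }
    return len;
}
```

Applied to the UTF-8 bytes of 一 二 三 äö joined by spaces, the model swallows the spaces after 二 and 三 and yields E4 B8 80 20 E4 BA 3F E4 B8 3F C3 A4 C3 B6, the same pattern as argv[1] and argv[2] in the first dump.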

I made the following modification in boot.c: Add

#include <wchar.h>
#include <stdlib.h>
#include <process.h>

after #include <windows.h>

Replace rc = spawnvp(P_WAIT, my_perl, (char* const*)argv); with

    size_t my_perl_len = strlen(my_perl) + 1;
    wchar_t *w_my_perl = malloc(my_perl_len * sizeof(wchar_t));
    MultiByteToWideChar(CP_UTF8, 0, my_perl, -1, w_my_perl, my_perl_len);

    wchar_t **w_argv = malloc((argc + 1) * sizeof(wchar_t*));
    for (int i = 0; i < argc; ++i) {
        size_t arg_len = strlen(argv[i]) + 1;
        w_argv[i] = malloc(arg_len * sizeof(wchar_t));
        MultiByteToWideChar(CP_UTF8, 0, argv[i], -1, w_argv[i], arg_len);
    }
    w_argv[argc] = NULL;

    rc = _wspawnvp(_P_WAIT, w_my_perl, (const wchar_t* const*)w_argv);

    for (int i = 0; i < argc; ++i) {
        free(w_argv[i]);
    }
    free(w_argv);

    free(w_my_perl);

I then rebuilt and installed PAR::Packer and repacked myperl.exe and CmdArgs.exe. When I run myperl.exe 一 二 三 四 五 六 七 八 九 零 äö, cmd displays

main.c argv:
[0] 43 3A 5C 55 73 65 72 73 5C 6C 65 6E 6F 76 6F 5C 41 70 70 44 61 74 61 5C 4C 6F 63 61 6C 5C 54 65 6D 70 5C 70 61 72 2D 36 63 36 35 36 65 36 66 37 36 36 66 5C 63 61 63 68 65 2D 35 64 61 36 32 64 31 62 62 32 61 65 66 33 33 64 34 39 62 35 31 66 37 33 66 66 30 36 36 34 35 34 32 65 31 38 36 62 30 39 2F 6D 79 70 65 72 6C 2E 65 78 65
[1] D2 BB
[2] B6 FE
[3] C8 FD
[4] CB C4
[5] CE E5
[6] C1 F9
[7] C6 DF
[8] B0 CB
[9] BE C5
[10] C1 E3
[11] 3F 3F
һ, ��, ��, ��, ��, ��, ��, ��, ��, ��, ??

When I run CmdArgs.exe 一 二 三 四 五 六 七 八 九 零 äö, cmd displays

main.c argv:
[0] 43 3A 5C 55 73 65 72 73 5C 6C 65 6E 6F 76 6F 5C 41 70 70 44 61 74 61 5C 4C 6F 63 61 6C 5C 54 65 6D 70 5C 70 61 72 2D 36 63 36 35 36 65 36 66 37 36 36 66 5C 63 61 63 68 65 2D 38 37 32 30 30 38 63 32 64 31 31 66 38 36 34 34 65 65 66 63 33 36 37 32 35 32 39 66 38 62 38 61 33 37 62 62 65 33 34 64 2F 43 6D 64 41 72 67 73 2E 65 78 65
[1] D2 BB
[2] B6 FE
[3] C8 FD
[4] CB C4
[5] CE E5
[6] C1 F9
[7] C6 DF
[8] B0 CB
[9] BE C5
[10] C1 E3
[11] 3F 3F
<C:\Users\lenovo\AppData\Local\Temp\par-6c656e6f766f\cache-872008c2d11f8644eefc3672529f8b8a37bbe34d/CmdArgs.exe>
<一>
<二>
<三>
<四>
<五>
<六>
<七>
<八>
<九>
<零>
<äö>

rschupp commented 6 months ago

I made two changes:

I pushed this to the "ci" branch. It passes all tests except for the ones testing Windows command line quoting rules; I'll have to reimplement those using wchar instead of char.

Please test.
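The quoting rules mentioned here are the ones Microsoft documents for CommandLineToArgvW: backslashes are literal unless they precede a double quote; 2n backslashes before a quote yield n backslashes and the quote toggles quoting; 2n+1 backslashes yield n backslashes and a literal quote. The sketch below decodes a single argument under those rules, using plain char for brevity. `unquote_arg` is a hypothetical name, and the sketch ignores the special treatment of the program name and of doubled quotes inside a quoted string.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Decode one argument starting at s into out (NUL-terminated).
 * out must be at least strlen(s) + 1 bytes.
 * Returns a pointer to the first character after the argument. */
const char *unquote_arg(const char *s, char *out)
{
    int in_quotes = 0;

    assert(s != NULL && out != NULL);

    /* Outside quotes, a space or tab ends the argument. */
    while (*s && (in_quotes || (*s != ' ' && *s != '\t'))) {
        if (*s == '\\') {
            size_t n = strspn(s, "\\");      /* length of the backslash run */

            if (s[n] == '"') {
                memset(out, '\\', n / 2);    /* each pair -> one backslash */
                out += n / 2;
                if (n % 2)
                    *out++ = '"';            /* odd run: the quote is literal */
                else
                    in_quotes = !in_quotes;  /* even run: the quote is a delimiter */
                s += n + 1;
            } else {
                memcpy(out, s, n);           /* not before a quote: all literal */
                out += n;
                s += n;
            }
        } else if (*s == '"') {
            in_quotes = !in_quotes;          /* bare quote toggles quoting */
            s++;
        } else {
            *out++ = *s++;
        }
    }
    *out = '\0';
    return s;
}
```

Under these rules, a quoted argument such as "一 二" should arrive as one argument rather than two, which is what a wide-character quoting routine in boot.c has to guarantee on the sending side.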

fengzyf commented 6 months ago

Thanks again! I tested the latest "ci" branch, and when I ran CmdArgs.exe 一 二 三 四 五 六 七 八 九 零 äö or CmdArgs.exe "一 二" 三 四 五 六 七 八 九 零 äö, cmd displayed

<C:\Users\Xu\AppData\Local\Temp\par-5875\cache-f398e2715e4659c6ca2ef0ccdec734fe1265a7a5/CmdArgs.exe>
<一>
<二>
<三>
<四>
<五>
<六>
<七>
<八>
<九>
<零>
<äö>

I made the following modification in boot.c, adding this function:

wchar_t* shell_quote_wide(const wchar_t *src)
{
    /* some characters from src may be replaced with two chars,
     * add enclosing quotes and trailing \0 */
    wchar_t *dst = malloc((2 * wcslen(src) + 3) * sizeof(wchar_t));

    const wchar_t *p = src;
    wchar_t *q = dst;
    wchar_t c;

    *q++ = L'"';                         /* opening quote */

    while ((c = *p))
    {
        if (c == L'\\')
        {
            int n = wcsspn(p, L"\\");    /* span of backslashes starting at p */

            wmemcpy(q, p, n);
            q += n;

            if (p[n] == L'\0' || p[n] == L'"') /* span ends in quote or NUL */
            {
                wmemcpy(q, p, n);
                q += n;
            }

            p += n;                     /* advance over the span */
            continue;
        }

        if (c == L'"')
            *q++ = L'\\';                /* escape the following quote */
        *q++ = c;
        p++;
    }

    *q++ = L'"';                         /* closing quote */
    *q++ = L'\0';

    return dst;
}

after these existing lines:

    return dst;
}

fengzyf commented 6 months ago

Thank you so much for your help! The latest version of "ci" branch solves my problem.