[discussion] Native unicode support

avih commented 1 year ago

So, I'm looking into adding native unicode support.

Basically similar to what the utf8 manifest does, but manually (adding U variant of win32 APIs which use utf8 prototypes and then internally calls the W APIs - instead of letting the manifest change the A APIs into utf8).

So I created this issue to ask related questions instead of opening an issue on each question.

First question: I might need to maintain a utf8 version of environ independent of the system environ (with its own *env API). Does busybox-w32 expect putenv to add the string to the environment, so that e.g. this prints 1 ?

char xenv[] = "x=0";
putenv(xenv);
xenv[2] = '1';
printf("%s\n", getenv("x"));

With POSIX *env API (and on linux) it prints 1, but on windows, as far as I can tell _putenv makes a copy of xenv, so the user's string doesn't actually become part of the environment, and it prints 0.

Also, with POSIX, it's valid to change environ to point to something else (user-provided array, for instance).

I see that mingw_putenv does a trick to set an empty value, and it's indeed aware that the buffer gets copied (hence environ is iterated to find the copy to truncate).

Do you know if these use cases (change the buffer after putenv, or change environ) happen with upstream busybox? If yes, how does it work on windows?

(setenv does make a copy, and so it can be more efficient to access using something less linear than environ).

rmyorston commented 1 year ago

I'm not aware of any case which relies on being able to change the buffer after putenv.

ash changes environ in evalcommand() when running a nofork applet. The existing value of environ is saved and restored after the applet completes.

avih commented 1 year ago

Thanks.

I ended up (ab)using the ANSI environ to store the UTF8 values, same as I did 3 years ago here https://github.com/rmyorston/busybox-w32/issues/17#issuecomment-681927971 so basically the native unicode thing uses the ANSI environ normally as far as the system goes, just that the encoding of non-ascii values is utf8 instead of ACP (and the mingw getenv/putenv/etc wrappers remain as is).

So any trick busybox or busybox-w32 do (like with the empty values, or the clearenv implementation), should work out of the box also with the native unicode mode.

ash changes environ in evalcommand()

As far as I can tell, when spawn is used - it takes the process env - not the CRT environ, but the process env is hopefully up to date because putenv updates both the crt environ (and possibly also _wenviron) and also the process env (e.g. using SetEnvironmentVariable[W]).

But if you modify environ to point to something else, how does the process env gets updated to the new values?

Still on the subject of environ.

I see that there were some commits related to UCRT. As far as I can tell, the current state is that we can't spawn with UCRT with non-null env, or else it will crash?

And so, if we need to spawn with a custom env array, it basically clears the current env and sets it from this custom array, and then calls spawn with a NULL env so that it would be taken from the environment (which is now a copy of the custom array).

Is that correct?

Originally I meant to handle NULL env to spawn by simply replacing it with environ (which is then converted to a wide array before calling _wspawnve), so that would presumably crash with UCRT? (it's not a problem to update the process env if it's null, it's just less code to replace NULL with environ, even if possibly slower to convert all of environ to a wide array).

Can you reproduce the crash issue with non-NULL env? Any pointers how to build busybox-w32 with UCRT?

rmyorston commented 1 year ago

Nofork applets run in the same process as their parent shell. There's no spawn in this case.

Passing a non-NULL env to spawn with UCRT causes any subsequent spawn in the child process to fail. If the child process doesn't spawn a grandchild there's no problem. The workaround is to arrange things so that the initial spawn always has a NULL env.

I haven't reproduced this recently. When I worked on the problem I used the UCRT variant of MSYS2 on Windows.

avih commented 1 year ago

Passing a non-NULL env to spawn with UCRT causes any subsequent spawn in the child process to fail. If the child process doesn't spawn a grandchild there's no problem. The workaround is to arrange things so that the initial spawn always has a NULL env.

Thanks.

So how does one trigger the crash scenario as far as a user test case goes?

Also, I might want to put all the UTF8 APIs implementations in a new C file (maybe mingw-utf8.c or some such), possibly with a matching header.

How should I go about it as far as makefiles/other-requirements go? I tried grepping mingw.c and mingw.o (to check how mingw.c is compiled depending on CONFIG_PLATFORM_MINGW32) but couldn't find how this happens.

I see it's in win32/Kbuild. I used ag (the silver searcher) and apparently it skipped some files...

Do I need to mention the .h file someplace so that it rebuilds the .c file when the header changes?

rmyorston commented 1 year ago

So how does one trigger the crash scenario as far as a user test case goes?

You'd need to:

Revert the fix in the shell: sed -i 's/_UCRT/_WHATEVER/' shell/ash.c
Build busybox-w32 using the UCRT variant of MSYS2
Have the shell spawn an application that spawns some other application. In the original issue the first application was GNU make and I used busybox-w32 time as the second.

Do I need to mention the .h file someplace

That shouldn't be necessary.

avih commented 1 year ago

Thanks.

I still didn't get to try UCRT, but I'll update here once I do.

Regardless, before you make the next release, can I please review the release notes about unicode?

rmyorston commented 1 year ago

Draft release notes

avih commented 1 year ago

Draft release notes

Thanks. Generally looks good. Few minor comments:

It isn't necessary to adjust the console code page: this is handled transparently.

That might sound as if we're changing the console CP. Assuming we don't want to imply this, maybe remove "this is handled transparently", or maybe reword as something like "Unicode output and interactive input work with any console code page."

It has been necessary to apply workarounds for deficiencies in Microsoft's current support for UTF-8. There may be undiscovered problems and the workarounds themselves may have issues.

I believe this refers only to UTF8_INPUT and/or UTF8_OUTPUT? (I don't recall other workarounds)

If yes, then it might be useful to use something like "... deficiencies in Microsoft's current support for UTF-8 in interactive use and printouts to the console"

Or some other wording to limit the statement to console IO, as this doesn't refer, as far as I can tell, to the ability to work with files with unicode names, like with tar or cat.

The Windows Terminal works better than the Windows console.

Maybe add "especially when it comes to rendering unicode". I don't think there's difference in editing as long as the glyphs are supported.

As an additional note, I found out that other terminals also work better than the windows console, like vscode builtin terminal (amazing combining chars rendering), or hyper or wez (that's not an endorsement), and possibly more.

All the mentioned ones seem to handle combining chars better than the windows terminal, tested with this - check "The Three Kingdoms" alignment thingy (the whole page pastes just fine into a here-document in an interactive sh prompt ;) ).

However, all of those also don't have mouse-wheel events for console apps which support it (like less for windows), while such apps do work at the windows terminal/console.

But I agree that mentioning only the windows terminal is enough.

Editing (cursor movement, delete, etc) can be buggy, especially with double-width or combining characters.

A reminder that while it doesn't fix all busybox editing issues, I do have a version with wcwidth updated to latest Unicode (15). We already pushed a tiny update only for emojis, but the full update has much more. I can arrange quickly a PR with the full update if you want, which will be this, but it will use it unconditionally with UTF8_MANIFEST (and disable the existing table). Or just feel free to take this commit, edit it as you see fit, and push it yourself.

I don't know how much it affects practical use cases, but I believe it would affect that rocketship prompt which was mentioned in another issue, because that rocket glyph is double width, but current busybox[-w32] doesn't know that, so if it appears as part of PS1 at the last line of the prompt, busybox will get confused with the line wrap (of the user input which follows). The full update to Unicode 15 does include it as double width.

(mingw64-gcc 12.2.1-8.fc38; mingw64-crt 10.0.0-4.fc38; glob; Unicode)

As I think it's feasible to get the native unicode working, maybe change it to Unicode-manifest or Unicode-Win10 so that the difference becomes visible (I'm not suggesting again to also change the download file name).

Regardless of Unicode, this looks like another good release, and good notes. Thanks again for maintaining it.

rmyorston commented 1 year ago

I originally wrote

It isn't necessary to adjust the console code page: this is handled automatically.

but thought that might imply the code page was changed automatically. So I said 'transparently' instead.

End users probably don't care too much about the exact nature of Microsoft's deficiencies so I prefer to leave them unspecified.

I'll have a look at the wcwidth commit later.

The release certainly isn't going to be FRP-5179. Testing has uncovered problems on Wine which will require at least one workaround.

avih commented 1 year ago

I'll have a look at the wcwidth commit later.

Sure. Let me know if you need anything.

The release certainly isn't going to be FRP-5179. Testing has uncovered problems on Wine which will require at least one workaround.

Hmm... I presume that's in your private test setup?

However, all of those also don't have mouse-wheel events for console apps which support it (like less for windows), while such apps do work at the windows terminal/console.

First of all, I was wrong. Mouse events do work in wezterm, but not the others (vscode, hyper).

FWIW, this seems to be an issue of older (win10, but not win11) conpty.dll, which apparently doesn't do mouse events well.

wezterm bundles a newer conpty.dll, so mouse wheel works.

rmyorston commented 1 year ago

I've decided not to take the wcwidth commit for now. I always seem to have cause to regret last-minute changes made just before a release.

I'm not impressed with the latest version of Wine in Fedora. There are a number of regressions. I've worked around the most serious; the rest can wait. It's not like Wine is an important platform for busybox-w32.

Hmm... I presume that's in your private test setup?

Not private: testsuite/runtest in busybox-w32 works on Windows as well as Linux. There are more failures on Windows, but that's only to be expected.

I'll restart my testing with FRP-5181 and see how that goes.

avih commented 1 year ago

I've decided not to take the wcwidth commit for now

Sure. But if the unicode build gains traction, then I think it would be good to get this or something similar in. The upstream wcwidth is 15 years out of date, and the number of meme double-width codepoints/icons has grown considerably since then - both in official support and in various meme use cases, like very fancy PS1.

I'm not suggesting to push it upstream because TBH I don't think we should deal with the "compression" of the data at upstream wcwidth whenever a new Unicode version is released (I think Unicode 16 should be released in few weeks).

I don't think there are existing scripts to automate it from the Unicode data to the busybox "compression", so taking it from someplace which does automate it sounds a lot better to me, especially where we might have slightly less concerns with the balance of program size vs features.

Not private: testsuite/runtest in busybox-w32 works on Windows as well as Linux. There are more failures on Windows, but that's only to be expected.

Do the linux parts refer to testing busybox-w32 on linux using wine? or to testing the native linux busybox?

Anyway, back to the subject of this issue: native unicode support.

I've noticed that in some win32 files, libbb.h is included before other standard headers, for instance at fsync.c or glob.c or poll.c or even mingw.c.

Now, libbb.h includes mingw.h, and mingw.h has some mappings of default names, such as fopen to mingw_fopen, etc.

So the point is that I think libbb.h should be included last, so that it doesn't overrides system names before their respective system header is included.

I don't think I can identify a reason it's included first. Is there such reason?

If this is expected to work because libbb.h already includes these headers, and so they'll no-op the second #include, then I think it's kind of meh IMO, but would still be good to know.

I'm asking this because the utf8 API would need similar global mapping which would probably apply via mingw.h (like CreateFileA to createFile_U), but if windows.h is included after libbb.h, then this poses an issue (and same for other names which would require such mappings).

(of course, names which are already mapped globally currently, like mingw_fopen, would not have global mapping to the utf8 variants. These name would map to the utf8 variants at the C file where the current mingw/winansi wrapper is implemented).

rmyorston commented 1 year ago

Do the linux parts refer to testing busybox-w32 on linux using wine? or to testing the native linux busybox?

The latter. It's possible to build and test a native Linux binary using the busybox-w32 source. The test output doesn't exactly match that obtained using the upstream source. That's because some Windows tests are included when they really shouldn't be. I'll fix that.

libbb.h tends to be included first because that's how it's always been done:

fa37e9da4b (Nguyễn Thái Ngọc Duy 2010-04-14 02:12:01 +0200    5) #include "libbb.h"
fa37e9da4b (Nguyễn Thái Ngọc Duy 2010-04-14 02:12:01 +0200    6) #include <windows.h>
1ee308c75f (Ron Yorston          2021-12-22 08:22:12 +0000    7) #include "lazyload.h"
fa37e9da4b (Nguyễn Thái Ngọc Duy 2010-04-14 02:12:01 +0200    8) #undef PACKED
fa37e9da4b (Nguyễn Thái Ngọc Duy 2010-04-14 02:12:01 +0200    9)

I've just been following that existing practice.

avih commented 1 year ago

I've just been following that existing practice.

Would you be against moving libbb.h it to the end?

For instance, mapping fprintf to mingw_printf would be disastarous if this happens before fprintf is declared...

Anyway, next subject.

I was trying to evaluate which/how-many APIs would need to be mapped from ansi to UTF8. Searching APIs at the code did not seem productive, so I wrote a script which produces this list of the ansi symbols used by busybox.exe.

List of ansi symbols used by `busybox.exe` (about 100)

``` $ w32sym.sh -A busybox.exe GetUserNameA CreateEventA CreateFileA CreateFileMappingA CreateNamedPipeA CreateProcessA FillConsoleOutputCharacterA FindFirstFileA FindNextFileA GetCompressedFileSizeA GetDiskFreeSpaceExA GetDriveTypeA GetFileAttributesA GetFileAttributesExA GetFullPathNameA GetModuleFileNameA GetModuleHandleA GetSystemDirectoryA GetVersionExA GetVolumeInformationA LoadLibraryA LoadLibraryExA MoveFileExA Process32First Process32Next ReadConsoleInputA SetConsoleTitleA SetEnvironmentVariableA SetFileAttributesA __argv __getmainargs __initenv _access _ctime64 _environ _fdopen _fullpath _getch _open _putenv _vscprintf _vsnprintf fgets fopen fprintf fputc fputs getenv isalnum isalpha iscntrl isgraph islower isprint ispunct isspace isupper isxdigit putc putchar rename setlocale sprintf sscanf strcat strchr strcmp strcpy strcspn strftime strlen strncmp strncpy strpbrk strrchr strspn strstr strtok strtol strtoul tolower toupper ungetc vfprintf _unlink _spawnve _rmdir _open _mktemp _mkdir _getcwd _getche _fdopen _creat _chmod _chdir DispatchMessageA PeekMessageA WSASocketA ```

List of all symbols used by `bysybox.exe`, with the ansi ones marked, with the respective wide symbol

``` $ w32sym.sh -a busybox.exe ADVAPI32.dll: AccessCheck CheckTokenMembership DuplicateToken EqualSid GetSecurityInfo GetTokenInformation [A] GetUserNameA (GetUserNameW) OpenProcessToken SaferComputeTokenFromLevel SaferCreateLevel SetTokenInformation SystemFunction036 KERNEL32.dll: CloseHandle CreateConsoleScreenBuffer [A] CreateEventA (CreateEventW) [A] CreateFileA (CreateFileW) [A] CreateFileMappingA (CreateFileMappingW) [A] CreateNamedPipeA (CreateNamedPipeW) CreatePipe [A] CreateProcessA (CreateProcessW) CreateRemoteThread CreateToolhelp32Snapshot DeleteCriticalSection DeviceIoControl DuplicateHandle EnterCriticalSection FileTimeToSystemTime FillConsoleOutputAttribute [A] FillConsoleOutputCharacterA (FillConsoleOutputCharacterW) FindClose [A] FindFirstFileA (FindFirstFileW) FindFirstVolumeW [A] FindNextFileA (FindNextFileW) FindNextVolumeW FindVolumeClose FlushFileBuffers FreeEnvironmentStringsW GenerateConsoleCtrlEvent GetACP GetCPInfo GetCommandLineW [A] GetCompressedFileSizeA (GetCompressedFileSizeW) GetConsoleCP GetConsoleMode GetConsoleOutputCP GetConsoleScreenBufferInfo GetConsoleWindow GetCurrentProcess [A] GetDiskFreeSpaceExA (GetDiskFreeSpaceExW) GetDiskFreeSpaceExW [A] GetDriveTypeA (GetDriveTypeW) GetEnvironmentStringsW GetEnvironmentVariableW GetExitCodeProcess [A] GetFileAttributesA (GetFileAttributesW) [A] GetFileAttributesExA (GetFileAttributesExW) GetFileInformationByHandle GetFileSizeEx GetFileType [A] GetFullPathNameA (GetFullPathNameW) GetLastError GetLogicalDrives [A] GetModuleFileNameA (GetModuleFileNameW) [A] GetModuleHandleA (GetModuleHandleW) GetNumberOfConsoleInputEvents GetProcAddress GetProcessAffinityMask GetProcessId GetProcessTimes GetStdHandle [A] GetSystemDirectoryA (GetSystemDirectoryW) GetSystemInfo GetSystemTimeAsFileTime GetTickCount [A] GetVersionExA (GetVersionExW) [A] GetVolumeInformationA (GetVolumeInformationW) GetVolumeInformationW InitializeCriticalSection IsDBCSLeadByteEx IsValidCodePage IsWow64Process LeaveCriticalSection [A] LoadLibraryA (LoadLibraryW) [A] LoadLibraryExA (LoadLibraryExW) LocalFree MapViewOfFile [A] MoveFileExA (MoveFileExW) MultiByteToWideChar OpenProcess PeekConsoleInputW PeekNamedPipe [A] Process32First (Process32FirstW) [A] Process32Next (Process32NextW) [A] ReadConsoleInputA (ReadConsoleInputW) ReadConsoleInputW ReadDirectoryChangesW ReadProcessMemory ResetEvent SetConsoleActiveScreenBuffer SetConsoleCP SetConsoleCtrlHandler SetConsoleCursorPosition SetConsoleMode SetConsoleOutputCP SetConsoleScreenBufferSize SetConsoleTextAttribute [A] SetConsoleTitleA (SetConsoleTitleW) SetEndOfFile [A] SetEnvironmentVariableA (SetEnvironmentVariableW) SetEnvironmentVariableW SetErrorMode [A] SetFileAttributesA (SetFileAttributesW) SetFilePointer SetFileTime SetHandleInformation SetLastError SetSystemTime SetUnhandledExceptionFilter Sleep SleepEx TerminateProcess TlsGetValue UnmapViewOfFile VirtualProtect VirtualQuery WaitForMultipleObjects WaitForSingleObject WideCharToMultiByte WriteConsoleW msvcrt.dll: __C_specific_handler ___mb_cur_max_func [A] __argv (__wargv) [A] __getmainargs (__wgetmainargs) [A] __initenv (__winitenv) __iob_func __set_app_type __setusermatherr [A] _access (_waccess) _amsg_exit _cexit _commode [A] _ctime64 (_wctime64) [A] _environ (_wenviron) _errno _exit [A] _fdopen (_wfdopen) _filbuf _fmode _fstat64 [A] _fullpath (_wfullpath) _get_osfhandle [A] _getch (_getwch) _gmtime64 _initterm _isatty _localtime64 _lseeki64 _mktime64 _onexit [A] _open (_wopen) _open_osfhandle _pipe [A] _putenv (_wputenv) _setjmp _setmode _stricmp _strnicmp _telli64 _time64 _timezone _tzset [A] _vscprintf (_vscwprintf) [A] _vsnprintf (_vsnwprintf) _wspawnve abort atof atoi bsearch calloc clearerr clock exit fclose feof ferror fflush [A] fgets (fgetws) [A] fopen (_wfopen) [A] fprintf (fwprintf) [A] fputc (fputwc) [A] fputs (fputws) fread free fseek fwrite [A] getenv (_wgetenv) [A] isalnum (iswalnum) [A] isalpha (iswalpha) [A] iscntrl (iswcntrl) [A] isgraph (iswgraph) [A] islower (iswlower) [A] isprint (iswprint) [A] ispunct (iswpunct) [A] isspace (iswspace) [A] isupper (iswupper) [A] isxdigit (iswxdigit) localeconv longjmp malloc mbstowcs memchr memcmp memmove [A] putc (putwc) [A] putchar (putwchar) qsort rand realloc [A] rename (_wrename) setbuf [A] setlocale (_wsetlocale) signal [A] sprintf (swprintf) srand [A] sscanf (swscanf) [A] strcat (wcscat) [A] strchr (wcschr) [A] strcmp (wcscmp) [A] strcpy (wcscpy) [A] strcspn (wcscspn) strerror [A] strftime (wcsftime) [A] strlen (wcslen) [A] strncmp (wcsncmp) [A] strncpy (wcsncpy) [A] strpbrk (wcspbrk) [A] strrchr (wcsrchr) [A] strspn (wcsspn) [A] strstr (wcsstr) [A] strtok (wcstok) [A] strtol (wcstol) [A] strtoul (wcstoul) [A] tolower (towlower) [A] toupper (towupper) [A] ungetc (ungetwc) [A] vfprintf (vfwprintf) wcschr wcscpy wcslen wcsncmp wcstombs _strtoui64 _strtoi64 _write _wcsnicmp [A] _unlink (_wunlink) _umask _tzset _strdup [A] _spawnve (_wspawnve) [A] _rmdir (_wrmdir) _read [A] _open (_wopen) [A] _mktemp (_wmktemp) [A] _mkdir (_wmkdir) _getpid [A] _getcwd (_wgetcwd) [A] _getche (_getwche) _fileno [A] _fdopen (_wfdopen) _dup2 _dup [A] _creat (_wcreat) _close [A] _chmod (_wchmod) [A] _chdir (_wchdir) _stricmp SHELL32.dll: CommandLineToArgvW USER32.dll: [A] DispatchMessageA (DispatchMessageW) MsgWaitForMultipleObjects [A] PeekMessageA (PeekMessageW) TranslateMessage WS2_32.dll: WSACleanup WSAEnumNetworkEvents WSAEventSelect WSAGetLastError WSASetLastError [A] WSASocketA (WSASocketW) WSAStartup __WSAFDIsSet accept bind closesocket connect freeaddrinfo getaddrinfo gethostbyaddr gethostname getnameinfo getpeername getservbyname inet_addr inet_ntoa listen recv select setsockopt shutdown ```

I'm guessing a good bunch of the ansi APIs don't need special handling, e.g. many IO APIs, like fprintf, already work fine with UTF8 too (i.e. no need to add a UTF8 wrapper) , and similarly for string APIs like strlen (which has an equivalent wcslen), or we probably don't need iswlower, etc.

Once we have native UTF8 working, we can add a white list of allowed ansi symbols, and then use the script occasionally to check if we need to add more UTF8 wrappers etc.

So this is largely an FYI, but I'd appreciate if you could glance over it and see if anything stands out as wrong.

For reference, here's the current script: w32sym.zip

avih commented 1 year ago

I'm guessing a good bunch of the ansi APIs don't need special handling...

Yeah, I think I overshoot a bit with the matching wide symbols.

These are correct matching symbols (e.g. strlen, wcslen) but they're only different because they operate on different type.

Compared, for instance, to [_w]fopen or CreateFile{A,W}, which have different reach - the ANSI ones are more limited compared to the wide ones, and these are the ones we mainly care about.

So here's the revised shorter list (about half) of ansi symbols + wide matches

``` $ w32sym.sh -a ./busybox.exe | grep '\[' [A] GetUserNameA (GetUserNameW) [A] CreateEventA (CreateEventW) [A] CreateFileA (CreateFileW) [A] CreateFileMappingA (CreateFileMappingW) [A] CreateNamedPipeA (CreateNamedPipeW) [A] CreateProcessA (CreateProcessW) [A] FillConsoleOutputCharacterA (FillConsoleOutputCharacterW) [A] FindFirstFileA (FindFirstFileW) [A] FindNextFileA (FindNextFileW) [A] GetCompressedFileSizeA (GetCompressedFileSizeW) [A] GetDiskFreeSpaceExA (GetDiskFreeSpaceExW) [A] GetDriveTypeA (GetDriveTypeW) [A] GetFileAttributesA (GetFileAttributesW) [A] GetFileAttributesExA (GetFileAttributesExW) [A] GetFullPathNameA (GetFullPathNameW) [A] GetModuleFileNameA (GetModuleFileNameW) [A] GetModuleHandleA (GetModuleHandleW) [A] GetSystemDirectoryA (GetSystemDirectoryW) [A] GetVersionExA (GetVersionExW) [A] GetVolumeInformationA (GetVolumeInformationW) [A] LoadLibraryA (LoadLibraryW) [A] LoadLibraryExA (LoadLibraryExW) [A] MoveFileExA (MoveFileExW) [A] Process32First (Process32FirstW) [A] Process32Next (Process32NextW) [A] ReadConsoleInputA (ReadConsoleInputW) [A] SetConsoleTitleA (SetConsoleTitleW) [A] SetEnvironmentVariableA (SetEnvironmentVariableW) [A] SetFileAttributesA (SetFileAttributesW) [A] __argv (__wargv) [A] __getmainargs (__wgetmainargs) [A] __initenv (__winitenv) [A] _access (_waccess) [A] _ctime64 (_wctime64) [A] _environ (_wenviron) [A] _fdopen (_wfdopen) [A] _fullpath (_wfullpath) [A] _open (_wopen) [A] _putenv (_wputenv) [A] fopen (_wfopen) [A] getenv (_wgetenv) [A] rename (_wrename) [A] setlocale (_wsetlocale) [A] _unlink (_wunlink) [A] _spawnve (_wspawnve) [A] _rmdir (_wrmdir) [A] _open (_wopen) [A] _mktemp (_wmktemp) [A] _mkdir (_wmkdir) [A] _getcwd (_wgetcwd) [A] _fdopen (_wfdopen) [A] _creat (_wcreat) [A] _chmod (_wchmod) [A] _chdir (_wchdir) [A] DispatchMessageA (DispatchMessageW) [A] PeekMessageA (PeekMessageW) [A] WSASocketA (WSASocketW) ```

And here's the revised script. By default it now produces the short list (limited ANSI APIs), but it can also produce the long list with the general ansi APIs: w32sym.v2.zip

But I'm guessing it can't detect symbols which are loaded dynamically? Not sure...

rmyorston commented 1 year ago

Would you be against moving libbb.h it to the end?

Not if it's strictly necessary.

But I'm guessing it can't detect symbols which are loaded dynamically? Not sure...

I guess not. They'll appear as strings in the binary not as symbols. Easy enough to find in the source, though.

avih commented 1 year ago

Have the shell spawn an application that spawns some other application. In the original issue the first application was GNU make and I used busybox-w32 time as the second.

Hmm.. I can't reproduce this.

I've built busybox.exe with UCRT on windows, using winlibs, i386, adding to the path a dir with busybox and the output of busybox.exe --install . minus ar, plus xxd, pkg-config and make from the w64devkit bin dir (it probably doesn't need make because winlibs has it, but it does seem to need a non-busybox xxd).

I got many warnings where LL_FMT and OFF_FMT are used - these seem to be ignored (it thinks "...%"LL_FMT"d" is %d, and I'm not sure why, but I know that it definitely goes through the #define LL_FMT "I64" line), but it did build, and it does depend/linked with ucrtbase.dll.

Example warning:

editors/awk.c:925:44: warning: format '%d' expects argument of type 'int', but argument 4 has type 'long long int' [-Wformat=]
  925 |                 snprintf(g_buf, MAXVARFMT, "%"LL_FMT"d", (long long)n);
      |                                            ^~~           ~~~~~~~~~~~~
      |                                                          |
      |                                                          long long int
editors/awk.c:925:54: note: format string is defined here
  925 |                 snprintf(g_buf, MAXVARFMT, "%"LL_FMT"d", (long long)n);
      |                                             ~~~~~~~~~^
      |                                                      |
      |                                                      int
      |                                             %"LL_FMT"lld

I then used a (wide) spawn variant which replaces NULL env with environ, so that spawn is never called with a NULL env, and does get called with environ env.

I couldn't quite figure out how to setup the make/time example, so I tried these, and all seem to work:

./busybox sh -c './busybox sh -c busybox'
./busybox sh -c 'sh -c "sh -c busybox"'
./busybox sh -c 'sh -c "./busybox printf xxx | ./busybox grep xxx"'

It would really help to have some one-liner test case to reproduce it...

Anyway, for now, I'll keep the current UCRT code which uses NULL env, and I'll handle exporting the UTF8 env to the system env, because that would also typically be much quicker to convert only he unicode values instead of converting the whole UTF8 environ to wide. It's not a big function so that should be fine.

But I would still like to be able to reproduce the UCRT crash when passing environ env, and then confirm that the NULL env fixes it.

avih commented 1 year ago

I tried these, and all seem to work:

./busybox sh -c './busybox sh -c busybox' ./busybox sh -c 'sh -c "sh -c busybox"' ./busybox sh -c 'sh -c "./busybox printf xxx | ./busybox grep xxx"'

I tried also with the latest winlibs x86-64, plain non-any-utf8 build, with only this diff to the source:

diff --git a/shell/ash.c b/shell/ash.c
index 5a5c947e8..ec11e8f56 100644
--- a/shell/ash.c
+++ b/shell/ash.c
@@ -9126,7 +9126,7 @@ tryexec(IF_FEATURE_SH_STANDALONE(int applet_no,) const char *cmd, char **argv, c
 # else
        if (APPLET_IS_NOEXEC(applet_no)) {
 # endif
-#if ENABLE_PLATFORM_MINGW32 && !defined(_UCRT)
+#if 1 || ENABLE_PLATFORM_MINGW32 && !defined(_UCRT)
            /* If building for UCRT move this up into shellexec() to
             * work around a bug. */
            clearenv();
@@ -9203,7 +9203,7 @@ static void shellexec(char *prog, char **argv, const char *path, int idx)
    int applet_no = -1; /* used only by FEATURE_SH_STANDALONE */

    envp = listvars(VEXPORT, VUNSET, /*strlist:*/ NULL, /*end:*/ NULL);
-#if ENABLE_PLATFORM_MINGW32 && defined(_UCRT)
+#if 0 && ENABLE_PLATFORM_MINGW32 && defined(_UCRT)
    /* Avoid UCRT bug by updating parent's environment and passing a
     * NULL environment pointer to execve(). */
    clearenv();

And tried also this case:

./busybox sh -c 'X=x ./busybox sh -c "Z=z ./busybox"'

Which also works fine as far as I can tell..

As for the warning, I think I previously missed also this line, which is in line with my statement that it's defined as I64:

include/libbb.h:293:17: note: format string is defined here
  293 | #define LL_FMT "I64"
      |                 ^

So I'm guessing in winlibs it doesn't like I64? which is weird. It's defined like this:

#if ENABLE_PLATFORM_MINGW32 && \
    (!defined(__USE_MINGW_ANSI_STDIO) || !__USE_MINGW_ANSI_STDIO)
#define LL_FMT "I64"
#else
#define LL_FMT "ll"
#endif

So maybe __USE_MINGW_ANSI_STDIO is ignored/doesn't-exist with UCRT? this might make sense, because the mingw stdio exists to work around msvcrt issues, which maybe don't exist with UCRT, and so this define is gone?

rmyorston commented 1 year ago

I'm also unable to reproduce the problem. Maybe Microsoft have fixed UCRT.

avih commented 1 year ago

OK, I'll leave it up to you to decide what to do with the UCRT workaround.

Meanwhile, I'll just support the standard behavior with (utf8) spawn: allow and "forward" both NULL and non-NULL env, and if it's NULL then ensure the system wide env is up to date (instead of the smaller and slower hack of replacing NULL with environ and then converting all of it to wide).

As for the UCRT I64 warnings, I have a patch and will send a PR soon.

I'm Just confirming first it compiles without warnings with/without UCRT for i686/x86-64.

rmyorston commented 1 year ago

OK, this rabbit hole was a bit deeper than expected.

It bothered me that I was unable to reproduce the problem with UCRT. The original issue (#234) was related to an interaction between GNU make and busybox-w32 when they were both compiled with UCRT.

I was able to reproduce the problem using GNU make from back then but wanted to do it solely with current busybox-w32 compiled with the current MinGW-w64 UCRT toolchain.

The issue in GNU make is that it passes a non-default environment to CreateProcess(). In busybox-w32 CreateProcess() is used in the implementation of popen(3), so I hacked that to have a non-default environment:

diff --git a/win32/popen.c b/win32/popen.c
index 2208aa6bb..de2f2b1fb 100644
--- a/win32/popen.c
+++ b/win32/popen.c
@@ -190,6 +190,7 @@ static int mingw_popen_internal(pipe_data *p, const char *cmd,
    int success;
    int fd = -1;
    int ip, ic, flags;
+   char env[] = "HELLO=1\0";

    if ( cmd == NULL || *cmd == '\0' || mode == NULL ) {
        return -1;
@@ -251,7 +252,7 @@ static int mingw_popen_internal(pipe_data *p, const char *cmd,
                NULL,              /* primary thread security attributes */
                TRUE,              /* handles are inherited */
                0,                 /* creation flags */
-               NULL,              /* use parent's environment */
+               env,
                NULL,              /* use parent's current directory */
                &siStartInfo,      /* STARTUPINFO pointer */
                &p->piProcInfo);   /* receives PROCESS_INFORMATION */

Then I modified the shell to force it to spawn all applets and reverted the UCRT workaround in the shell:

diff --git a/shell/ash.c b/shell/ash.c
index 2ea87a049..0f1862424 100644
--- a/shell/ash.c
+++ b/shell/ash.c
@@ -9103,7 +9103,7 @@ tryexec(IF_FEATURE_SH_STANDALONE(int applet_no,) const char *cmd, char **argv, c
 {
 #if ENABLE_FEATURE_SH_STANDALONE
    if (applet_no >= 0) {
-# if ENABLE_PLATFORM_MINGW32
+# if ENABLE_PLATFORM_MINGW32 && 0
        /* Treat all applets as NOEXEC, including the shell itself if
         * this is a FS_SHELLEXEC shell. */
        struct forkshell *fs = (struct forkshell *)sticky_mem_start;
@@ -9124,7 +9124,7 @@ tryexec(IF_FEATURE_SH_STANDALONE(int applet_no,) const char *cmd, char **argv, c
 # else
        if (APPLET_IS_NOEXEC(applet_no)) {
 # endif
-#if ENABLE_PLATFORM_MINGW32 && !defined(_UCRT)
+#if ENABLE_PLATFORM_MINGW32 && !defined(_WHATEVER)
            /* If building for UCRT move this up into shellexec() to
             * work around a bug. */
            clearenv();
@@ -9201,7 +9201,7 @@ static void shellexec(char *prog, char **argv, const char *path, int idx)
    int applet_no = -1; /* used only by FEATURE_SH_STANDALONE */

    envp = listvars(VEXPORT, VUNSET, /*strlist:*/ NULL, /*end:*/ NULL);
-#if ENABLE_PLATFORM_MINGW32 && defined(_UCRT)
+#if ENABLE_PLATFORM_MINGW32 && defined(_WHATEVER)
    /* Avoid UCRT bug by updating parent's environment and passing a
     * NULL environment pointer to execve(). */
    clearenv();

To make the call to popen(3) I used this Makefile:

sum != sum Makefile

target:
    @echo sum is $(sum)

The shell macro assignment (!=) uses popen(3). Using a UCRT-compiled version of busybox-w32 make from current git master gives:

~ $ make
sum is 22355 1

The UCRT version without the workaround gives:

~ $ make
sum is

So, the workaround is still necessary. Without it GNU make still triggers the problem previously reported.

avih commented 1 year ago

Using a UCRT-compiled version of busybox-w32 make

It's good to know the workaround is still required, and apparently in _wspawnve as well, judging by the links below.

Reference:

Seems to be an investigation of the busybox-w32 case: https://github.com/xpack-dev-tools/ms-spawn-issue
In glib: https://gitlab.gnome.org/GNOME/glib/-/merge_requests/3289
Some MS community page: https://developercommunity.visualstudio.com/t/UCRT-Crash-in-_wspawne-functions/10262748

avih commented 11 months ago

Here are some more thoughts (i'm not abandoning the native utf8, but still).

Mingw-w64 recently changed their default to ucrt (at git master, probably for v12, if there wouldn't be too much backlash), and while w64devkit doesn't currently have plans to switch to ucrt, I'm guessing the defaults would eventually trickle down to most mingw setups.

ucrt applications can run on XP. It does require installing the ucrt runtime (the is/was an official runtime for XP), but then it works, even when compiling with llvm 17 (with some XP compiler and linker options - if the mingw setup defaults to win 7 or later).

Using a utf8 locale does not actually require the utf8 manifest. It can also be set using setlocale(".UTF8"). Similar to the manifest, this would only succeed on win10 1803+ (the manifest is 1903+ - not the same version), but unlike the manifest, it also runs on XP, and if setlocale failed then no problem - it remains ANSI.

Another difference with runtime setlocale is that argv and environ are already ANSI - unlike with the manifest, so they need to become utf8. Also, apparently, CP_ACP and GetACP() are as if setlocale was never called (but of course, the current locale - which is affected by setlocale - can be determined at runtime).

And, also unlike the manifest, this only works with ucrt. With msvcrt on win 10/11 (and older versions) the setlocale call fails, which is why I mentioned the mingw default change and that ucrt can run on XP as well.

Here's an example of using the runtime setlocale approach: https://github.com/gwsw/less/commit/ad440b6cea6a8ac2c5c26011fe4fe710791b459f

It's relatively little code. More than the manifest (which basically requires no code changes), but not too much. I think it can trivially work in busybox-w32 as well.

A manifest can still be added, which would bypass the initial conversion of argv and environ, but I'm guessing that the goal of this is to support XP up to latest with the same binary, and the manifest breaks XP.

And, we might need to reconsider our position about the same binary being unicode on one system and ansi on another.

Thoughts?

avih commented 7 months ago

Using a utf8 locale does not actually require the utf8 manifest. It can also be set using setlocale(".UTF8"). Similar to the manifest

So, I have a local patch to use this method as FEATURE_UTF8_UCRT, but it seems that it doesn't work fully, or I couldn't make it work, as I expected.

TL;DR: fopen seems to work and does open UTF-8 names correctly, and the docs mention that _mkdir and _getcwd do become UTF-8, and (with this busybox build) the spawn* family does invoke apps with correct unicode args (i.e. seemingly argv is converted correctly to UTF8 by new code in this patch, and spawn uses this UTF8 without further conversion to wide chars), but as far as I can tell Find{First,Next}FileA seem to return question marks instead of non-ASCII codepoints.

So in busybox sh, cat unicode-path, cd unicode-path, pwd, non-applet.exe unicode-arg seem to work OK, but ls doesn't list unicode names correctly.

I've reproduced the fopen and FindFirstFileA cases in a test program, where with setlocale only the former becomes UTF8, while with the UTF8 manifest both seem to become UTF8.

I don't know whether that's a bug of the UTF8 locale or a designed behavior - I don't think there's docs for this locale other than the linked page, and I don't quite get where the line goes between the manifest and the setlocale methods (other than argv, environ, etc).

avih commented 7 months ago

and I don't quite get where the line goes between the manifest and the setlocale methods (other than argv, environ, etc).

Or maybe I do. I'm guessing the setlocale method only applies to msvcrt/ucrt API, like fopen, but indeed not like FindFirstFileA.

Basically, roughly the libc API, which I guess kind of makes sense, but is not very helpful if one needs total UTF8 API, including the ANSI win32 APIs...

ale5000-git commented 3 weeks ago

@rmyorston @avih Wouldn't it be possible to allow to user to enable/disable the interactive UTF-8 editing at runtime using a special BB_* var?

avih commented 3 weeks ago

Please stop it. The answer is no. Just replace the binary instead.

rmyorston / busybox-w32

[discussion] Native unicode support #352