smlnj / legacy

This project is the old version of Standard ML of New Jersey that continues to support older systems (e.g., 32-bit machines).
BSD 3-Clause "New" or "Revised" License

Windows installation does not handle Unicode filenames #306

Open Skyb0rg007 opened 3 months ago

Skyb0rg007 commented 3 months ago

Version

110.99.4 (Latest)

Operating System

OS Version

Windows 11 Pro

Processor

System Component

Core system

Severity

Minor

Description

On Windows, SML is unable to open files with Unicode filenames:

# PowerShell
> "hello" | Out-File -LiteralPath "foo`u{d83d}`u{de4f}.txt";
> "hello" | Out-File -LiteralPath "bar`u{03BB}.txt";

(* SML *)
- val dir = OS.FileSys.openDir ".";
- OS.FileSys.readDir dir;
val it = SOME "foo??.txt" : string option
- OS.FileSys.readDir dir;
val it = SOME "bar?.txt" : string option
- TextIO.openIn ("foo" ^ UTF8.encode 0wx1F64F ^ ".txt");
uncaught exception Io [Io: openIn failed on "foo🙏.txt", Win32TextPrimIO.openRd: failed]
raised at: Basis/Implementation/IO/text-io-fn.sml:792.25-792.71
- TextIO.openIn ("foo" ^ UTF8.encode 0wxD83D ^ UTF8.encode 0wxDE4F ^ ".txt");
uncaught exception Io [Io: openIn failed on "foo🙏.txt", Win32TextPrimIO.openRd: failed]
raised at: Basis/Implementation/IO/text-io-fn.sml:792.25-792.71
- TextIO.openIn ("bar" ^ UTF8.encode 0wx03BB ^ ".txt");
uncaught exception Io [Io: openIn failed on "bar╬╗.txt", Win32TextPrimIO.openRd: failed]
raised at: Basis/Implementation/IO/text-io-fn.sml:792.25-792.71

Transcript

See above

Expected Behavior

OS.FileSys.readDir should return the actual names of the files in the directory, not mangled names that refer to no existing file. TextIO.openIn should be able to open every file that exists on the system.

Steps to Reproduce

  1. Install on Windows using the .msi
  2. Run the PowerShell lines from above, or create a file with a Unicode filename some other way (your favorite programming language, or copy-paste).
  3. Use any of the system APIs and try to access that file.

Additional Information

I believe the issue is that the Win32 APIs are not compiled with the UNICODE macro defined. OS.FileSys.readDir is implemented using FindFirstFile, which macro-expands to either the ANSI version (FindFirstFileA) or the Unicode version (FindFirstFileW) depending on whether this macro is defined.

The minwinbase.h header defines WIN32_FIND_DATA as an alias which automatically selects the ANSI or Unicode version of this function based on the definition of the UNICODE preprocessor constant. Mixing usage of the encoding-neutral alias with code that is not encoding-neutral can lead to mismatches that result in compilation or runtime errors. For more information, see Conventions for Function Prototypes.

Email address

skyler DOT soss AT gmail.com

Edit: Added an example that doesn't use UTF-16 surrogates, since this is not a UTF-16-specific issue.

Skyb0rg007 commented 3 months ago

I wrote a PowerShell script here that exhibits the bug. I believe the garbled error messages in the examples above are due to the console encoding; changing the console to UTF-8 fixes those error messages. It does not fix the underlying issue, however.

JohnReppy commented 3 months ago

This is not an easy fix, since we do not currently support wide characters/strings in SML/NJ. As I understand it, the Windows Unicode APIs assume UTF-16 encoding of file names. Perhaps we can convert between UTF-8 strings and UTF-16 in the runtime system.
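
The UTF-8 to UTF-16 conversion suggested above can be sketched in portable C. This is only an illustration under stated assumptions, not SML/NJ runtime code: the names `utf8_decode` and `utf8_to_utf16` are hypothetical, the caller is assumed to supply the output buffer, and rejecting ill-formed input (rather than substituting a replacement character) is a policy chosen for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence from s (n bytes available) into *cp.
 * Returns the number of bytes consumed, or 0 if the input is ill-formed
 * (bad lead or continuation byte, truncated, overlong, surrogate, > U+10FFFF).
 * Hypothetical helper, not part of any existing API. */
static size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    size_t len;
    uint32_t c, min;

    if (n == 0) return 0;
    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    if ((s[0] & 0xE0) == 0xC0)      { len = 2; c = s[0] & 0x1F; min = 0x80; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; c = s[0] & 0x0F; min = 0x800; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; c = s[0] & 0x07; min = 0x10000; }
    else return 0;
    if (n < len) return 0;
    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;
        c = (c << 6) | (uint32_t)(s[i] & 0x3F);
    }
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF)) return 0;
    *cp = c;
    return len;
}

/* Convert src_len bytes of UTF-8 into UTF-16 code units in dst.
 * Returns the number of code units written (no terminator is added),
 * -1 if src is not well-formed UTF-8, or -2 if dst_cap is too small. */
long utf8_to_utf16(const char *src, size_t src_len,
                   uint16_t *dst, size_t dst_cap)
{
    const unsigned char *s = (const unsigned char *)src;
    size_t in = 0, out = 0;

    assert(src != NULL && dst != NULL);
    while (in < src_len) {
        uint32_t cp;
        size_t used = utf8_decode(s + in, src_len - in, &cp);
        if (used == 0) return -1;
        in += used;
        if (cp < 0x10000) {
            /* Basic Multilingual Plane: one code unit. */
            if (out + 1 > dst_cap) return -2;
            dst[out++] = (uint16_t)cp;
        } else {
            /* Supplementary plane: encode as a surrogate pair. */
            if (out + 2 > dst_cap) return -2;
            cp -= 0x10000;
            dst[out++] = (uint16_t)(0xD800 + (cp >> 10));
            dst[out++] = (uint16_t)(0xDC00 + (cp & 0x3FF));
        }
    }
    return (long)out;
}
```

With this sketch, the UTF-8 bytes of "foo" followed by U+1F64F yield the code units `0xD83D 0xDE4F` after "foo", matching the surrogate pair used in the PowerShell reproduction. The second failing SML example, which UTF-8-encodes the two surrogates separately, is rejected as ill-formed.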

Skyb0rg007 commented 3 months ago

I believe the standard method for doing this is the MultiByteToWideChar function. Calling it at API boundaries could be a good fix, though it wouldn't completely fix the issue, for esoteric reasons involving UCS-2 filenames that are not valid UTF-16. That's more of an SML issue than an SML/NJ issue, though.
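
The caveat about filenames that are not valid UTF-16 appears in the reverse direction, when a name returned by a wide API has to become a UTF-8 SML string. The following is a minimal portable C sketch of that step. The name `utf16_to_utf8` and its error codes are hypothetical, and treating an unpaired surrogate as an error is an assumption of this sketch; on Windows the corresponding library call is WideCharToMultiByte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Convert src_len UTF-16 code units into UTF-8 bytes in dst.
 * Returns the number of bytes written (no terminator is added),
 * -1 if src contains an unpaired surrogate (so it is not valid UTF-16),
 * or -2 if dst_cap is too small. Hypothetical helper for illustration. */
long utf16_to_utf8(const uint16_t *src, size_t src_len,
                   unsigned char *dst, size_t dst_cap)
{
    size_t out = 0;

    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < src_len; i++) {
        uint32_t cp = src[i];

        if (cp >= 0xD800 && cp <= 0xDBFF) {
            /* High surrogate: must be followed by a low surrogate. */
            if (i + 1 >= src_len || src[i + 1] < 0xDC00 || src[i + 1] > 0xDFFF)
                return -1;
            cp = 0x10000 + ((cp - 0xD800) << 10) + (uint32_t)(src[i + 1] - 0xDC00);
            i++;
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            /* Low surrogate with no preceding high surrogate. */
            return -1;
        }

        size_t need = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        if (out + need > dst_cap) return -2;

        if (need == 1) {
            dst[out++] = (unsigned char)cp;
        } else if (need == 2) {
            dst[out++] = (unsigned char)(0xC0 | (cp >> 6));
            dst[out++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else if (need == 3) {
            dst[out++] = (unsigned char)(0xE0 | (cp >> 12));
            dst[out++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            dst[out++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else {
            dst[out++] = (unsigned char)(0xF0 | (cp >> 18));
            dst[out++] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            dst[out++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            dst[out++] = (unsigned char)(0x80 | (cp & 0x3F));
        }
    }
    return (long)out;
}
```

A name consisting of the code units `'a', 0xD83D, 'b'` makes this function return -1: there is no well-formed UTF-8 string for it, so a strict conversion at the API boundary cannot hand such a name to SML code unchanged.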