fzf on Windows assumes UTF8 stdin, but it should assume CP_ACP stdin (current system codepage)

junegunn / fzf

:cherry_blossom: A command-line fuzzy finder

https://junegunn.github.io/fzf/

MIT License

65.81k stars 2.41k forks source link

fzf on Windows assumes UTF8 stdin, but it should assume CP_ACP stdin (current system codepage) #4065

Open chrisant996 opened 1 month ago

chrisant996 commented 1 month ago

Checklist

[X] I have read through the manual page (man fzf)
[X] I have searched through the existing issues
[X] For bug reports, I have checked if the bug is reproducible in the latest version of fzf

Output of `fzf --version`

0.55.0 (fc69308)

OS

[ ] Linux
[ ] macOS
[X] Windows
[ ] Etc.

Shell

[ ] bash
[ ] zsh
[ ] fish

Problem / Steps to reproduce

Use any system codepage other than 65001 (which is Beta and is never the default).
Run dir | fzf when any file names are present with non-ASCII characters.

Result = fzf interprets the stdin as UTF8, but the stdin is never UTF8 by default, and so fzf shows garbled text. Expected = fzf interprets the stdin as being in the system codepage by default, like native Windows programs do.

Something in fzf is going out of its way to assume UTF8 stdin on Windows, but that isn't how Windows programs work.

Ideally fzf would assume system codepage by default, and could have something like a --utf8 flag to select assuming UTF8 stdin.

junegunn commented 1 month ago

Thanks for the pointers.

Here's the background.

fzf was originally written for *nix systems only using ncurses.
Windows support was added by @kelleyma49 using the tcell library. We still use it on Windows in full-screen mode (https://github.com/junegunn/fzf/blob/master/src/tui/tcell.go), and I believe this implementation properly handles I/O using native Windows system calls.
fzf added --height option to enable non-full-screen mode. We added "LightRenderer" which reads from and writes to /dev/tty directly. No Windows support for --height at the moment.
@kelleyma49 sent a patch to enable --height option on Windows. From my understanding, it changed the LightRenderer to read from CONIN$ and write to CONOUT$ instead of /dev/tty on Windows. But it's a thin adapter, and as you mentioned, it assumes that the same byte sequences as *nix systems are used.
I wanted to get rid of tcell-based renderer and use LightRenderer even in full-screen mode to simplify the code and for better mouse support on mintty (dca2262fe6f8d7c4a9264464e45d44e684d0e70b)
Then we realized that LightRenderer cannot properly read non-ASCII characters, and had to revert the decision.
@kelleyma49 confirmed when the code page is set to 65001, fzf using LightRenderer (--height) can properly read non-ASCII characters on Windows Terminal (#3951). The patch to set the code page to 65001 has not been merged yet. I'm not quite sure if it's the right way to fix the problem.

Because I'm not a Windows user and have a limited understanding of the system, I don't think I'll be able to fix the problem myself. I'm hoping that someone will contribute.

chrisant996 commented 1 month ago

it changed the LightRenderer to read from CONIN$ and write to CONOUT$ instead of /dev/tty on Windows. But it's a thin adapter, and as you mentioned, it assumes that the same byte sequences as *nix systems are used.

Yup, that'll only work when the console code page is 65001 (UTF8).

But I don't know whether that can handle function keys, arrow keys, and other extended keys, etc. I've never tried doing file IO reads from CONIN$ when reading keyboard input, since Windows hasn't had VT input emulation until very recently (partway thru the Win10 lifecycle), and programs typically want to be backward compatible with older versions of Windows.

The normal approach for reading keyboard input is to use ReadConsoleInputW on CONIN$, which returns a struct with details about a keyboard/mouse/resize/etc event (depending on console mode flags on the conin handle). On Win11 with the right console mode flag set on the console input handle, then I expect using ReadFile (like the patch currently does) would work. If you want to support older versions of Windows (which in my opinion is probably NOT necessary for fzf), then it would be necessary to use ReadConsoleInputW instead -- but I think that's probably not worth implementing.

@kelleyma49 confirmed when the code page is set to 65001, fzf using LightRenderer (--height) can properly read non-ASCII characters on Windows Terminal (#3951). The patch to set the code page to 65001 has not been merged yet. I'm not quite sure if it's the right way to fix the problem.

As long as it's careful to intercept Ctrl-Break and Ctrl-C and restore the original code page, then that should be ok.

Or, there's a chance that maybe each console input handle has separate mode bits (SetConsoleMode). If so, then it might not even be necessary to restore the console mode. I'll run some tests to find out, and reply here with the results.

UPDATE: Yes, any console modes and the console code page must be restored before fzf exits. If fzf isn't already intercepting Ctrl-Break/Ctrl-C and restoring the mode(s) and CP, then let me know and I can share how to hook that up.

Because I'm not a Windows user and have a limited understanding of the system, I don't think I'll be able to fix the problem myself. I'm hoping that someone will contribute.

Understood. I don't know Go at all, but I have decades of deep knowledge of Win32 APIs and system design, and C/C++ experience. So at the very least, I can offer "consultation services".

There are multiple ways to support Unicode keyboard input, so I think that can get solved soon, and probably using SetConsoleCP is sufficient, as long as the original CP is restored before fzf exits.

But I'm even more interested in the stdin encoding issue. It's unrelated to keyboard input, and addressing it should just be a matter of parsing line breaks, and then for each line call MultiByteToWideChar once from CP_ACP, then call WideCharToMultiByte once using CP_UTF8.

I'm willing to try to learn Go enough to at least write some pseudocode, or maybe even get it compiling and testable. Can someone share a code pointer to where stdin is read?

Windows console (text mode) programs write 8-bit output encoded in ACP (GetACP()). It's problematic to require callers to externally pipe stdout into some conversation tool, and pipe that into fzf stdin -- partly because Windows doesn't have such a tool (there's no built-in analog to iconv, though various third party tools exist). It would be very beneficial if fzf could (on Windows) default to expecting stdin encoded as ACP, and internally convert it to UTF8. And since some programs have a special flag such as --utf8 to force output as UTF8, it would be great for fzf to have a flag --utf8 that reverts to the current behavior where it assumes all input is UTF8. Ideally fzf would do the same conversion in reverse on its stdout as well, also subject to the --utf8 flag.

chrisant996 commented 1 month ago

@junegunn just to be clear -- I opened this issue about stdin encoding, not about keyboard input encoding. They're separate things on Windows.

chrisant996 commented 1 month ago

Also, a caveat ... I said stdin/stdout use GetACP() encoding on Windows.

That's an oversimplification and is a bit inaccurate.

In reality console mode apps should emit stdout using the GetConsoleOutputCP() encoding. But since GUI apps must use GetACP() instead of GetConsoleOutputCP(), some console mode apps mistakenly use GetACP().

The OEM codepages are very similar to normal codepages, but have some different legacy characters. E.g. 1252 is the English codepage but 437 is the English OEM codepage (I'm oversimplifying, but GetConsoleOutputCP() reports the actual codepage, whatever it is).

GetConsoleOutputCP() should be the right general default, especially since cmd.exe's dir command uses GetConsoleOutputCP(), not GetACP(). And the dir command is the obvious choice when collecting filenames for completion on Windows (since dir is built into the OS cmd.exe command shell, and doesn't require any additional programs to be installed).

So...

The default should be to use GetConsoleOutputCP() as the encoding for both reading from stdin, and writing to stdout.
Ideally some --utf8 flag would override the default and use CP_UTF8 instead.
It might also be useful to have a way to manually override the codepage (esp. for using GetACP() and/or CP_ACP in the MultiByteToWideChar and WideCharToMultiByte functions).

Does that sound like a reasonable design?

skarasov commented 1 month ago

Win 10, pwsh 7.4.6, fzf 0.56.0

As workaround use PowerShell 😊 First pic - beta utf8 support turned off. Second pic - turned on.

chrisant996 commented 4 weeks ago

Win 10, pwsh 7.4.6, fzf 0.56.0

As workaround use PowerShell

Powershell does not by default use UTF8 as the IO encoding.

You must have configured your Powershell to set $OutputEncoding to UTF8.

Yes, there are workarounds. But it's reasonable for me to help update fzf to support encodings the same way native Windows programs do. Then things will just work, without needing to find/use workarounds (which may be different for different programs).

Default encoding in Powershell is not UTF8:

Yes it's possible to force Powershell to use UTF8 (or some other encoding).

skarasov commented 4 weeks ago

In PowerShell 7.4.6 (not Windows PowerShell 😒) $OutputEncoding defaults to utf-8.

And, imho, stop feeding Microsoft legacy.

chrisant996 commented 4 weeks ago

In PowerShell 7.4.6 (not Windows PowerShell 😒) $OutputEncoding defaults to utf-8.

I thought Powershell 7.4.6 is not installed by default, right? So doesn't the suggested workaround require installing an additional program?

And, imho, stop feeding Microsoft legacy.

That doesn't feel very empathetic to people who use Windows. I'm trying to support users and get things working automatically. It doesn't come off well to suggest that they should just quit using Microsoft products, or do a bunch of workarounds (which are different for each program that pipes output into fzf).

Why can't we make it Just Work?

skarasov commented 4 weeks ago

I thought Powershell 7.4.6 is not installed by default, right?

fzf is not installed by default, right?

very empathetic to people ... support users ... doesn't come off well

Not a word about users from my side. Don't misinterpret. Intention - stop feeding legacy - sooner UTF-8 will be default in Win.

Why can't we make it Just Work?

Because of this we more than 15 years struggling with different codepages rather one.

And just chcp 65001 before dir | fzf.

chrisant996 commented 4 weeks ago

I thought Powershell 7.4.6 is not installed by default, right?

fzf is not installed by default, right?

I have users of Clink who install fzf. And then they're surprised that Fzf doesn't work with non-ASCII content by default, they have to either install extra programs or do a bunch of research and workarounds to be able to use fzf.

Out of the box, fzf doesn't play very well with Windows. I'm offering to junegunn to help make fzf work better out of the box on Windows. Your suggestions are beside the point of the issue. I appreciate them, and may use some of them, but I'd like to keep the discussion focused on the stated issue.

very empathetic to people ... support users ... doesn't come off well

Not a word about users from my side. Don't misinterpret. Intention - stop feeding legacy - sooner UTF-8 will be default in Win.

I understand your viewpoint.

Why can't we make it Just Work?

Because of this we more than 15 years struggling with different codepages rather one.

And just chcp 65001 before dir | fzf.

I understand that you disagree with making any change to fzf, and that changes should be external to fzf for Windows, to try to change how Windows works.

That doesn't mean junegunn and I cannot make fzf work seamlessly on Windows. It only means you prefer not to.

chrisant996 commented 4 weeks ago

If junegunn is not willing, then I'll drop the topic. But I'd like to have the conversation about what it would take to do it. I think it's a pretty easy change to make.

I'll grant that maybe the standalone Powershell using UTF8 by default means fzf should continue to default to expecting UTF8. But adding a flag to accept the system code page (or even arbitrary encodings) is not unreasonable. There are plenty of Linux programs that explicitly have flags to control both input and output encodings. The concept is very compatible with Linux.

junegunn commented 4 weeks ago

Thanks for the detailed comment. I need some time to process this, as the whole issue is quite new to me.

The default should be to use GetConsoleOutputCP() as the encoding for both reading from stdin, and writing to stdout.

I agree. fzf should respect the current setting of the terminal.

Ideally some --utf8 flag would override the default and use CP_UTF8 instead.

It might also be useful to have a way to manually override the codepage (esp. for using GetACP() and/or CP_ACP in the MultiByteToWideChar and WideCharToMultiByte functions).

I'm not sure if I follow you here. When would a user want to use --utf8? Can't they just set chcp 65001 if they want fzf to process UTF-8 entries? Or, if a source program emits a UTF-8 stream, but the current terminal is not set to use UTF-8, I think a separate program can step in, something like iconv.

dir | iconv -f utf-8 -t xxx | fzf

junegunn commented 4 weeks ago

I agree. fzf should respect the current setting of the terminal.

Having said that, I can understand @skarasov's reluctance. We need to assess the amount of code needed to support non-UTF-8 settings and decide if it's worth the effort.

chrisant996 commented 4 weeks ago

I think the reason it hasn't been much of an issue so far is that cmd.exe doesn't support customizable completions, so there isn't really a way to even use fzf with cmd.exe.

But Clink hooks system APIs inside cmd.exe to add bash-like/zsh-like/fish-like prompt customization and command line input enhancements (such as customizable completions).

So, fzf integration is possible for cmd.exe plus Clink. English users are unlikely to notice the UTF8 encoding issue, since the majority of output is ASCII to begin with and UTF8 is a superset of ASCII.

It's mainly the intersection of non-English users + Clink users + fzf users who get hit with the quirky complexity and need for workarounds.

I agree. fzf should respect the current setting of the terminal.

Having said that, I can understand @skarasov's reluctance. We need to assess the amount of code needed to support non-UTF-8 settings and decide if it's worth the effort.

I agree. I expect it to be small code changes. I don't feel it would be reasonable to add larger code changes for it.

The claim is that the cross-platform Powershell defaults to UTF8 stdout on Windows. I haven't verified that, but I'm happy to accept it. And then that implies that fzf would need to continue to assume UTF8 stdin to avoid breaking existing usage of fzf with the cross-platform Powershell (and any workarounds that users have already applied).

So, maybe a --acp flag or --system-code-page flag could apply conversion from the system code page to UTF8. (Well, it's not really "ACP", it's the Console Output CP, but there's no established abbreviation for that, and "acp" is next closest thing. But "acp" could be misleading/confusing. I don't want to get mired in choosing a specific name just yet, though. 😉)

Can't they just set chcp 65001 if they want fzf to process UTF-8 entries? Or, if a source program emits a UTF-8 stream, but the current terminal is not set to use UTF-8, I think a separate program can step in, something like iconv.

chcp 65001 changes the code page for the rest of the terminal session, not just one program in the session (it's tied to the console/terminal session, not to a given shell or cmd.exe session). It affects much more than just the next fzf invocation, and can cause malfunctions or unexpected behavior later on in other console mode programs. It's possible to first use chcp to remember the original codepage, then chcp 65001 to force UTF8, then chcp {original} again. But that's quite cumbersome and requires parsing the chcp output, and an ill-timed Ctrl-C can skip restoring the codepage. And an average user can't do that kind of scripting, much less fit it into an FZF_CTRL_T_COMMAND environment variable. Also, it's a fragile approach for any program to change a global shared setting, and then restore it, just to accomplish its own task -- it's inherently possible to introduce unwanted side effects.

Windows doesn't have an iconv tool built-in. My goal is for using fzf on Windows to not depend on installing additional programs, and to not depend on stringing together chcp-style workarounds, especially in cases where it's important to follow up with another chcp to restore the original codepage. It makes fzf stand out from other programs as being complicated and quirky on Windows.

I'm working on coming up with some kind of reliable workaround by default in the Clink fzf integration script, but the need to save/restore the original chcp codepage is making it rather complicated and fragile.

When would a user want to use --utf8?

The --utf8 suggestion (or other flags to select specific encodings) was because some programs have flags to force UTF8 output (or even UTF16 output, but let's not get into that) to be able to represent characters outside the current system codepage. And there is a small number of programs, like perhaps the cross-platform version of Powershell, which default to UTF8 output and have flags/configuration to choose a different output encoding.

Something like fzf --acp or fzf --system-code-page is a least-impact approach, and doesn't involve any changes to fzf's default behavior.

chrisant996 commented 4 weeks ago

I was able to make the Clink fzf integration Lua script apply a workaround that automatically applies to whatever customizations the user has made. So that users hopefully don't need to trouble themselves with how to construct safe/reliable custom workaround scripts.

It first spawns chcp to get the current codepage.
Then it spawns chcp 65001 to force UTF8. This can have side effects on user customizations -- depending on the customization the side effects could be beneficial or harmful, but hopefully they're often beneficial.
Then it spawns the configured command per FZF_CTRL_T_COMMAND or FZF_ALT_C_COMMAND and pipes the output into fzf. And while fzf is running, any configured commands (such as for previews) end up running with chcp set to 65001 as well. Which may or may not always be beneficial, but hopefully it mostly works out ok.
Once fzf exits, then it spawns chcp original_code_page to restore the codepage.

The chances of failing to restore the codepage are low, so the workaround should generally be reasonably reliable/safe.

So, arguably, this means fzf doesn't need any changes to play well on Windows, at least not when using the combination of cmd.exe + fzf + Clink + the clink-fzf integration script. It's not a perfect workaround, but it should be mostly sufficient.

Konfekt commented 3 weeks ago

Just wondering if instead of Fzf changing the codepage (for good!?) the Powershell integration script could not, like you Clink integration script, temporarily set and (reliably) restore the code page for legacy Powershell (< 6 !?) as well?

I also wonder about the harm that chcp 65001 in cmd could possibly do. Anecdotally, it made non-ASCII function names in Excel VBA stop working (and this took a while to find out, so it's not a system-wide default for a reason), but restricted to a cmd instance the blast radius could be rather reduced.

chrisant996 commented 3 weeks ago

Just wondering if instead of Fzf changing the codepage (for good!?) the Powershell integration script could not, like you Clink integration script, temporarily set and (reliably) restore the code page for legacy Powershell (< 6 !?) as well?

Save/set/restore is definitely (much) safer than simply setting it and leaving it set. But it's still a hacky workaround with a wider blast radius than it might appear at first glance. See next...

I also wonder about the harm that chcp 65001 in cmd could possibly do. Anecdotally, it made non-ASCII function names in Excel VBA stop working (and this took a while to find out, so it's not a system-wide default for a reason), but restricted to a cmd instance the blast radius could be rather reduced.

Yes. If it could be restricted to a process, then the issues would be easy to solve.

But chcp and SetConsoleOutputCP affect the console itself, not just one process. They affect all processes currently attached to the console (i.e. running within the terminal window) plus all future processes that ever run in that console (i.e. terminal session) -- until the original cp is restored.

This can have non-obvious side effects. For example, suppose a prompt customization script spawns a background console mode program (as an analogy consider async prompt updating like some zsh prompt customizations do, and which Clink also does). And suppose that while that's running in the background, the user invokes file completion which spawns chcp 65001 & dir | fzf & chcp original_cp (that's pseudo code; the actual syntax is much more complicated). The chcp changes the output codepage for the entire console, including the unrelated background process that was started earlier. So it can break the output from the background process, and break whatever depends on that output, until the cp is restored (or indefinitely, if not restored).

Calling chcp 65001 looks like an appealing workaround at first glance, but it's a hacky workaround and it has race conditions and can cause downstream problems later on, even in completely unrelated programs. It's unlikely a user will even be aware that chcp 65001 is happening, and users will typically not be able to draw a connection between the seemingly random intermittent malfunctions and the fact that they used fzf for filename completion in the shell at some point.

junegunn / fzf

fzf on Windows assumes UTF8 stdin, but it should assume CP_ACP stdin (current system codepage) #4065

Checklist

Output of fzf --version

OS

Shell

Problem / Steps to reproduce

Output of `fzf --version`