haskell / haskeline

A Haskell library for line input in command-line programs.
https://hackage.haskell.org/package/haskeline
BSD 3-Clause "New" or "Revised" License

Support non-BMP characters (or, surrogate pairs) on Windows #125

Closed: minoki closed this 4 years ago

minoki commented 4 years ago

Currently, haskeline does not properly handle surrogate pairs on Windows. This leads to issues like


This PR consists of three parts:

The String must be UTF-16 encoded when calling WriteConsoleW. Otherwise, the program crashes at toEnum here.

Note that my patch does not take WriteConsoleW's buffer limit into account. I tried running

import System.Console.Haskeline
main = runInputT defaultSettings $ outputStrLn $ replicate 20000 '\x1F986'

on my Windows machine (Win10 Pro 1903), and WriteConsoleW seems to have succeeded. If a problem arises on older versions of Windows, the patch may need to be reconsidered.
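To make the UTF-16 requirement concrete, here is a minimal sketch of encoding a Haskell String into UTF-16 code units before handing it to a wide-character API such as WriteConsoleW. The names `encodeCharUtf16` and `encodeUtf16` are hypothetical and chosen for illustration; this is not necessarily how the patch itself is written.

```haskell
import Data.Bits (shiftR, (.&.))
import Data.Char (ord)
import Data.Word (Word16)

-- | Encode one Char as one or two UTF-16 code units.
-- (Hypothetical helper; not taken from the haskeline source.)
encodeCharUtf16 :: Char -> [Word16]
encodeCharUtf16 c
  | n < 0x10000 = [fromIntegral n]                          -- BMP: a single unit
  | otherwise   = [ fromIntegral (0xD800 + (m `shiftR` 10)) -- lead surrogate
                  , fromIntegral (0xDC00 + (m .&. 0x3FF))   -- trail surrogate
                  ]
  where
    n = ord c
    m = n - 0x10000  -- 20-bit offset above the BMP

-- | Encode a whole String; every non-BMP character becomes two units.
encodeUtf16 :: String -> [Word16]
encodeUtf16 = concatMap encodeCharUtf16
```

With this sketch, `encodeUtf16 "\x1F986"` produces the two code units 0xD83E and 0xDD86, both of which fit in a 16-bit wide character.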

Windows sends two input events for a non-BMP character: a lead surrogate followed by a trail surrogate. We therefore need to decode each pair into a single character.
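The decoding step can be sketched as follows. This is an illustration only: it works on a plain list of already-received characters, whereas the real code has to deal with console input events, and the names `isLead`, `isTrail`, and `combineSurrogates` are hypothetical.

```haskell
import Data.Char (chr, ord)

-- | Lead (high) and trail (low) surrogate ranges defined by UTF-16.
isLead, isTrail :: Char -> Bool
isLead  c = c >= '\xD800' && c <= '\xDBFF'
isTrail c = c >= '\xDC00' && c <= '\xDFFF'

-- | Merge each lead/trail pair into the code point it encodes.
-- Unpaired surrogates and ordinary characters are passed through unchanged.
combineSurrogates :: String -> String
combineSurrogates (hi : lo : rest)
  | isLead hi && isTrail lo =
      chr (0x10000 + (ord hi - 0xD800) * 0x400 + (ord lo - 0xDC00))
        : combineSurrogates rest
combineSurrogates (c : rest) = c : combineSurrogates rest
combineSurrogates []         = []
```

Passing unpaired surrogates through unchanged is a design choice of this sketch; an implementation could equally drop them or substitute U+FFFD.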

Since haskeline calls wcwidth on the prompt string, wcwidth must also be able to handle non-BMP characters. Otherwise, a program like

import System.Console.Haskeline

main = runInputT defaultSettings $ do
    _ <- getInputLine "\x1F986"
    return ()

crashes at toEnum here.

The fix is just changing wchar_t/CWchar to int/CInt when interfacing with the C counterpart (haskeline_mk_wcwidth).

Other C functions in h_wcwidth.c (haskeline_mk_wcswidth, haskeline_mk_wcwidth_cjk, haskeline_mk_wcswidth_cjk) are not modified, because they seem to be unused.
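The reason the wchar_t/CWchar to int/CInt change helps can be illustrated with a small sketch. It assumes the crash stems from wchar_t being 16 bits wide on Windows, and uses Word16 as a stand-in for that type so the example behaves the same on any platform. The helper names are hypothetical.

```haskell
import Data.Char (ord)
import Data.Word (Word16)
import Foreign.C.Types (CInt)

-- | Does the code point fit in a 16-bit wide character?
-- Word16 is a stand-in for wchar_t on Windows (an assumption of this sketch).
fitsIn16Bits :: Char -> Bool
fitsIn16Bits c = ord c <= fromIntegral (maxBound :: Word16)

-- | Converting to CInt is safe for every Char, since the largest
-- code point (0x10FFFF) is far below the CInt upper bound.
charToCInt :: Char -> CInt
charToCInt = fromIntegral . ord
```

Here `fitsIn16Bits '\x1F986'` is False, which is the situation where a range-checked conversion such as toEnum fails, while `charToCInt '\x1F986'` simply yields 0x1F986.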

judah commented 4 years ago

Thank you for this fix! I don't have an easily-accessible Windows machine to try it on myself, but the changes all look very reasonable.