tweag / ormolu

A formatter for Haskell source code
https://ormolu-live.tweag.io

Error with source files containing Unicode #38

Closed mrkkrp closed 5 years ago

mrkkrp commented 5 years ago

When I try to feed source files with Unicode symbols I get the infamous "invalid byte sequence" errors. Investigate what causes this and fix it.

Low-priority for now.

mrkkrp commented 5 years ago

I can't reproduce it anymore.

mrkkrp commented 5 years ago

If anyone encounters this, please re-open.

bidigo commented 4 years ago

This still seems to be a problem, just tested with the newest master. To reproduce:

-- main.hs
main :: IO ()
main = putStrLn "ä"

$ ormolu main.hs
ormolu: /home/.../main.hs: hGetContents: invalid argument (invalid byte sequence)

mrkkrp commented 4 years ago

This is always caused by locale, not something in Ormolu.

aspiwack commented 4 years ago

Could we choose to always read files as UTF-8, regardless of the locale?
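
For illustration only, one way a Haskell program can get this effect is to override the process-wide default encoding at startup with setLocaleEncoding from GHC.IO.Encoding (part of base). This is a minimal sketch and not Ormolu's code; the file name utf8-demo.txt and the sample text are made up for the demo.

```haskell
import GHC.IO.Encoding (setLocaleEncoding, utf8)

-- Sample text with a non-ASCII character, like the reports in this thread.
sample :: String
sample = "main = putStrLn \"ä\"\n"

main :: IO ()
main = do
  -- Override the default encoding for handles opened from here on,
  -- whatever LANG / LC_ALL say.
  setLocaleEncoding utf8
  let path = "utf8-demo.txt" -- hypothetical scratch file
  writeFile path sample
  contents <- readFile path
  -- Compare instead of echoing the text, so the demo's own output is ASCII.
  print (contents == sample)
```

The override only applies to handles opened after the call, so it would need to run before any file is read.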

jinwoo commented 4 years ago

Please +1 this GHC ticket: https://gitlab.haskell.org/ghc/ghc/issues/17755 :)

bidigo commented 4 years ago

After some digging around I found this nixpkgs issue.

I can confirm that on my system (Arch Linux, locale set to en_US.UTF-8) using either of the workarounds mentioned here makes ormolu work without problems on files containing Unicode characters.

So it would seem that ormolu works without workarounds in the following two cases:

  1. On NixOS on any kind of source file, or
  2. On other distributions on source files containing only ASCII characters

If that's true, this issue would affect any non-NixOS user developing internationalized software.

I don't know the current state of affairs in GHC regarding which encoding is used when interpreting source files. If GHC has already switched to UTF-8, would it make sense for ormolu to follow the same path?

eamsden commented 3 years ago

I can confirm that this is still an issue; it manifests on our CI server. Note that System.IO.readFile is a thin wrapper around hGetContents, so even though there is no explicit call to hGetContents in the codebase, the issue still manifests easily.

The simple solution is a wrapper readUtf8Contents that explicitly sets the handle encoding to utf8 before reading. I can make a PR if the maintainers will confirm they would accept this solution.
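
A minimal sketch of what such a wrapper could look like, using only System.IO and Control.Exception from base. Only the name readUtf8Contents comes from the proposal above; the type signature, the strict read, and the writeUtf8Contents helper and file name used by the demo are assumptions for illustration.

```haskell
import Control.Exception (evaluate)
import System.IO

-- | Read a whole file as UTF-8, ignoring the locale encoding.
readUtf8Contents :: FilePath -> IO String
readUtf8Contents path =
  withFile path ReadMode $ \h -> do
    hSetEncoding h utf8
    contents <- hGetContents h
    -- hGetContents is lazy; force the whole string before withFile closes h.
    _ <- evaluate (length contents)
    pure contents

-- | Hypothetical counterpart, used here only to create the demo file.
writeUtf8Contents :: FilePath -> String -> IO ()
writeUtf8Contents path text =
  withFile path WriteMode $ \h -> do
    hSetEncoding h utf8
    hPutStr h text

main :: IO ()
main = do
  let path = "utf8-demo.hs" -- hypothetical scratch file
      sample = "a = \"ℤ\"\n"
  writeUtf8Contents path sample
  contents <- readUtf8Contents path
  -- Compare instead of echoing the text, so the demo's own output is ASCII.
  print (contents == sample)
```

Setting the encoding per handle keeps the change local to the reads that need it and leaves the rest of the program's I/O on the locale default.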

mboes commented 3 years ago

@mrkkrp WDYT?

mrkkrp commented 3 years ago

This is never a problem for me. I think as long as locale is selected correctly (e.g. with LANG env variable) it should work fine. I'm not against a PR that would force UTF-8 though.

arianvp commented 3 years ago

Can this be re-opened? This is definitely still an issue.

mrkkrp commented 3 years ago

In order to reopen this a way to reproduce the problem should be provided.

arianvp commented 3 years ago

-- test.hs
a = "ℤ"

$ LOCALE_ARCHIVE= LC_ALL= ormolu test.hs
ormolu: test.hs: hGetContents: invalid argument (invalid byte sequence)

Edit: Ah, I see your note about forcing UTF-8 now. I completely missed it.

mrkkrp commented 3 years ago

The fact that locale influences how file contents are read is something that affects every program written in Haskell. So if this is so annoying, we should try to change the default behavior upstream rather than patch it in every individual application again and again.

See https://gitlab.haskell.org/ghc/ghc/-/issues/17755, I'd love to see that issue moving forward.