Error when using cpphs in some locale environments

asr commented 8 years ago

Some Agda users have reported an error when installing Agda in their locale environments.

An MWE (adapted from this example) is the following:

$ cat Test.hs
module Main where

main = putStrLn "∀"

$ LC_CTYPE=C cpphs Test.hs > /dev/null
cpphs: Test.hs: hGetContents: invalid argument (invalid byte sequence)

@nad wrote here:

I guess that cpphs uses the standard, locale-aware methods to read files. I think all of our source files use the UTF-8 character encoding, so the problem can perhaps be solved by setting LC_CTYPE to .UTF-8 before invoking cpphs, for some locale .UTF-8 that is installed. However, I would not be surprised if it is impossible to do this in a system-independent way. Perhaps it would be better to add a --utf8 flag to cpphs.
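For illustration, here is a minimal sketch of what reading a file as UTF-8 independently of the locale could look like in Haskell. The helper name `readFileUtf8` is hypothetical and is not part of cpphs; the sketch only assumes the `System.IO` functions `withFile`, `hSetEncoding` and `utf8` from the base library.

```haskell
import Control.Exception (evaluate)
import System.IO

-- | Read a whole file as UTF-8, ignoring the locale-derived default.
-- Hypothetical helper for illustration; not taken from cpphs.
readFileUtf8 :: FilePath -> IO String
readFileUtf8 path = withFile path ReadMode $ \h -> do
  hSetEncoding h utf8        -- override the encoding chosen from the locale
  s <- hGetContents h        -- lazy read ...
  _ <- evaluate (length s)   -- ... so force it before withFile closes the handle
  return s
```

With a helper like this, `LC_CTYPE=C` should no longer affect how the file contents are decoded, although invalid UTF-8 input would still raise an error.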

Blocking https://github.com/agda/agda/issues/2112.

malcolmwallace commented 8 years ago

I can't seem to reproduce the issue with the given steps. cpphs uses the standard Haskell/ghc System.IO.openFile, which I think trusts the underlying filesystem's metadata about the file's encoding? Certainly, setting LC_CTYPE does not seem to change its behaviour.

$ LC_CTYPE=C ./cpphs Test.hs 
#line 1 "Test.hs"
module Main where

main = putStrLn "∀"

asr commented 8 years ago

Using the file command for determining the file type I got

$ file Test.hs
Test.hs: UTF-8 Unicode text

What do you get?

malcolmwallace commented 8 years ago

The same.

asr commented 8 years ago

It seems you don't have the C locale installed. What is the output of running

$ locale -a

?

malcolmwallace commented 8 years ago

$ locale -a
af_ZA af_ZA.ISO8859-1 af_ZA.ISO8859-15 af_ZA.UTF-8 am_ET am_ET.UTF-8 be_BY be_BY.CP1131 be_BY.CP1251 be_BY.ISO8859-5 be_BY.UTF-8 bg_BG bg_BG.CP1251 bg_BG.UTF-8 ca_ES ca_ES.ISO8859-1 ca_ES.ISO8859-15 ca_ES.UTF-8 cs_CZ cs_CZ.ISO8859-2 cs_CZ.UTF-8 da_DK da_DK.ISO8859-1 da_DK.ISO8859-15 da_DK.UTF-8 de_AT de_AT.ISO8859-1 de_AT.ISO8859-15 de_AT.UTF-8 de_CH de_CH.ISO8859-1 de_CH.ISO8859-15 de_CH.UTF-8 de_DE de_DE.ISO8859-1 de_DE.ISO8859-15 de_DE.UTF-8 el_GR el_GR.ISO8859-7 el_GR.UTF-8 en_AU en_AU.ISO8859-1 en_AU.ISO8859-15 en_AU.US-ASCII en_AU.UTF-8 en_CA en_CA.ISO8859-1 en_CA.ISO8859-15 en_CA.US-ASCII en_CA.UTF-8 en_GB en_GB.ISO8859-1 en_GB.ISO8859-15 en_GB.US-ASCII en_GB.UTF-8 en_IE en_IE.UTF-8 en_NZ en_NZ.ISO8859-1 en_NZ.ISO8859-15 en_NZ.US-ASCII en_NZ.UTF-8 en_US en_US.ISO8859-1 en_US.ISO8859-15 en_US.US-ASCII en_US.UTF-8 es_ES es_ES.ISO8859-1 es_ES.ISO8859-15 es_ES.UTF-8 et_EE et_EE.ISO8859-15 et_EE.UTF-8 eu_ES eu_ES.ISO8859-1 eu_ES.ISO8859-15 eu_ES.UTF-8 fi_FI fi_FI.ISO8859-1 fi_FI.ISO8859-15 fi_FI.UTF-8 fr_BE fr_BE.ISO8859-1 fr_BE.ISO8859-15 fr_BE.UTF-8 fr_CA fr_CA.ISO8859-1 fr_CA.ISO8859-15 fr_CA.UTF-8 fr_CH fr_CH.ISO8859-1 fr_CH.ISO8859-15 fr_CH.UTF-8 fr_FR fr_FR.ISO8859-1 fr_FR.ISO8859-15 fr_FR.UTF-8 he_IL he_IL.UTF-8 hi_IN.ISCII-DEV hr_HR hr_HR.ISO8859-2 hr_HR.UTF-8 hu_HU hu_HU.ISO8859-2 hu_HU.UTF-8 hy_AM hy_AM.ARMSCII-8 hy_AM.UTF-8 is_IS is_IS.ISO8859-1 is_IS.ISO8859-15 is_IS.UTF-8 it_CH it_CH.ISO8859-1 it_CH.ISO8859-15 it_CH.UTF-8 it_IT it_IT.ISO8859-1 it_IT.ISO8859-15 it_IT.UTF-8 ja_JP ja_JP.SJIS ja_JP.UTF-8 ja_JP.eucJP kk_KZ kk_KZ.PT154 kk_KZ.UTF-8 ko_KR ko_KR.CP949 ko_KR.UTF-8 ko_KR.eucKR lt_LT lt_LT.ISO8859-13 lt_LT.ISO8859-4 lt_LT.UTF-8 nl_BE nl_BE.ISO8859-1 nl_BE.ISO8859-15 nl_BE.UTF-8 nl_NL nl_NL.ISO8859-1 nl_NL.ISO8859-15 nl_NL.UTF-8 no_NO no_NO.ISO8859-1 no_NO.ISO8859-15 no_NO.UTF-8 pl_PL pl_PL.ISO8859-2 pl_PL.UTF-8 pt_BR pt_BR.ISO8859-1 pt_BR.UTF-8 pt_PT pt_PT.ISO8859-1 pt_PT.ISO8859-15 pt_PT.UTF-8 ro_RO ro_RO.ISO8859-2 ro_RO.UTF-8 ru_RU
ru_RU.CP1251 ru_RU.CP866 ru_RU.ISO8859-5 ru_RU.KOI8-R ru_RU.UTF-8 sk_SK sk_SK.ISO8859-2 sk_SK.UTF-8 sl_SI sl_SI.ISO8859-2 sl_SI.UTF-8 sr_YU sr_YU.ISO8859-2 sr_YU.ISO8859-5 sr_YU.UTF-8 sv_SE sv_SE.ISO8859-1 sv_SE.ISO8859-15 sv_SE.UTF-8 tr_TR tr_TR.ISO8859-9 tr_TR.UTF-8 uk_UA uk_UA.ISO8859-5 uk_UA.KOI8-U uk_UA.UTF-8 zh_CN zh_CN.GB18030 zh_CN.GB2312 zh_CN.GBK zh_CN.UTF-8 zh_CN.eucCN zh_HK zh_HK.Big5HKSCS zh_HK.UTF-8 zh_TW zh_TW.Big5 zh_TW.UTF-8 C POSIX

malcolmwallace commented 8 years ago

I don't know whether the version of ghc might be relevant, but in case it is, I'm compiling cpphs with ghc-7.6.1

asr commented 8 years ago

You have the C locale installed. I could reproduce the issue compiling cpphs with GHC 7.6.3. What shell are you using? I'm using

$ echo $SHELL
/bin/bash

nad commented 8 years ago

cpphs uses the standard Haskell/ghc System.IO.openFile, which I think trusts the underlying filesystem's metadata about the file's encoding?

I think recent versions of GHC by default use the locale (or code page) to decide what encoding to use.
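One way to check which encoding GHC actually picked is to ask the runtime directly. The following is a small sketch, assuming only `localeEncoding` and `hGetEncoding` from `System.IO`; the function name `reportEncodings` is made up for this example.

```haskell
import System.IO

-- | Return the name of the encoding GHC derived from the environment at
-- start-up, together with the encoding attached to a freshly opened
-- text-mode handle for the given file.
-- Hypothetical diagnostic helper; not part of cpphs or GHC.
reportEncodings :: FilePath -> IO (String, Maybe String)
reportEncodings path = withFile path ReadMode $ \h -> do
  enc <- hGetEncoding h                 -- Nothing would mean a binary handle
  return (show localeEncoding, fmap show enc)
```

Running this under different `LC_CTYPE` or `LC_ALL` settings on different platforms may help to show whether the locale is being consulted at all.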

nad commented 8 years ago

A simple (system-dependent) test:

$ echo -e '\u2200' > test
$ cat test
∀
$ file test
test: UTF-8 Unicode text
$ ghc -e 'putStr =<< readFile "test"'
∀
$ LC_CTYPE=C ghc -e 'putStr =<< readFile "test"'
<interactive>: test: hGetContents: invalid argument (invalid byte sequence)

nad commented 8 years ago

Certainly, setting LC_CTYPE does not seem to change its behaviour.

Perhaps you've set LC_ALL, which overrides LC_CTYPE.
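The precedence in question can be sketched as follows. This assumes the usual POSIX rule (a non-empty `LC_ALL` wins, then `LC_CTYPE`, then `LANG`, with "C" used here as the fallback when none is set). The names `effectiveCtype` and `currentCtype` are hypothetical.

```haskell
import Control.Applicative ((<|>))
import Data.Maybe (fromMaybe)
import System.Environment (lookupEnv)

-- | Resolve the locale name that applies to the LC_CTYPE category from the
-- values of LC_ALL, LC_CTYPE and LANG (in that order of precedence).
-- Unset and empty variables are both treated as "not set".
effectiveCtype :: Maybe String -> Maybe String -> Maybe String -> String
effectiveCtype lcAll lcCtype lang =
  fromMaybe "C" (set lcAll <|> set lcCtype <|> set lang)
  where
    set (Just s) | not (null s) = Just s
    set _                       = Nothing

-- | The same resolution applied to the current process environment.
currentCtype :: IO String
currentCtype =
  effectiveCtype <$> lookupEnv "LC_ALL" <*> lookupEnv "LC_CTYPE" <*> lookupEnv "LANG"
```

So if `LC_ALL` is already exported in the shell, prefixing a command with `LC_CTYPE=C` would have no effect on the resolved locale.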

malcolmwallace commented 8 years ago

$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 7.8.4
$ cat test
∀
$ file test
test: UTF-8 Unicode text
$ ghc -e 'putStr =<< readFile "test"'
∀
$ LC_CTYPE=C ghc -e 'putStr =<< readFile "test"'
∀
$ LC_ALL=C ghc -e 'putStr =<< readFile "test"'
∀

malcolmwallace commented 8 years ago

I think I can close this issue, since it appears that neither cpphs nor ghc is at fault.

asr commented 8 years ago

Which operating system and shell are you using?

asr commented 8 years ago

Can you reproduce the issue by running

$ export LC_ALL=C
$ cpphs test

?

malcolmwallace commented 8 years ago

ghc-7.6.1 on MacOSX 10.7.5, with bash. ghc-7.8.3 on Windows 7 Professional SP1, with bash.

malcolmwallace commented 8 years ago

Cannot reproduce the issue, even with LC_ALL=C.

asr commented 8 years ago

Did you mean export LC_ALL=C?

asr commented 8 years ago

What is the output of

$ locale
$ LC_ALL=C locale

?

malcolmwallace commented 8 years ago

$ locale # MacOSX
LANG="en_GB.UTF-8"
LC_COLLATE="en_GB.UTF-8"
LC_CTYPE="en_GB.UTF-8"
LC_MESSAGES="en_GB.UTF-8"
LC_MONETARY="en_GB.UTF-8"
LC_NUMERIC="en_GB.UTF-8"
LC_TIME="en_GB.UTF-8"
LC_ALL=
$ LC_ALL=C locale
LANG="en_GB.UTF-8"
LC_COLLATE="C"
LC_CTYPE="C"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL="C"

The result is similar on Windows 7, except that the default is en_US.UTF-8 rather than en_GB.UTF-8.

nad commented 8 years ago

ghc-7.6.1 on MacOSX 10.7.5, with bash. ghc-7.8.3 on Windows 7 Professional SP1, with bash.

I just discussed this issue with a Mac user, and it seems as if the System.IO functions by default always use UTF-8 under MacOS, while the locale is ignored.

Under Windows I guess that one can use chcp to trigger the problem. Perhaps chcp 1252 would work.

GHC has used UTF-8 as the character encoding for source files since version 6.6 (which was released in 2006), so perhaps cpphs could also use this as the default. Note, however, that the GHC documentation states that "invalid UTF-8 sequences [are] ignored in comments, so it is possible to use other encodings such as Latin-1, as long as the non-comment source code is ASCII only".

I've attached a patch that switches to UTF-8 everywhere (?) in cpphs, with two caveats:

The command-line arguments are treated as before.
The encoding of stderr is only changed in the top-level module. If cpphs is intended to be used as a library, and error messages can contain non-ASCII characters, then the encoding of stderr should perhaps be changed in the applicable library modules.

I've used the base library's support for roundtripping to handle illegal characters. Feel free to base any changes on this patch.
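As a rough illustration of the roundtripping idea (this is not the attached patch), the base library lets you request an encoding with the `//ROUNDTRIP` suffix via `mkTextEncoding`. Bytes that are not valid UTF-8 are then decoded to special escape characters and turned back into the original bytes on output. The function name `copyRoundtrip` is hypothetical.

```haskell
import Control.Exception (evaluate)
import System.IO

-- | Copy a file through String using UTF-8 with roundtripping, so that
-- byte sequences that are not valid UTF-8 survive the decode/encode cycle.
-- Hypothetical sketch; not the code from the patch.
copyRoundtrip :: FilePath -> FilePath -> IO ()
copyRoundtrip src dst = do
  enc <- mkTextEncoding "UTF-8//ROUNDTRIP"
  contents <- withFile src ReadMode $ \h -> do
    hSetEncoding h enc
    s <- hGetContents h
    _ <- evaluate (length s)   -- force the lazy read before the handle closes
    return s
  withFile dst WriteMode $ \h -> do
    hSetEncoding h enc
    hPutStr h contents
```

This fits the Latin-1-in-comments case mentioned above: the preprocessor does not need to interpret such bytes, only to pass them through unchanged.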

asr commented 8 years ago

FYI, I reported the different behaviour on Linux and Mac OS here.

malcolmwallace commented 8 years ago

Thanks for the patch Nils. I rolled something slightly different, to ensure that e.g. #included files also get the UTF8 encoding. I was not previously aware of the roundtripping style of TextEncoding, so that was a useful addition for me.
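For readers wondering how one might cover every file a program opens, including ones opened indirectly, a possible approach (not necessarily the one used in cpphs) is to change the process-wide default with `setLocaleEncoding` from `GHC.IO.Encoding`. The name `useUtf8Everywhere` is made up for this sketch.

```haskell
import GHC.IO.Encoding (setLocaleEncoding)
import System.IO (hSetEncoding, stderr, stdout, utf8)

-- | Make UTF-8 the default encoding for text handles opened from now on,
-- and set it explicitly on stdout and stderr as well, in case those
-- handles have already been initialised with another encoding.
-- Hypothetical helper; an assumption about one possible design.
useUtf8Everywhere :: IO ()
useUtf8Everywhere = do
  setLocaleEncoding utf8
  hSetEncoding stdout utf8
  hSetEncoding stderr utf8
```

Calling this once at the start of `main` would mean later `readFile` and `openFile` calls decode as UTF-8 without each call site having to set the encoding itself.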

asr commented 8 years ago

Thanks for fixing the issue (tested on Agda). Could you release a new version, please?

malcolmwallace commented 8 years ago

cpphs-1.20.2 released.

malcolmwallace / cpphs

Error when using cpphs in some locale environments #6