Closed asr closed 8 years ago
I can't seem to reproduce the issue with the given steps. cpphs uses the standard Haskell/ghc System.IO.openFile, which I think trusts the underlying filesystem's metadata about the file's encoding? Certainly, setting LC_CTYPE does not seem to change its behaviour.
$ LC_CTYPE=C ./cpphs Test.hs
#line 1 "Test.hs"
module Main where
main = putStrLn "∀"
Using the file command to determine the file type, I got
$ file Test.hs
Test.hs: UTF-8 Unicode text
What do you get?
The same.
It seems you do not have the C locale installed. What is the output of running
$ locale -a
?
$ locale -a af_ZA af_ZA.ISO8859-1 af_ZA.ISO8859-15 af_ZA.UTF-8 am_ET am_ET.UTF-8 be_BY be_BY.CP1131 be_BY.CP1251 be_BY.ISO8859-5 be_BY.UTF-8 bg_BG bg_BG.CP1251 bg_BG.UTF-8 ca_ES ca_ES.ISO8859-1 ca_ES.ISO8859-15 ca_ES.UTF-8 cs_CZ cs_CZ.ISO8859-2 cs_CZ.UTF-8 da_DK da_DK.ISO8859-1 da_DK.ISO8859-15 da_DK.UTF-8 de_AT de_AT.ISO8859-1 de_AT.ISO8859-15 de_AT.UTF-8 de_CH de_CH.ISO8859-1 de_CH.ISO8859-15 de_CH.UTF-8 de_DE de_DE.ISO8859-1 de_DE.ISO8859-15 de_DE.UTF-8 el_GR el_GR.ISO8859-7 el_GR.UTF-8 en_AU en_AU.ISO8859-1 en_AU.ISO8859-15 en_AU.US-ASCII en_AU.UTF-8 en_CA en_CA.ISO8859-1 en_CA.ISO8859-15 en_CA.US-ASCII en_CA.UTF-8 en_GB en_GB.ISO8859-1 en_GB.ISO8859-15 en_GB.US-ASCII en_GB.UTF-8 en_IE en_IE.UTF-8 en_NZ en_NZ.ISO8859-1 en_NZ.ISO8859-15 en_NZ.US-ASCII en_NZ.UTF-8 en_US en_US.ISO8859-1 en_US.ISO8859-15 en_US.US-ASCII en_US.UTF-8 es_ES es_ES.ISO8859-1 es_ES.ISO8859-15 es_ES.UTF-8 et_EE et_EE.ISO8859-15 et_EE.UTF-8 eu_ES eu_ES.ISO8859-1 eu_ES.ISO8859-15 eu_ES.UTF-8 fi_FI fi_FI.ISO8859-1 fi_FI.ISO8859-15 fi_FI.UTF-8 fr_BE fr_BE.ISO8859-1 fr_BE.ISO8859-15 fr_BE.UTF-8 fr_CA fr_CA.ISO8859-1 fr_CA.ISO8859-15 fr_CA.UTF-8 fr_CH fr_CH.ISO8859-1 fr_CH.ISO8859-15 fr_CH.UTF-8 fr_FR fr_FR.ISO8859-1 fr_FR.ISO8859-15 fr_FR.UTF-8 he_IL he_IL.UTF-8 hi_IN.ISCII-DEV hr_HR hr_HR.ISO8859-2 hr_HR.UTF-8 hu_HU hu_HU.ISO8859-2 hu_HU.UTF-8 hy_AM hy_AM.ARMSCII-8 hy_AM.UTF-8 is_IS is_IS.ISO8859-1 is_IS.ISO8859-15 is_IS.UTF-8 it_CH it_CH.ISO8859-1 it_CH.ISO8859-15 it_CH.UTF-8 it_IT it_IT.ISO8859-1 it_IT.ISO8859-15 it_IT.UTF-8 ja_JP ja_JP.SJIS ja_JP.UTF-8 ja_JP.eucJP kk_KZ kk_KZ.PT154 kk_KZ.UTF-8 ko_KR ko_KR.CP949 ko_KR.UTF-8 ko_KR.eucKR lt_LT lt_LT.ISO8859-13 lt_LT.ISO8859-4 lt_LT.UTF-8 nl_BE nl_BE.ISO8859-1 nl_BE.ISO8859-15 nl_BE.UTF-8 nl_NL nl_NL.ISO8859-1 nl_NL.ISO8859-15 nl_NL.UTF-8 no_NO no_NO.ISO8859-1 no_NO.ISO8859-15 no_NO.UTF-8 pl_PL pl_PL.ISO8859-2 pl_PL.UTF-8 pt_BR pt_BR.ISO8859-1 pt_BR.UTF-8 pt_PT pt_PT.ISO8859-1 pt_PT.ISO8859-15 pt_PT.UTF-8 ro_RO ro_RO.ISO8859-2 ro_RO.UTF-8 ru_RU 
ru_RU.CP1251 ru_RU.CP866 ru_RU.ISO8859-5 ru_RU.KOI8-R ru_RU.UTF-8 sk_SK sk_SK.ISO8859-2 sk_SK.UTF-8 sl_SI sl_SI.ISO8859-2 sl_SI.UTF-8 sr_YU sr_YU.ISO8859-2 sr_YU.ISO8859-5 sr_YU.UTF-8 sv_SE sv_SE.ISO8859-1 sv_SE.ISO8859-15 sv_SE.UTF-8 tr_TR tr_TR.ISO8859-9 tr_TR.UTF-8 uk_UA uk_UA.ISO8859-5 uk_UA.KOI8-U uk_UA.UTF-8 zh_CN zh_CN.GB18030 zh_CN.GB2312 zh_CN.GBK zh_CN.UTF-8 zh_CN.eucCN zh_HK zh_HK.Big5HKSCS zh_HK.UTF-8 zh_TW zh_TW.Big5 zh_TW.UTF-8 C POSIX
I don't know whether the GHC version is relevant, but in case it is, I'm compiling cpphs with ghc-7.6.1.
You have the C locale installed. I could reproduce the issue compiling cpphs with GHC 7.6.3. What shell are you using? I'm using
$ echo $SHELL
/bin/bash
> cpphs uses the standard Haskell/ghc System.IO.openFile, which I think trusts the underlying filesystem's metadata about the file's encoding?
I think recent versions of GHC by default use the locale (or code page) to decide what encoding to use.
A simple (system-dependent) test:
$ echo -e '\u2200' > test
$ cat test
∀
$ file test
test: UTF-8 Unicode text
$ ghc -e 'putStr =<< readFile "test"'
∀
$ LC_CTYPE=C ghc -e 'putStr =<< readFile "test"'
<interactive>: test: hGetContents: invalid argument (invalid byte sequence)
> Certainly, setting LC_CTYPE does not seem to change its behaviour.
Perhaps you've set LC_ALL, which overrides LC_CTYPE.
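To see which encoding GHC has actually derived from the environment on a given machine, something like the following could help. This is only a minimal sketch: `localeEncodingName` is a hypothetical helper name, and it relies on `getLocaleEncoding` from base's `GHC.IO.Encoding` module.

```haskell
import GHC.IO.Encoding (getLocaleEncoding)

-- | Name of the text encoding that GHC derived from the locale
-- (or code page). This is the default for newly opened text handles.
-- The helper name is illustrative, not part of cpphs or base.
localeEncodingName :: IO String
localeEncodingName = fmap show getLocaleEncoding
```

Evaluating `localeEncodingName >>= putStrLn` under different `LC_ALL`/`LC_CTYPE` settings should show whether the locale is being honoured on that platform.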
$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 7.8.4
$ cat test
∀
$ file test
test: UTF-8 Unicode text
$ ghc -e 'putStr =<< readFile "test"'
∀
$ LC_CTYPE=C ghc -e 'putStr =<< readFile "test"'
∀
$ LC_ALL=C ghc -e 'putStr =<< readFile "test"'
∀
I think I can close this issue, since it appears that neither cpphs nor ghc is at fault.
Which operating system and shell are you using?
Can you reproduce the issue by running
$ export LC_ALL=C
$ cpphs test
?
ghc-7.6.1 on MacOSX 10.7.5, with bash. ghc-7.8.3 on Windows 7 Professional SP1, with bash.
Cannot reproduce the issue, even with LC_ALL=C.
Did you mean export LC_ALL=C?
What is the output of
$ locale
$ LC_ALL=C locale
?
$ locale # MacOSX
LANG="en_GB.UTF-8"
LC_COLLATE="en_GB.UTF-8"
LC_CTYPE="en_GB.UTF-8"
LC_MESSAGES="en_GB.UTF-8"
LC_MONETARY="en_GB.UTF-8"
LC_NUMERIC="en_GB.UTF-8"
LC_TIME="en_GB.UTF-8"
LC_ALL=
$ LC_ALL=C locale
LANG="en_GB.UTF-8"
LC_COLLATE="C"
LC_CTYPE="C"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL="C"
The result is similar on Windows 7, except that the default is en_US.UTF-8 rather than en_GB.UTF-8.
> ghc-7.6.1 on MacOSX 10.7.5, with bash. ghc-7.8.3 on Windows 7 Professional SP1, with bash.
I just discussed this issue with a Mac user, and it seems as if the System.IO functions by default always use UTF-8 under MacOS, while the locale is ignored.
Under Windows I guess that one can use chcp to trigger the problem. Perhaps chcp 1252 would work.
GHC has used UTF-8 as the character encoding for source files since version 6.6 (which was released in 2006), so perhaps cpphs could also use this as the default. Note, however, that the GHC documentation states that "invalid UTF-8 sequences [are] ignored in comments, so it is possible to use other encodings such as Latin-1, as long as the non-comment source code is ASCII only".
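For illustration, here are two ways a program might read its input as UTF-8 regardless of the locale. This is a sketch only, not what cpphs does; `readFileUtf8` and `useUtf8ByDefault` are hypothetical names, while `hSetEncoding`, `utf8` and `setLocaleEncoding` come from base.

```haskell
import GHC.IO.Encoding (setLocaleEncoding)
import System.IO

-- | Read one file as UTF-8, ignoring the locale (per-handle override).
-- Hypothetical helper name; the contents are read lazily.
readFileUtf8 :: FilePath -> IO String
readFileUtf8 path = do
  h <- openFile path ReadMode
  hSetEncoding h utf8
  hGetContents h

-- | Make UTF-8 the default encoding for handles opened after this call.
-- Hypothetical helper name; handles that are already open keep their encoding.
useUtf8ByDefault :: IO ()
useUtf8ByDefault = setLocaleEncoding utf8
```

The per-handle variant only touches the files it opens itself, whereas the global variant also covers files opened elsewhere in the program, at the cost of changing behaviour for every later `openFile`/`readFile`. Neither variant ignores invalid UTF-8 sequences the way GHC does in comments; a strict `utf8` decoder fails on them.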
I've attached a patch that switches to UTF-8 everywhere (?) in cpphs, with two caveats:
I've used the base library's support for roundtripping to handle illegal characters. Feel free to base any changes on this patch.
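For readers unfamiliar with roundtripping, a minimal sketch of the idea follows. It is not the attached patch. The function names are hypothetical, and it assumes the `//ROUNDTRIP` suffix accepted by `mkTextEncoding`, which escapes invalid input bytes on decoding and restores them on encoding.

```haskell
import System.IO

-- | Read a file as UTF-8, escaping invalid bytes instead of failing.
-- Hypothetical helper name; the file is read strictly.
readFileRoundtrip :: FilePath -> IO String
readFileRoundtrip path = do
  enc <- mkTextEncoding "UTF-8//ROUNDTRIP"
  h <- openFile path ReadMode
  hSetEncoding h enc
  hSetNewlineMode h noNewlineTranslation
  s <- hGetContents h
  length s `seq` hClose h  -- force the whole file before closing
  return s

-- | Write a string as UTF-8, turning escaped bytes back into the originals.
-- Hypothetical helper name.
writeFileRoundtrip :: FilePath -> String -> IO ()
writeFileRoundtrip path s = do
  enc <- mkTextEncoding "UTF-8//ROUNDTRIP"
  withFile path WriteMode $ \h -> do
    hSetEncoding h enc
    hSetNewlineMode h noNewlineTranslation
    hPutStr h s
```

With this pairing, a file containing, say, a stray Latin-1 byte in a comment should pass through unchanged as long as the escaped characters are not modified in between.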
Thanks for the patch Nils. I rolled something slightly different, to ensure that e.g. #included files also get the UTF8 encoding. I was not previously aware of the roundtripping style of TextEncoding, so that was a useful addition for me.
Thanks for fixing the issue (tested on Agda). Could you release a new version, please?
cpphs-1.20.2 released.
Some Agda users have reported an error when installing Agda in their locale environments.
An MWE (adapted from this example) is the following:
@nad wrote here:
Blocking https://github.com/agda/agda/issues/2112.