idris-lang / Idris-dev

A Dependently Typed Functional Programming Language
http://idris-lang.org

Build error with Unicode #94

Closed Warbo closed 11 years ago

Warbo commented 12 years ago

When compiling I got as far as type-checking lib/Prelude/Complex.idr then got the error "hGetContents: invalid argument (invalid byte sequence)".

This happened with "cabal install idris" (version 0.9.5.1) and with a clone of commit cea7205585358b3c28c439cfe598e92792f0f2b2

I changed the copyright header in that file from using a non-ASCII character to "(c)" and this made the error go away, allowing me to compile successfully. I don't know enough about Unicode handling in Haskell/Idris to stop this recurring, but I thought I'd raise the issue and my quick hack.

I'm running Debian unstable on an OLPC XO-1 laptop. Here are some possibly relevant numbers:

$ uname -a
Linux olpc 2.6.32-5-486 #1 Fri Dec 10 15:32:53 UTC 2010 i586 GNU/Linux

$ dpkg -l ghc | grep "ii"
ii ghc 7.4.1-4 i386 The Glasgow Haskell Compilation system
ii libghc-ansi-terminal-de 0.5.5-3+b1 i386 Simple ANSI terminal support, with Windows compatibi
ii libghc-ansi-wl-pprint-d 0.6.4-1+b1 i386 Wadler/Leijen Pretty Printer for colored ANSI termin
ii libghc-dlist-dev 0.5-3+b1 i386 Haskell library for Differences lists
ii libghc-hostname-dev 1.0-4+b1 i386 providing a cross-platform means of determining the
ii libghc-mtl-dev 2.1.1-1 i386 Haskell monad transformer library for GHC
ii libghc-quickcheck2-dev 2.4.2-1+b1 i386 Haskell automatic testing library for GHC
ii libghc-random-dev 1.0.1.1-1+b1 i386 Random number generator for Haskell
ii libghc-regex-base-dev 0.93.2-2+b2 i386 GHC library providing an API for regular expressions
ii libghc-regex-posix-dev 0.95.1-2+b1 i386 GHC library of the POSIX regex backend for regex-bas
ii libghc-smallcheck-dev 0.6-1+b1 i386 Another lightweight testing library
ii libghc-syb-dev 0.3.6.1-1 i386 Generic programming library for Haskell
ii libghc-test-framework-d 0.6-1+b1 i386 Framework for running and organising tests
ii libghc-test-framework-q 0.2.12.1-1+b1 i386 QuickCheck2 support for the test-framework package.
ii libghc-text-dev 0.11.2.0-1 i386 efficient packed Unicode text type for Haskell - GHC
ii libghc-transformers-dev 0.3.0.0-1 i386 Haskell monad transformer library
ii libghc-utf8-string-dev 0.3.7-1+b1 i386 GHC libraries for the Haskell UTF-8 library
ii libghc-x11-dev 1.5.0.1-1+b2 i386 Haskell X11 binding for GHC
ii libghc-xml-dev 1.3.12-1+b2 i386 A simple Haskell XML library - GHC libraries
ii libghc-xmonad-dev 0.10-4+b2 i386 Lightweight X11 window manager; libraries

$ ghc -v
Glasgow Haskell Compiler, Version 7.4.1, stage 2 booted by GHC version 7.4.1
Using binary package database: /usr/lib/ghc/package.conf.d/package.cache
Using binary package database: /home/chris/.ghc/i386-linux-7.4.1/package.conf.d/package.cache
hiding package text-0.11.2.0 to avoid conflict with later version text-0.11.2.3
hiding package mtl-2.1.1 to avoid conflict with later version mtl-2.1.2
wired-in package ghc-prim mapped to ghc-prim-0.2.0.0-bd29cb1ca1b712d64e00ac9207f87d0a
wired-in package integer-gmp mapped to integer-gmp-0.4.0.0-ec87c5d9609a1d46da031ef5d51c4f79
wired-in package base mapped to base-4.5.0.0-c8e7184681d410015e93df85fc49e9dd
wired-in package rts mapped to builtin_rts
wired-in package template-haskell mapped to template-haskell-2.7.0.0-fea440f2bc02cf9a412f25b6b74c4a70
wired-in package dph-seq not found.
wired-in package dph-par not found.
Hsc static flags: -static
* Deleting temp files:
Deleting:
* Deleting temp dirs:
Deleting:
ghc: no input files
Usage: For basic information, try the `--help' option.

$ file lib/Prelude/Complex.idr
lib/Prelude/Complex.idr: UTF-8 Unicode text

$ hexdump -C lib/Prelude/Complex.idr | head
00000000 7b 2d 0a 20 20 c2 a9 20 32 30 31 32 20 43 6f 70 |{-. .. 2012 Cop|
00000010 79 72 69 67 68 74 20 4d 65 6b 65 6f 72 20 4d 65 |yright Mekeor Me|
00000020 6c 69 72 65 0a 2d 7d 0a 0a 0a 6d 6f 64 75 6c 65 |lire.-}...module|
00000030 20 50 72 65 6c 75 64 65 2e 43 6f 6d 70 6c 65 78 | Prelude.Complex|
00000040 0a 0a 69 6d 70 6f 72 74 20 42 75 69 6c 74 69 6e |..import Builtin|
00000050 73 0a 69 6d 70 6f 72 74 20 50 72 65 6c 75 64 65 |s.import Prelude|
00000060 0a 0a 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d |..--------------|
00000070 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d |----------------|
00000080 20 52 65 63 74 61 6e 67 75 6c 61 72 20 66 6f 72 | Rectangular for|
00000090 6d 20 0a 0a 69 6e 66 69 78 20 36 20 3a 2b 0a 64 |m ..infix 6 :+.d|

edwinb commented 12 years ago

This is slightly surprising, since that version of GHC ought to handle Unicode Strings. I suppose what I'll do is change it to a (c) to fix the compilation error, and leave this issue open in case anyone is able to explain it. Thanks for mentioning it.

wrwills commented 11 years ago

I had a similar issue when building on a new system where I hadn't done my locale configuration properly.

With LANG=C, hGetContents was choking on the line "-- and defining i+i = i and i+s = s = s+i for all s ∈ S." in Maybe.idr

Running export LANG=en_GB.UTF-8 and then building again fixed it.
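To check this diagnosis on a given machine, the following is a minimal Haskell sketch that reports the encoding GHC derives from the current locale. It uses only `getLocaleEncoding` from `base` (available since base-4.5); the helper name `localeEncodingName` is made up for illustration and is not part of the Idris source.

```haskell
import GHC.IO.Encoding (getLocaleEncoding)

-- Name of the text encoding GHC derived from the current locale.
-- Handles opened in text mode (and therefore readFile/hGetContents)
-- use this encoding unless told otherwise.
localeEncodingName :: IO String
localeEncodingName = fmap show getLocaleEncoding
```

Printing the result with `localeEncodingName >>= putStrLn` under `LANG=C` and again under a UTF-8 locale should show whether the locale is the culprit.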

Warbo commented 11 years ago

It makes sense that my locale wasn't set properly, as I'd installed Debian via debootstrap, which only does enough configuration to get chroot working. I'll add 'set locale' to my post-install checklist next time ;)

LeifW commented 11 years ago

I think this specific case can be closed now? But there was more in-depth discussion on the mailing list, for allowing unicode in the Idris sources in the future. Something about just having Idris simply assume the sources are UTF8?

Warbo commented 11 years ago

I'm happy for it to close.

LeifW commented 10 years ago

tjice just reported something that looks rather similar in IRC: http://codepad.org/hsRtppRm Builds idris fine, but then idris barfs trying to compile the .idr libs.

LeifW commented 10 years ago

Doing some digging - I suspect hGetContents is being called from the readFile in Idris/Chaser.hs. This issue might shed some light - https://github.com/finnsson/template-helper/issues/2

LeifW commented 10 years ago

Perhaps we could set the encoding to UTF-8 on each file handle we open, to force all the .idr files to be read as UTF-8 rather than using the system locale: "The default encoding when a Handle is created is localeEncoding, namely the default encoding for the current locale." - https://hackage.haskell.org/package/base-4.7.0.0/docs/System-IO.html#g:23 Or another idea: could we set the LANG variable (or whatever) to a UTF-8 locale during the part of the Idris build process where it builds the stdlibs, leaving the end user free to write .idr files in non-UTF-8 on their own?

david-christiansen commented 10 years ago

This sounds horribly complicated. In my opinion, the right thing to do is to just define UTF-8 as the one true encoding for Idris files, and arrange for the Haskell code to always use it.
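One way this could be arranged, sketched here as an assumption rather than as what Idris actually does, is to override the process-wide default once at startup with `setLocaleEncoding` from `GHC.IO.Encoding` (in `base` since 4.5). The name `forceUtf8Default` is hypothetical.

```haskell
import GHC.IO.Encoding (getLocaleEncoding, setLocaleEncoding, utf8)

-- Make UTF-8 the default text encoding for handles opened afterwards
-- in this process, whatever LANG / LC_ALL say.
forceUtf8Default :: IO ()
forceUtf8Default = setLocaleEncoding utf8
```

Calling `forceUtf8Default` as the first action of `main` would cover later `readFile` calls without touching each call site. The trade-off is that it also changes the default for files that are not Idris sources.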

LeifW commented 10 years ago

Thinking of adding a readUtf8File to, say, Util/System.hs, that mimics readFile but sets the encoding of the file handle to UTF-8. We'd also need to replace the file writing from !-suffixed REPL commands with a UTF-8 write equivalent, I imagine.
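A rough sketch of what such helpers might look like, using only `System.IO` from `base`. `readUtf8File` is the name proposed above; `writeUtf8File` is a hypothetical name for the write equivalent, and neither is taken from the Idris source.

```haskell
import System.IO

-- Like readFile, but always decodes the file as UTF-8 instead of
-- using the locale's encoding.
readUtf8File :: FilePath -> IO String
readUtf8File path = do
  h <- openFile path ReadMode
  hSetEncoding h utf8
  hGetContents h  -- lazy, like readFile; the handle closes once fully read

-- Like writeFile, but always encodes the contents as UTF-8.
writeUtf8File :: FilePath -> String -> IO ()
writeUtf8File path contents =
  withFile path WriteMode $ \h -> do
    hSetEncoding h utf8
    hPutStr h contents
```

Because the encoding is set per handle, the rest of the program (and the user's terminal) keeps following the system locale.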

david-christiansen commented 10 years ago

Sounds reasonable if such a thing isn't already in the libraries.

/David (from phone)

LeifW commented 10 years ago

Oh - we have utf8-string as a dep in .cabal, which already has readFile.