Pinging @snoyberg @chrisdone
@hvr added some unicode stuff to Hoogle recently.
that was #20 to be more specific
@jwiegley are you trying pre #20 or post?
The version I'm using is from months ago, so I'm pretty sure pre. I'll try the latest to see what effect that has.
This looks to still be the case, for example, by searching for "e +base-unicode-symbols". http://hoogle.haskell.org/?hoogle=e%20%2Bbase-unicode-symbols
Hoogle currently claims that the HTML it serves is encoded using iso-8859-1, by including <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"> in its <head>, but it seems that simply changing this to utf-8 is not enough to fix it. A snippet from looking at the above page in od shows:
0144520 2f 73 70 61 6e 3e 20 3a 3a 20 4f 72 64 20 b1 20 >/span> :: Ord . <
0144540 3d 26 67 74 3b 20 b1 20 2d 26 67 74 3b 20 b1 20 >=> . -> . <
0144560 2d 26 67 74 3b 20 42 6f 6f 6c 3c 2f 61 3e 3c 2f >-> Bool</a></<
But looking at the relevant part of base-unicode-symbols.txt inside input/hoogle.tar.gz shows:
0002220 54 45 52 2d 54 48 41 4e 20 4f 52 20 45 51 55 41 >TER-THAN OR EQUA<
0002240 4c 20 54 4f 0a 28 e2 89 a5 29 20 3a 3a 20 4f 72 >L TO.(...) :: Or<
0002260 64 20 ce b1 20 3d 3e 20 ce b1 20 2d 3e 20 ce b1 >d .. => .. -> ..<
I'm focusing on the unicode character alpha α (U+03B1), used as the type variable, which is causing the problem here. In the HTML it comes out as the single byte 0xb1, but in the text input file it is encoded as the two bytes 0xce 0xb1 (that is, it is encoded properly as UTF-8). Somewhere inside Hoogle, characters encoded as more than one byte are being truncated to a single byte. Maybe something in one of the Char8 ByteString modules is the culprit? The following modules import them:
src/General/Template.hs src/General/Store.hs src/General/Web.hs src/General/Log.hs src/Input/Cabal.hs src/Output/Names.hs src/Output/Tags.hs src/Output/Items.hs src/Output/Types.hs src/Action/Server.hs
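To make the suspected truncation concrete, here is a minimal self-contained demonstration (my own example, not Hoogle's code) of how Data.ByteString.Char8.pack silently keeps only the low 8 bits of each character, turning α (0xce 0xb1 in UTF-8) into the lone byte 0xb1:

```haskell
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as BSC
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

main :: IO ()
main = do
    -- 'α' is U+03B1; Char8.pack truncates each Char to a Word8,
    -- so the α collapses to the single byte 0xB1.
    print $ BS.unpack (BSC.pack "Ord α => α")
    -- Encoding properly (here via Text) keeps the two bytes 0xCE 0xB1.
    print $ BS.unpack (TE.encodeUtf8 (T.pack "Ord α => α"))
```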
Hoogle is meant to use UTF8 encoded bytestrings everywhere. I suspect some of the ByteString.Char8 modules can be replaced with ByteString, which reduces the possibility of messing up character encoding. I should definitely be saying the HTML is UTF8, but there will be somewhere else too.
I suspect the long-term approach to making sure these kinds of errors are eliminated by definition is to have a General.Str module exporting an abstract type newtype Str = Str ByteString, and then only exposing the very small subset of ByteString operations I actually use, plus safe UTF8 encoding/decoding.
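A rough sketch of what such a module might look like (my illustration - the function names and the operation set are invented, not Hoogle's actual module):

```haskell
module General.Str
    ( Str
    , strPack, strUnpack
    , strNull, strAppend
    ) where

import qualified Data.ByteString as BS
import qualified Data.ByteString.UTF8 as UTF8   -- from the utf8-string package

-- The constructor is deliberately not exported, so raw bytes never leak out.
newtype Str = Str BS.ByteString
    deriving (Eq, Ord)

strPack :: String -> Str
strPack = Str . UTF8.fromString        -- always encode as UTF-8

strUnpack :: Str -> String
strUnpack (Str bs) = UTF8.toString bs  -- always decode as UTF-8

strNull :: Str -> Bool
strNull (Str bs) = BS.null bs

-- Appending two valid UTF-8 strings byte-wise is still valid UTF-8.
strAppend :: Str -> Str -> Str
strAppend (Str a) (Str b) = Str (BS.append a b)
```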
Why not use Text? For performance?
Three reasons: General.Store can mmap a file to a pointer, then create O(1) values from it. Since Text isn't a Ptr underneath, you can't do that with Text - you have to copy.
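For example, using the mmap package's mmapFileByteString (the file name below is made up), slicing the mapped ByteString is O(1) because only the offset and length fields change:

```haskell
import qualified Data.ByteString as BS
import System.IO.MMap (mmapFileByteString)   -- from the mmap package

main :: IO ()
main = do
    -- Map the whole file; nothing is copied onto the Haskell heap.
    bs <- mmapFileByteString "hoogle.store" Nothing
    -- take/drop only adjust offset and length: O(1) slices that
    -- still point into the memory-mapped region.
    let slice = BS.take 64 (BS.drop 128 bs)
    print (BS.length slice)
```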
Yeah, I was going to suggest the same before reading this paragraph. A ByteString wrapped up in an opaque type which can be extracted to Text as UTF8String -> Text via decodeUtf8 would be a nice abstraction to avoid ever accidentally exposing the ByteString anywhere public. I remember debating in the past with others that such a type is valuable for when Text's overhead is too high.
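Something along these lines, where UTF8String is the hypothetical opaque wrapper and decodeUtf8 is the only way out to a public textual type:

```haskell
import Data.ByteString (ByteString)
import Data.Text (Text)
import Data.Text.Encoding (decodeUtf8)

-- Hypothetical opaque wrapper; constructor kept private in practice.
newtype UTF8String = UTF8String ByteString

-- decodeUtf8 throws on malformed input; decodeUtf8With lenientDecode
-- would be the forgiving alternative.
toText :: UTF8String -> Text
toText (UTF8String bs) = decodeUtf8 bs
```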
I'm not sure I ever want Text - I'm doing barely any textual operations, and the ones I do have are written in C or are trivial append things where lifting the underlying ByteString works just fine. For input I need String -> UTF8String, and for output I need to write to Warp or the serialisation layer, both of which want a UTF8 ByteString. I'll probably have a go at writing such a wrapper tomorrow - hopefully just fixing up the types will fix this Unicode bug at the same time.
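For the output side, a hedged sketch of handing UTF-8 bytes to Warp/WAI while declaring the charset in the Content-Type header, rather than the iso-8859-1 claim mentioned earlier (the page body and port are made up):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString.Lazy as LBS
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Network.HTTP.Types (status200)
import Network.Wai (Application, responseLBS)
import Network.Wai.Handler.Warp (run)

app :: Application
app _req respond = do
    -- Encode the page as UTF-8 and say so in the Content-Type header.
    let body = TE.encodeUtf8 (T.pack "<p>Ord α =&gt; α -&gt; Bool</p>")
    respond $ responseLBS status200
        [("Content-Type", "text/html; charset=utf-8")]
        (LBS.fromStrict body)

main :: IO ()
main = run 8080 app
```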
I meant rather for public consumption. I think Text is now generally the type people reach for by default. Usually the whole codebase (e.g. mine and ours at FPCo) is all Text, and then we unpack/pack to interface with the few libraries that use String. In the case of hoogle that's just a negligible inconvenience rather than a performance overhead. I'm not suggesting you change your API, just explaining why by habit I think in terms of Text and not String.
@chrisdone in some cases I would even go as far as to use a lighter wrapper around ByteArray#s, e.g. if I have lots of small (< 40 byte) utf8-encoded strings, as ByteStrings have quite a bit of overhead due to their richer abstraction and also bear the risk of unexpected sharing (which can cause space leaks due to GC retention).
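For reference, Data.ByteString.Short in recent bytestring releases is one such lighter wrapper over ByteArray#; a small illustration (my own, with invented function names):

```haskell
import qualified Data.ByteString as BS
import qualified Data.ByteString.Short as SBS  -- bytestring >= 0.10.4

-- toShort copies the bytes into an unpinned ByteArray#, so a small key
-- carries no offset/length/ForeignPtr overhead and no longer retains
-- the large buffer it was sliced from.
internKey :: BS.ByteString -> SBS.ShortByteString
internKey = SBS.toShort

-- Convert back (also a copy) when the full ByteString API is needed.
expandKey :: SBS.ShortByteString -> BS.ByteString
expandKey = SBS.fromShort
```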
In Hoogle there are really two passes. During generation (since my VM has only 1GB of RAM) all space is precious. I do common string elimination, and a light bytestring might be helpful (I tried the new one in the ByteString library but it didn't make a huge difference). For every 100MB of space I add, generation takes an additional minute, because it's hitting swap.
While running, I have huge bytestrings, which are backed by memory-mapped files. There ByteStrings are perfect (and a UTF8 newtype around them works fine).
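A hedged sketch of what common string elimination (interning) can look like, purely for illustration and not Hoogle's implementation - keep a map of strings already seen and hand back the stored copy, so duplicates share one allocation:

```haskell
import qualified Data.ByteString as BS
import qualified Data.Map.Strict as Map

type Interned = Map.Map BS.ByteString BS.ByteString

-- Return the shared copy if we've seen this string before,
-- otherwise remember a fresh copy of it for next time.
intern :: Interned -> BS.ByteString -> (Interned, BS.ByteString)
intern seen s = case Map.lookup s seen of
    Just shared -> (seen, shared)
    Nothing     -> let s' = BS.copy s  -- drop any reference to a larger parent buffer
                   in (Map.insert s' s' seen, s')
```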
I've pushed a bunch of patches and added a few tests. I introduced a new General.Str module which encapsulates UTF8-encoded bytestrings, but I've only applied it in select places. At some point in the future I'll go round and clean it up. However, for users, most unicode things "just work" - if you spot something that doesn't, let me know. These patches will go live on the http://hoogle.haskell.org/ server in about 8 hours.
All live. I'm not aware of anything that doesn't work with Unicode, but I would be surprised if there wasn't something. Please raise follow-up tickets.
If I search for Eq a => [a] -> a -> Bool on my private Hoogle server, I get this result at the top. It looks like unicode characters are not being rendered properly?