ndmitchell / hoogle

Haskell API search engine
http://hoogle.haskell.org/

Unicode rendering is incorrect #30

Closed jwiegley closed 9 years ago

jwiegley commented 11 years ago

If I search for Eq a => [a] -> a -> Bool on my private Hoogle server, I get this result at the top:

() :: Eq ± => [±] -> ± -> Bool
base-unicode-symbols Data.List.Unicode

It looks like unicode characters are not being rendered properly?

jwiegley commented 11 years ago

Pinging @snoyberg @chrisdone

ndmitchell commented 11 years ago

@hvr added some unicode stuff to Hoogle recently.

hvr commented 11 years ago

that was #20 to be more specific

ndmitchell commented 11 years ago

@jwiegley are you trying pre #20 or post?

jwiegley commented 11 years ago

The version I'm using is from months ago, so I'm pretty sure pre. I'll try the latest to see what effect that has.

hdgarrood commented 9 years ago

This still appears to be the case; for example, search for "e +base-unicode-symbols": http://hoogle.haskell.org/?hoogle=e%20%2Bbase-unicode-symbols

Hoogle currently claims that the HTML it serves is encoded using iso-8859-1, by including <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"> in its <head>, but it seems that simply changing this to utf-8 is not enough to fix it. A snippet from looking at the above page in od shows:

0144520 2f 73 70 61 6e 3e 20 3a 3a 20 4f 72 64 20 b1 20  >/span> :: Ord . <
0144540 3d 26 67 74 3b 20 b1 20 2d 26 67 74 3b 20 b1 20  >=&gt; . -&gt; . <
0144560 2d 26 67 74 3b 20 42 6f 6f 6c 3c 2f 61 3e 3c 2f  >-&gt; Bool</a></<

But looking at the relevant part of base-unicode-symbols.txt inside input/hoogle.tar.gz shows:

0002220 54 45 52 2d 54 48 41 4e 20 4f 52 20 45 51 55 41  >TER-THAN OR EQUA<
0002240 4c 20 54 4f 0a 28 e2 89 a5 29 20 3a 3a 20 4f 72  >L TO.(...) :: Or<
0002260 64 20 ce b1 20 3d 3e 20 ce b1 20 2d 3e 20 ce b1  >d .. => .. -> ..<

I'm focusing on the unicode character alpha α (U+03B1), used as the type variable, which is causing an issue here. In the HTML it comes out as a single byte 0xb1, but in the text input file it's encoded as two bytes, 0xce 0xb1 (that is, it is encoded properly as UTF-8). Somewhere inside Hoogle, characters encoded as more than one byte are being truncated to one byte somehow. Maybe something in one of the Char8 ByteString modules is the culprit? The following modules import them:

src/General/Template.hs
src/General/Store.hs
src/General/Web.hs
src/General/Log.hs
src/Input/Cabal.hs
src/Output/Names.hs
src/Output/Tags.hs
src/Output/Items.hs
src/Output/Types.hs
src/Action/Server.hs
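To illustrate the suspected failure mode, here is a minimal sketch that is not taken from Hoogle's source. It assumes the bytestring and text packages, and every name in it (alpha, truncatedBytes, utf8Bytes, renderedAsLatin1) is made up for the example. It shows that Data.ByteString.Char8.pack keeps only the low 8 bits of each Char, which would turn U+03B1 into the single byte 0xb1 seen in the HTML dump, and that a client reading that byte as ISO-8859-1 would display U+00B1 (the plus-minus sign from the original report).

```haskell
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as BS8
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Data.Word (Word8)

-- Hypothetical names throughout; this is an illustration, not Hoogle code.

-- | The type variable from the search result: GREEK SMALL LETTER ALPHA (U+03B1).
alpha :: String
alpha = "\x3b1"

-- | Char8.pack truncates each Char to 8 bits, so U+03B1 becomes the byte 0xb1.
truncatedBytes :: [Word8]
truncatedBytes = BS.unpack (BS8.pack alpha)

-- | Encoding via Text produces the two-byte UTF-8 sequence 0xce 0xb1.
utf8Bytes :: [Word8]
utf8Bytes = BS.unpack (TE.encodeUtf8 (T.pack alpha))

-- | Reading the truncated byte as ISO-8859-1 gives U+00B1 (plus-minus sign).
renderedAsLatin1 :: String
renderedAsLatin1 = T.unpack (TE.decodeLatin1 (BS.pack truncatedBytes))
```

If this is what is happening, any code path that builds output with Char8.pack from a String containing non-Latin-1 characters would produce the same truncation, regardless of the charset declared in the page.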

ndmitchell commented 9 years ago

Hoogle is meant to use UTF8 encoded bytestrings everywhere. I suspect some of the ByteString.Char8 modules can be replaced with ByteString, which reduces the possibility of messing up character encoding. I should definitely be saying the HTML is UTF8, but there will be somewhere else too.

I suspect the long-term approach to eliminating this kind of error by construction is to have a General.Str module exporting an abstract type, newtype Str = Str ByteString, and exposing only the very small subset of ByteString operations I actually use, plus safe UTF8 encoding/decoding.
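A rough sketch of what such a wrapper could look like, assuming the bytestring and text packages. The function names (strPack, strUnpack, strToUtf8, strFromUtf8) are hypothetical and chosen only for this example; this is not the API of any actual General.Str module.

```haskell
import qualified Data.ByteString as BS
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Hypothetical sketch. In a real module the export list would name the type
-- but not its constructor, e.g. (Str, strPack, strUnpack, strToUtf8, strFromUtf8),
-- so that callers cannot wrap arbitrary bytes.

-- | Invariant: the wrapped bytes are always valid UTF-8.
newtype Str = Str BS.ByteString
  deriving (Eq, Ord)

-- | Appending two valid UTF-8 byte sequences keeps the invariant.
instance Semigroup Str where
  Str a <> Str b = Str (a <> b)

-- | Encode a String as UTF-8 (never truncates code points).
strPack :: String -> Str
strPack = Str . TE.encodeUtf8 . T.pack

-- | Decode back to a String; relies on the invariant above.
strUnpack :: Str -> String
strUnpack (Str bs) = T.unpack (TE.decodeUtf8 bs)

-- | Expose the raw UTF-8 bytes, e.g. for writing to a web server or to disk.
strToUtf8 :: Str -> BS.ByteString
strToUtf8 (Str bs) = bs

-- | Accept external bytes only if they are valid UTF-8.
strFromUtf8 :: BS.ByteString -> Maybe Str
strFromUtf8 bs = either (const Nothing) (const (Just (Str bs))) (TE.decodeUtf8' bs)
```

Because the only ways to build a Str are strPack, strFromUtf8 and (<>), a Char8-style truncation cannot sneak in through this type.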

hdgarrood commented 9 years ago

Why not use Text? For performance?

ndmitchell commented 9 years ago

Three reasons:

  1. Memory footprint. Text takes basically twice what ByteString does, especially for code which is 99+% ASCII.
  2. C interop is much harder with Text, which makes things like writing an optimised C search for the hot-loops harder.
  3. Everything is stored as either ByteString or Storable Vector, which means General.Store can mmap a file to a pointer and then create values from it in O(1). Since Text isn't a Ptr underneath, you can't do that with Text; you have to copy.
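As a rough illustration of the first point, the sketch below compares payload sizes only, ignoring per-value headers. It assumes the bytestring and text packages, and payloadSizes is a made-up helper. At the time of this thread Text stored UTF-16 internally, so the UTF-16 length approximates its payload; text 2.0 and later use UTF-8 internally, so the comparison applies to older versions.

```haskell
import qualified Data.ByteString as BS
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- | Hypothetical helper: (UTF-8 byte count, UTF-16 byte count) for a string.
payloadSizes :: String -> (Int, Int)
payloadSizes s = (BS.length (TE.encodeUtf8 t), BS.length (TE.encodeUtf16LE t))
  where
    t = T.pack s
```

For pure ASCII input such as a typical type signature, the second component is exactly twice the first; for a character like U+03B1 both encodings take two bytes.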

chrisdone commented 9 years ago

> I suspect the long-term approach to making sure these kind of errors are eliminated by definition is to have a General.Str module exporting an abstract type newtype Str = Str ByteString, and then only exposing the very small subset of ByteString operations I actually use, plus safe UTF8 encoding/decoding.

Yeah, I was going to suggest the same before reading this paragraph. A ByteString wrapped up in an opaque type which can be extracted to Text as UTF8String -> Text via decodeUtf8 would be a nice abstraction to avoid ever accidentally exposing the ByteString to anywhere public. I remember debating in the past with others that such a type is valuable for when Text's overhead is too high.

ndmitchell commented 9 years ago

I'm not sure I ever want Text - I'm doing barely any textual operations, and the ones I do have are written in C or are trivial appends where lifting the underlying ByteString works just fine. For input I need String -> UTF8String and for output I need to write to Warp or the serialisation layer, both of which want a UTF8 ByteString. I'll probably have a go at writing such a wrapper tomorrow - hopefully just fixing up the types will fix this Unicode bug at the same time.

chrisdone commented 9 years ago

I meant rather for public consumption. I think generally Text is now the type people reach for by default. Usually the whole codebase (e.g. mine and ours at FPCo) is all Text and then we unpack/pack to interface with the few libraries that use String. In the case of hoogle that's just a negligible inconvenience rather than a performance overhead. I'm not suggesting you change your API, just explaining why by habit I think in terms of Text and not String.

hvr commented 9 years ago

@chrisdone in some cases I would even go as far as to use a lighter wrapper around ByteArray#s, e.g. if I have lots of small (< 40 bytes) utf8-encoded strings, as ByteStrings have quite a bit of overhead due to their richer abstraction and also bear the risk of unexpected sharing (which can cause space leaks due to GC retention)

ndmitchell commented 9 years ago

In Hoogle there are really two passes. During generation all space is precious, since my VM has only 1 GB of RAM. I do common string elimination, and a light bytestring might be helpful (I tried the new one in the ByteString library but it didn't make a huge difference). For every 100 MB of space I add, the generation takes an additional minute, because it's hitting swap.

During running, I have huge bytestrings, which are backed by memory mapped files. There ByteStrings are perfect (and a UTF8 newtype around works fine).

ndmitchell commented 9 years ago

I've pushed a bunch of patches and added a few tests. I introduced a new General.Str module which encapsulates UTF8 encoded bytestrings, but I've only applied it in select places. At some point in the future I'll go round and clean it up. However, for users, most unicode things "just work" - if you spot something that doesn't, let me know. These patches will go live on the http://hoogle.haskell.org/ server in about 8 hours.

ndmitchell commented 9 years ago

All live. I'm not aware of anything that doesn't work with Unicode, but would be surprised if there wasn't anything. Please raise follow up tickets.