ndmitchell / hoogle

Haskell API search engine
http://hoogle.haskell.org/

Fix broken Unicode handling in `instance Store Char` #20

Closed hvr closed 11 years ago

hvr commented 11 years ago

The previous implementation of `instance Store Char` silently truncated every Unicode code point to 8 bits. This commit changes the implementation to store `Char` values as UTF-8, which leaves the on-disk format unchanged for pure-ASCII identifiers. The UTF-8 encoding itself is provided by the `text` package, which is added to the build dependencies.
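To illustrate the difference, here is a minimal sketch; it is not the code from this commit. The function names (`truncateChar`, `roundTripTruncated`, `encodeChar`, `decodeChar`) are hypothetical and only assume that the old instance kept the low 8 bits of each code point. The encoding calls are `encodeUtf8` and `decodeUtf8` from the `text` package.

```haskell
import qualified Data.ByteString as BS
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Hypothetical stand-in for the old behaviour: keep only the low 8 bits
-- of the code point (fromIntegral to Word8 wraps modulo 256).
truncateChar :: Char -> Word8
truncateChar = fromIntegral . ord

-- Reading the truncated byte back yields a different character for
-- anything above U+00FF.
roundTripTruncated :: Char -> Char
roundTripTruncated = chr . fromIntegral . truncateChar

-- UTF-8 encoding of a single Char via the text package.
-- ASCII characters still occupy exactly one byte.
encodeChar :: Char -> BS.ByteString
encodeChar = TE.encodeUtf8 . T.singleton

-- Inverse of encodeChar; assumes the input is a valid, non-empty
-- UTF-8 encoding of one character.
decodeChar :: BS.ByteString -> Char
decodeChar = T.head . TE.decodeUtf8
```

Under these assumptions, `'λ'` (U+03BB) comes back from the truncating round trip as `'»'` (U+00BB), whereas `encodeChar` produces the two bytes `0xCE 0xBB` and `decodeChar` restores the original character. For `'a'`, `encodeChar` yields the single byte `97`, which matches the claim that pure-ASCII data keeps the same layout.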

Note: Due to the UTF-8 encoding, the `getList` implementation could not be retained in its previous form. The long-term solution, however, should be to switch from `[Char]` to `Text` for serializing text strings, which can then simply be serialized as UTF-8 encoded `ByteString`s.
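A rough sketch of what that long-term approach could look like, assuming a length-prefixed layout: the byte length of the UTF-8 payload is written first, so the reader never has to count characters. The names `putText` and `getText` and the 4-byte big-endian prefix are illustrative assumptions, not hoogle's actual on-disk format.

```haskell
import qualified Data.ByteString as BS
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Data.Bits (shiftL, shiftR, (.|.))

-- Encode a length as 4 big-endian bytes (hypothetical header format).
encodeLen :: Int -> BS.ByteString
encodeLen n = BS.pack [fromIntegral (n `shiftR` s) | s <- [24, 16, 8, 0]]

-- Decode the big-endian header back into a length.
decodeLen :: BS.ByteString -> Int
decodeLen = BS.foldl' (\acc b -> (acc `shiftL` 8) .|. fromIntegral b) 0

-- Serialize Text as: byte length of the UTF-8 payload, then the payload.
putText :: T.Text -> BS.ByteString
putText t = BS.append (encodeLen (BS.length payload)) payload
  where
    payload = TE.encodeUtf8 t

-- Read one length-prefixed Text and return it with the remaining input.
-- Assumes the input is well formed; decodeUtf8 throws on invalid UTF-8.
getText :: BS.ByteString -> (T.Text, BS.ByteString)
getText bs = (TE.decodeUtf8 payload, rest)
  where
    (hdr, body) = BS.splitAt 4 bs
    (payload, rest) = BS.splitAt (decodeLen hdr) body
```

The prefix counts bytes rather than characters because the two differ under UTF-8: `T.pack "λx"` has two characters but a three-byte payload, so a reader that pulls one byte per element would stop too early.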

ndmitchell commented 11 years ago

I'm guessing the loss of getList makes loading up documentation and displaying it slower. But since loading up documentation is relatively quick compared to searching, and searching doesn't really store any strings, I suspect it doesn't matter. I agree that moving to Text for these pieces would be a much nicer solution.