segabor / Hunspell

Ruby wrapper for the famous spell checker library hunspell.
GNU Lesser General Public License v3.0

Suggestion strings being encoded incorrectly #3

Closed henrebotha closed 8 years ago

henrebotha commented 8 years ago

Try the following:

sp = Hunspell.new('en_ZA.aff', 'en_ZA.dic')
s = sp.suggest("Skjhd")[0]
# => "Ski\xE2\x80\x99d"
s.encoding
# => #<Encoding:ASCII-8BIT>
s.encode("utf-8")
# => Encoding::UndefinedConversionError: "\xE2" from ASCII-8BIT to UTF-8

(I'm using a South African English dictionary from here.)

What seems to be happening is that the string is tagged as ASCII-8BIT (Ruby's binary encoding) when its bytes are in fact UTF-8. Since bytes above 0x7F, such as \xE2, have no defined mapping from ASCII-8BIT, Ruby throws an error when attempting to convert the string from ASCII-8BIT (supposedly) to UTF-8.

If I force the encoding, the string renders correctly:

s.force_encoding("utf-8")
# => "Ski’d"
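
The same behaviour can be reproduced without Hunspell or a dictionary, which may help when checking the diagnosis. This is only a sketch: the variable names `raw`, `error` and `fixed` are made up for illustration, and it assumes the source file is UTF-8 (the default since Ruby 2.0).

```ruby
# Bytes of "Ski’d" as they might arrive from a C library, tagged as binary.
# String#b returns a copy of the string tagged ASCII-8BIT.
raw = "Ski\xE2\x80\x99d".b

# Transcoding fails because bytes above 0x7F have no defined mapping
# from ASCII-8BIT to UTF-8.
error = nil
begin
  raw.encode("UTF-8")
rescue Encoding::UndefinedConversionError => e
  error = e
end

# force_encoding keeps the bytes and only changes the encoding tag.
# dup is used so the original binary string is left untouched.
fixed = raw.dup.force_encoding("UTF-8")

puts raw.encoding
puts error.class
puts fixed.encoding
puts fixed.valid_encoding?
```

Note that `encode` transcodes the bytes, while `force_encoding` only relabels them, which is why the second one works here.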

segabor commented 8 years ago

Thanks for the report. I will provide a fix based on this article: http://tenderlovemaking.com/2009/06/26/string-encoding-in-ruby-1-9-c-extensions.html

segabor commented 8 years ago

Fix is out with release 0.1.4. Please let me know if it's okay.
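
One possible way to check the release is to confirm that every suggestion comes back tagged as UTF-8 with valid bytes. The helper name `utf8_suggestions?` below is hypothetical and not part of the gem; it only assumes that `suggest` returns an array of strings, as in the report above.

```ruby
# Hypothetical helper, not part of the Hunspell gem: returns true when every
# string in the list is tagged UTF-8 and its bytes form valid UTF-8.
def utf8_suggestions?(suggestions)
  suggestions.all? do |s|
    s.encoding == Encoding::UTF_8 && s.valid_encoding?
  end
end

# Intended usage with the objects from the report (needs the gem and dictionary):
#   sp = Hunspell.new('en_ZA.aff', 'en_ZA.dic')
#   utf8_suggestions?(sp.suggest("Skjhd"))
```

If this returns false after upgrading, the strings are presumably still coming back as ASCII-8BIT.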