johnl / xapian-fu

A nicer Ruby interface for the Xapian full text indexer
https://www.rubydoc.info/github/johnl/xapian-fu

Text encoding #12

Open singpolyma opened 13 years ago

singpolyma commented 13 years ago

Since all text in Xapian is UTF-8, strings coming back out of xapian-fu should be encoded as UTF-8 (probably just by calling `force_encoding('utf-8')` on strings as they come out).

Right now the strings come out marked as local encoding, but are actually utf-8, and this causes some problems.
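To illustrate the mismatch, here is a minimal sketch in plain Ruby (no Xapian involved). The mislabelled string is simulated by hand, and ISO-8859-1 only stands in for whatever the local encoding happens to be. `force_encoding` changes the encoding tag without touching the bytes:

```ruby
# Bytes of "café" in UTF-8, deliberately tagged with the wrong encoding
# to imitate a string that is "marked as local encoding".
mislabelled = "caf\xC3\xA9".b.force_encoding(Encoding::ISO_8859_1)

# Ruby now counts each byte as a character, so the length is off.
wrong_length = mislabelled.length

# Re-tag the same bytes as UTF-8; no transcoding happens here.
fixed = mislabelled.dup.force_encoding('utf-8')
right_length = fixed.length

puts "#{mislabelled.encoding}: #{wrong_length} chars"
puts "#{fixed.encoding}: #{right_length} chars, valid=#{fixed.valid_encoding?}"
```

The byte sequence is identical before and after; only the label, and therefore how Ruby counts and compares characters, changes.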

djanowski commented 13 years ago

What if you set Encoding.default_external?
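For reference, that workaround might look like the sketch below. It is process-wide, and the file read is only there to show the effect on externally sourced strings; whether it also changes how strings from the Xapian bindings are tagged depends on how the bindings create them, which is an assumption here.

```ruby
require 'tempfile'

# Process-wide setting: strings read from external sources (files, etc.)
# are tagged with this encoding unless told otherwise.
Encoding.default_external = Encoding::UTF_8

read_encoding = nil
Tempfile.create('encoding-demo') do |file|
  file.write("caf\xC3\xA9".b)   # raw UTF-8 bytes for "café"
  file.flush
  # File.read tags the result with Encoding.default_external.
  read_encoding = File.read(file.path).encoding
end

puts read_encoding
```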

singpolyma commented 13 years ago

Sure, I can get around it, but the point is that since all of the data is always in fact going to be UTF-8, the library should honour that.

djanowski commented 13 years ago

I guess that's right, as long as Xapian always stores/returns UTF-8.

What should we do when storing? Should an exception be raised if the string is not UTF-8?

singpolyma commented 13 years ago

I'm not sure how the Xapian bindings handle things, but if they just use the raw bytestream and assume it's UTF-8 (because, yes, Xapian always stores/returns UTF-8) then you should probably call `.encode('utf-8')`, and if there's a problem Ruby will raise the exception for you :)
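A rough sketch of what that would do on the way in, using plain Ruby strings rather than anything from xapian-fu (the sample strings are made up for illustration). One caveat worth knowing: `encode('utf-8')` on a string that is already tagged UTF-8 is a no-op and does not validate the bytes, so `valid_encoding?` is the check for that case.

```ruby
# A string honestly tagged as ISO-8859-1 is transcoded to UTF-8.
latin1 = "caf\xE9".b.force_encoding(Encoding::ISO_8859_1)
utf8   = latin1.encode('utf-8')

# A binary (ASCII-8BIT) string with high bytes cannot be converted,
# so Ruby raises instead of guessing.
binary = "caf\xE9".b
error = begin
  binary.encode('utf-8')
  nil
rescue Encoding::UndefinedConversionError => e
  e
end

# Bytes that are tagged UTF-8 but are not valid UTF-8 need an explicit check.
bogus = "caf\xE9".b.force_encoding('utf-8')
bogus_valid = bogus.valid_encoding?

puts utf8.bytes.inspect
puts error.class
puts bogus_valid
```

So transcoding catches wrongly encoded input in most cases, but a mislabelled UTF-8 string would slip through unless it is validated separately.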