rnewson / couchdb-lucene

Enables full-text searching of CouchDB documents using Lucene
Apache License 2.0

Querying UTF-8 documents with include_docs=true gives wrong results #250

Closed (RaviBolla closed this issue 7 years ago)

RaviBolla commented 7 years ago

When I enter documents containing UTF-8 characters such as öäüß into CouchDB via Futon and then query them with

curl 'http://127.0.0.1:5986/_fti/local/test/_design/feedSearch/by_document?q=title:chóc*&limit=5&include_docs=true'

I get the following response:

{
  "rows": [
    {
      "score": 1,
      "doc": {
        "_rev": "3-196e062ccfdb20db36b2f578114cda5d",
        "description": "ra",
        "_id": "1",
        "title": "chócacao"
      },
      "id": "1",
      "fields": {
        "title": "chócacao"
      }
    }
  ]
}

doc.title is not a proper UTF-8 string, whereas fields.title in the same row is.
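To check a response for this discrepancy automatically, something like the following sketch might help. It compares each stored field under "fields" with the same key under "doc" and reports the rows where they differ. The function name mismatched_fields is hypothetical, and the garbled sample value is only simulated here by decoding UTF-8 bytes as ISO-8859-1, which is an assumption about the cause and not something the report confirms.

```python
import json


def mismatched_fields(response_text):
    """Return (row id, field name) pairs where row["doc"] and row["fields"] disagree."""
    mismatches = []
    for row in json.loads(response_text).get("rows", []):
        doc = row.get("doc") or {}
        fields = row.get("fields") or {}
        for name, value in fields.items():
            if name in doc and doc[name] != value:
                mismatches.append((row.get("id"), name))
    return mismatches


# Simulated garbling (assumption): UTF-8 bytes misread as ISO-8859-1.
garbled = "chócacao".encode("utf-8").decode("latin-1")

sample = json.dumps({
    "rows": [{
        "score": 1,
        "doc": {"_id": "1", "description": "ra", "title": garbled},
        "id": "1",
        "fields": {"title": "chócacao"},
    }]
})

print(mismatched_fields(sample))
```

This only works for fields that the index function stores, since those are the only ones that appear under "fields".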

rvanzon commented 7 years ago

Facing the same problem. The only thing that works for me is converting the returned results to UTF-8 in code.

streunerlein commented 7 years ago

@rvanzon: I am having the same problem as you. Would you mind sharing your fix?

rvanzon commented 7 years ago

@streunerlein I use https://github.com/ashtuchkin/iconv-lite to decode the results to UTF-8.
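For readers not on Node, a client-side workaround of the same kind could be sketched in Python as below. It assumes the garbled strings are UTF-8 bytes that were decoded as ISO-8859-1 somewhere along the way; the thread does not state which wrong charset was involved, so treat that as a guess. The helper name repair_mojibake is hypothetical.

```python
def repair_mojibake(text: str) -> str:
    """Undo a UTF-8-read-as-Latin-1 mix-up; return the input unchanged if it does not apply."""
    try:
        # Recover the original bytes, then decode them with the intended charset.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in Latin-1, or not valid UTF-8 afterwards: leave it alone.
        return text


# Assumption: this is how the broken doc.title came about.
garbled = "chócacao".encode("utf-8").decode("latin-1")
print(repair_mojibake(garbled))
```

The fallback matters because correctly decoded text usually fails the round trip and is returned as-is. A correct string that happens to form valid UTF-8 after the Latin-1 step would still be altered, so this is a stopgap and not a substitute for fixing the server side.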

streunerlein commented 7 years ago

Thanks @rvanzon. I tried iconv directly and also iconv-lite, with no luck in either case. I guess this is due to https://github.com/ashtuchkin/iconv-lite/wiki/Use-Buffers-when-decoding: I don't have access to the original buffers.
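The point about buffers can be illustrated with a small Python sketch. Once bytes have been turned into a string with a lossy decoding, the original bytes are gone and no later conversion can bring them back, so decoding has to happen on the raw bytes. The ASCII-with-replacement decoding below is just a stand-in for a lossy step and is not claimed to be what happened in this case.

```python
raw = "öäüß".encode("utf-8")  # the bytes as they would travel over the wire

# Decoding the raw bytes with the right charset works.
good = raw.decode("utf-8")

# A lossy decoding replaces every undecodable byte with U+FFFD.
lossy = raw.decode("ascii", errors="replace")

print(good)
print(len(raw), len(lossy))
# Every character in `lossy` is the same replacement character,
# so the original bytes cannot be reconstructed from it.
print(set(lossy) == {"\ufffd"})
```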

I'll check if I can fix couchdb-lucene.

streunerlein commented 7 years ago

https://github.com/streunerlein/couchdb-lucene/commit/b46af6c81696a94e943b44241f5f6246caa62c0a fixes the issue. I will create a PR for this.

rvanzon commented 7 years ago

@streunerlein great! thanks :-)