rnewson / couchdb-lucene

Enables full-text searching of CouchDB documents using Lucene
Apache License 2.0

couchdb-lucene gets stuck indexing if an attachment is in docx, pptx, or xlsx format #288

Open serhaton opened 3 years ago

serhaton commented 3 years ago

I am using couchdb-lucene 2.2.0 installed on Windows Server 2019. The CouchDB version I am using is 3.1.1.

Full-text searching works fine with document properties. I also wanted to index the content of the documents' attachments, so I configured the design document as follows:

{
  "_id": "_design/fts",
  "_rev": "2-ec9dfea8eaa44056d74b44135776ef05",
  "fulltext": {
    "by_message": {
      "index": "function(doc) { var ret=new Document(); if (doc._attachments) {for(var a in doc._attachments){ret.attachment('file',a);}} return ret; }"
    }
  }
}
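For readability, the one-line "index" function above can be expanded as in the sketch below. The logic is unchanged. The `Document` class here is only a hypothetical stand-in so the snippet runs outside couchdb-lucene (the real object is supplied by the indexer), and the name `byMessage` and the sample document are made up for illustration.

```javascript
// Hypothetical stand-in for the Document object that couchdb-lucene
// provides to index functions; it only records the calls made on it.
class Document {
  constructor() { this.attachments = []; }
  attachment(field, name) { this.attachments.push({ field: field, name: name }); }
}

// Same logic as the "by_message" index function, expanded for readability.
function byMessage(doc) {
  var ret = new Document();
  if (doc._attachments) {
    for (var name in doc._attachments) {
      // Ask the indexer to extract the text of attachment `name`
      // into the field "file".
      ret.attachment('file', name);
    }
  }
  return ret;
}

// Made-up document with two attachments, just to exercise the function.
var result = byMessage({
  _id: 'example',
  _attachments: { 'a.pdf': {}, 'b.docx': {} }
});
console.log(result.attachments.map(function (a) { return a.name; }));
```

Every attachment, regardless of type, is passed to the same "file" field, so a single attachment that cannot be processed would affect the whole "by_message" index.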

When I upload attachments of type pdf, txt, or Word (doc), everything works as expected. Below is the result of searching for the keyword "Sesame Street" in a ppt document, which works fine.

C:\Users\serhato>curl "http://localhost:5986/localx/denemeserhat/_design/fts/by_message?q=file:sesame%20street"
{"q":"file:sesame default:street","fetch_duration":0,"total_rows":1,"limit":25,"search_duration":1,"etag":"5fe7a890d6f8","skip":0,"rows":[{"score":0.8342865109443665,"id":"2da37b26c45c8a62f7824f7aab015e01"}]}
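Note that the parsed query in the response is "file:sesame default:street": only the first word is matched against the "file" field. In Lucene query syntax a phrase has to be quoted to keep both words in the same field. A minimal sketch of building such a request URL, assuming the same host and index path as above (this is unrelated to the timeout itself):

```javascript
// Base URL copied from the request above.
var base = 'http://localhost:5986/localx/denemeserhat/_design/fts/by_message';

// Lucene phrase syntax: the quotes bind both words to the "file" field.
var query = 'file:"sesame street"';

// Percent-encode the colon, quotes, and space for use in a URL.
var url = base + '?q=' + encodeURIComponent(query);
console.log(url);
```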

Then I upload a docx file (even a nearly empty one containing only some plain text; for this specific problem my docx contains only the text 'This is an example document which I have indexing problem on couchdb-lucene') or a pptx attachment to any of the documents and re-run the above request. It then fails with a timeout error every time.

C:\Users\serhato>curl "http://localhost:5986/localx/denemeserhat/_design/fts/by_message?q=file:sesame%20street"
{"code":500}

The log shows only the message below:

2021-02-28 12:16:11,608 WARN [HttpChannel] handleException /localx/denemeserhat/_design/fts/by_message java.io.IOException: Search timed out.

If I try to search with the keyword 'problem', which is in the Word document, the result is the same timeout.

C:\Users\serhato>curl "http://localhost:5986/localx/denemeserhat/_design/fts/by_message?q=file:problem"
{"code":500}

If I try with stale=ok, it responds with an empty result.

C:\Users\serhato>curl "http://localhost:5986/localx/denemeserhat/_design/fts/by_message?q=file:problem&stale=ok"
{"q":"file:problem","fetch_duration":0,"total_rows":0,"limit":25,"search_duration":0,"etag":"7d0b10eb5800","skip":0,"rows":[]}

So indexing is somehow stuck forever. Restarting couchdb-lucene does not change anything. If I delete the document with the docx file from CouchDB and then restart couchdb-lucene, everything starts working again.

I believe the problem is related to zip-based document formats such as docx, xlsx, and pptx.

serhaton commented 3 years ago

EDIT:

I suspected the problem might be related to running Lucene on Windows, so I installed CouchDB and couchdb-lucene on Ubuntu Server 20.04. But the result is the same.

Everything works fine until I upload a docx or pptx document. It works fine with doc, rtf, txt, and pdf files.

I am really stuck with this problem.