rnewson / couchdb-lucene

Enables full-text searching of CouchDB documents using Lucene
Apache License 2.0

Parallel index access (read / write)? #234

Open kr428 opened 8 years ago

kr428 commented 8 years ago

In our environment, the couchdb-lucene index is often under quite heavy load because CouchDB manages a large number of documents with attachments (right now a bit more than 300k documents; the database is almost 9GB at CouchDB compression level 2 or 3). It seems that, obviously(?), whenever couchdb-lucene does indexing work, the index is not available for read access. So, in the worst case, if changes arrive on the database at a very high frequency (like 100 or 200 large documents added in one batch), reading the index is extremely delayed for many clients or runs into timeout errors.

Is this how it's supposed to work? Is Lucene locking the index for updates or something like that? Or is it "just" a threading / parallel processing issue in couchdb-lucene? Does anyone else see this behaviour too? Are there any possible workarounds? Cheers and thanks, Kristian

rnewson commented 8 years ago

That shouldn't be the case but I suspect what you're experiencing is normal (and avoidable).

When you query an index through couchdb-lucene, just as when you query a regular view in couchdb, the code applies any updates that have happened to the database since the last query before processing that query, to ensure you get the correct result.

Now, in couchdb-lucene, there's a thread that continues in the background to keep applying updates as they happen, in the hope that the index will already be fresh, or close to it, when you query.
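To make the behaviour described above concrete, here is a toy model of an index that applies outstanding changes before it answers a query. It is only a sketch of the idea, not couchdb-lucene's actual code; `ToyIndex`, `record_change`, `catch_up` and `query` are hypothetical names invented for illustration.

```python
class ToyIndex:
    """Toy stand-in for an index that catches up before answering a query."""

    def __init__(self):
        self.indexed = {}   # doc id -> text that is already searchable
        self.pending = []   # (doc id, text) changes not yet indexed

    def record_change(self, doc_id, text):
        # The database accepted a write; the index has not seen it yet.
        self.pending.append((doc_id, text))

    def catch_up(self):
        # Apply every outstanding change and report how many were applied.
        # In the description above, a background thread would also do this
        # work continuously so that queries find little left to apply.
        applied = len(self.pending)
        for doc_id, text in self.pending:
            self.indexed[doc_id] = text
        self.pending.clear()
        return applied

    def query(self, term):
        # Update first, then search, so the result reflects every change
        # made before the query arrived.
        self.catch_up()
        return sorted(d for d, t in self.indexed.items() if term in t)
```

In this model, the cost of a query grows with the size of `pending`, which is why a large batch of new documents can make the next query slow.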

You can detect if this is your problem by searching with the ?stale=ok parameter. This will process the query immediately; it won't block to apply any new updates. You should still query the index without stale=ok from time to time, though, to ensure those updates are eventually made.
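One way to run that check is to time the same search with and without stale=ok. The sketch below assumes couchdb-lucene is reached through a CouchDB proxy path such as `http://localhost:5984/_fti/local`; that base URL and the database, design document and index names in the usage note are placeholders, and the helper names are hypothetical, so adjust them to your setup.

```python
import time
import urllib.parse
import urllib.request


def search_url(base, db, ddoc, index, query, stale_ok=False):
    """Build a search URL; the path layout is an assumption about your setup."""
    params = {"q": query}
    if stale_ok:
        params["stale"] = "ok"
    path = "/".join(
        urllib.parse.quote(part, safe="")
        for part in (db, "_design", ddoc, index)
    )
    return f"{base.rstrip('/')}/{path}?{urllib.parse.urlencode(params)}"


def timed_get(url, timeout=60):
    """Fetch the URL and return the elapsed wall-clock time in seconds."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read()
    return time.monotonic() - start


def compare_latency(base, db, ddoc, index, query):
    """Return (stale_seconds, fresh_seconds) for the same search."""
    # Run the stale=ok request first: the fresh request brings the index
    # up to date, which would hide the difference if it ran first.
    stale = timed_get(search_url(base, db, ddoc, index, query, stale_ok=True))
    fresh = timed_get(search_url(base, db, ddoc, index, query))
    return stale, fresh
```

For example, calling `compare_latency("http://localhost:5984/_fti/local", "mydb", "search", "by_text", "hello")` right after a large batch of writes should show whether the delay comes from applying updates: if the stale=ok request returns quickly and the other one does not, that is the likely cause.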

Of course it's possible you're hitting bugs or deficiencies in couchdb-lucene, but please try the above first.

kr428 commented 8 years ago

Thanks for your feedback. I'll see whether the application works with ?stale=ok, but generally I doubt it. In this situation, there essentially is a client that adds content to parts of the CouchDB infrastructure in a first step and wants to provide a list of the (newly added) documents in the second step, so I guess returning stale data would be the same as "wrong". From that point of view it seems more of a conceptual issue on the client side... :|

rnewson commented 8 years ago

stale=ok is often synonymous with 'wrong'. :) I mention it as a debugging method only in this context.