tjake / Solandra

Solandra = Solr + Cassandra
Apache License 2.0

Problems deleting documents using the Lucandra IndexWriter #44

Closed magloven closed 13 years ago

magloven commented 14 years ago

========= SHORT =========

Problems with the Lucandra IndexWriter:

  1. Calling the deleteDocuments(Query) method will at most delete 1000 documents
  2. Passing a MatchAllDocsQuery to the deleteDocuments(Query) method will not remove any documents at all (since the Lucandra IndexSearcher won't return any hits for such a query).
  3. There is no deleteAll() method (as in Lucene 2.9.0 and later)

Request for improvements:

  1. Introduce a deleteDocuments(Query query, int maxNumberOfDocumentsToDelete) method for IndexWriter
  2. Ensure MatchAllDocsQuery will be accepted when using a Lucandra IndexSearcher
  3. If possible - introduce a deleteAll() method for IndexWriter

========= LONG =========

We have an increasing number of indexes that contain lots of small documents. Most of the fields contain arbitrary/unknown values; some contain a known set of values. The documents are based on data stored in Cassandra.

Sometimes an index must be "synced" with the data currently stored in Cassandra. Simply updating/re-indexing the index using the data currently stored in Cassandra will not do. Sure, lots of documents will be more up to date, but the index will still contain obsolete/dirty data (data that no longer exists in Cassandra).

The preferred solution in most of our cases is to completely clear the index of all documents and then re-index it using the data currently stored in Cassandra. Lucene provides at least two ways to delete all documents from an index using a Lucene IndexWriter:

  1. Call IndexWriter.deleteDocuments(Query) with a MatchAllDocsQuery
  2. Call IndexWriter.deleteAll() (Lucene 2.9.0 and later)

Neither is supported when using the Lucandra IndexWriter.

To delete all documents using Lucandra, we first presumed it could be done like this:

  1. Build a massive Boolean query with all known values for a specific field
  2. Call IndexWriter.deleteDocuments(Query) passing the massive Boolean

However, we found that this works only if the index contains at most 1000 documents. The deleteDocuments(Query) method executes a search to find all documents to be removed (IndexWriter#271), and the search result contains at most 1000 hits.
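
As a rough illustration of the cap described above, here is a small self-contained Java model. It is not Lucandra code: CappedDeleteIndex, its methods, and the use of a Predicate in place of a Lucene Query are all hypothetical stand-ins, and the limit of 1000 is taken from the behaviour reported in this issue.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Predicate;

/** Toy in-memory stand-in for an index (hypothetical, not Lucandra code). */
class CappedDeleteIndex {
    /** Assumed hit limit of the search that runs inside delete-by-query. */
    static final int MAX_HITS = 1000;

    private final List<String> docs = new ArrayList<>();

    void add(String doc) {
        docs.add(doc);
    }

    int size() {
        return docs.size();
    }

    /**
     * Deletes documents matching the query, but only as many as the
     * internal "search" returns, i.e. at most MAX_HITS per call.
     * Returns the number of documents actually deleted.
     */
    int deleteDocuments(Predicate<String> query) {
        int deleted = 0;
        Iterator<String> it = docs.iterator();
        while (it.hasNext() && deleted < MAX_HITS) {
            if (query.test(it.next())) {
                it.remove();
                deleted++;
            }
        }
        return deleted;
    }
}
```

With 2500 matching documents in such an index, a single deleteDocuments call removes 1000 of them and leaves 1500 behind, which matches what we observed.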

To delete all documents "for real" using Lucandra we have to:

  1. Build a massive Boolean query with all known values for a specific field
  2. Find out how many deletions are necessary by querying the index with a Lucandra Searcher and a large enough "max hits to return" value.
  3. Call IndexWriter.deleteDocuments(Query) passing the massive Boolean query several times (result of step 2 / 1000 + 1)
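
Steps 2 and 3 above can be sketched roughly as follows. This is only a model of the workaround, not the real Lucandra API: the Index interface, countHits, InMemoryIndex, and the Predicate used in place of a Lucene Query are hypothetical names introduced for illustration, and the per-call limit of 1000 is the one reported in this issue.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Predicate;

class BatchedDelete {
    /** Assumed number of documents a single delete-by-query call can remove. */
    static final int MAX_HITS_PER_DELETE = 1000;

    /** Hypothetical minimal view of a searcher plus writer. */
    interface Index {
        /** Counts hits for the query, returning at most maxHits. */
        int countHits(Predicate<String> query, int maxHits);

        /** Removes at most MAX_HITS_PER_DELETE matching documents. */
        void deleteDocuments(Predicate<String> query);
    }

    /** Step 2: count the hits. Step 3: delete (hits / 1000 + 1) times. Returns the number of passes. */
    static int deleteAllMatching(Index index, Predicate<String> query, int maxHitsToCount) {
        int hits = index.countHits(query, maxHitsToCount);
        int passes = hits / MAX_HITS_PER_DELETE + 1;
        for (int i = 0; i < passes; i++) {
            index.deleteDocuments(query);
        }
        return passes;
    }

    /** In-memory stand-in used only to exercise the loop. */
    static final class InMemoryIndex implements Index {
        final List<String> docs = new ArrayList<>();

        @Override
        public int countHits(Predicate<String> query, int maxHits) {
            int hits = 0;
            for (String doc : docs) {
                if (hits == maxHits) {
                    break;
                }
                if (query.test(doc)) {
                    hits++;
                }
            }
            return hits;
        }

        @Override
        public void deleteDocuments(Predicate<String> query) {
            int deleted = 0;
            Iterator<String> it = docs.iterator();
            while (it.hasNext() && deleted < MAX_HITS_PER_DELETE) {
                if (query.test(it.next())) {
                    it.remove();
                    deleted++;
                }
            }
        }
    }
}
```

Note that the count in step 2 is only correct if maxHitsToCount is at least as large as the number of matching documents; otherwise too few passes are made and documents remain in the index.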

In our opinion, the Lucandra IndexWriter (and the Lucandra IndexSearcher) have some important issues that need to be handled. See SHORT above for suggested improvements.

tjake commented 14 years ago

Thanks for the feedback! The good news is the v2 of Lucandra, which is about to be pushed, handles these issues. Specifically MatchAllDocs delete.

Are you opposed to using Solr?

magloven commented 14 years ago

Great! Looking forward to a v2 of Lucandra.

We're migrating our portal application to use Cassandra as the backend, and we've been using Lucene ever since it was created. We haven't really looked that much at Solr, since Lucene does the job and we have implemented all the functions and UIs we need.

Will for sure check out Solr sometime in the future.