Open ssured opened 11 years ago
Yes I saw the article, very interesting read.
I too was interested in @garysieling's approach to merging two indexes that were built separately. This is something I hadn't considered for lunr, but it sounds like there are a number of interesting use cases that might be worth exploring. I'd be very interested in seeing lunr working as a full text index in CouchDB!
I was thinking a little about an implementation for this yesterday evening:
Add a `lunr.Index.prototype.concat` method. This would work similarly to `Array.prototype.concat`, taking a variable number of other indexes and returning a new index that combines all the passed indexes.
The only edge cases I can think of are around the pipeline of each index. For this to work, the indexes being merged should all have identical pipelines; throwing an error if they do not all match is perhaps the way to handle this.
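To make the idea concrete, here is a minimal sketch of what such a `concat` could look like. It deliberately does not use lunr's internals: `makeIndex`, `samePipeline` and `concatIndexes` are hypothetical names, the "index" is a toy term-to-document-IDs map, and the pipeline is just a list of step names that is compared, not executed. Scoring data is ignored entirely.

```javascript
// Toy stand-in for an index. This is NOT lunr's internal structure.
// pipeline: array of step names (only compared, never run here).
// postings: Map of term -> Set of document IDs.
function makeIndex(pipeline, docs) {
  const postings = new Map();
  for (const [id, text] of Object.entries(docs)) {
    for (const term of text.toLowerCase().split(/\s+/).filter(Boolean)) {
      if (!postings.has(term)) postings.set(term, new Set());
      postings.get(term).add(id);
    }
  }
  return { pipeline: [...pipeline], postings };
}

// Two pipelines match when they list the same steps in the same order.
function samePipeline(a, b) {
  return a.length === b.length && a.every((step, i) => step === b[i]);
}

// Hypothetical concat: returns a new index and leaves the inputs untouched,
// in the spirit of Array.prototype.concat.
function concatIndexes(first, ...others) {
  const merged = { pipeline: [...first.pipeline], postings: new Map() };
  for (const index of [first, ...others]) {
    if (!samePipeline(first.pipeline, index.pipeline)) {
      throw new Error("Cannot concat indexes with different pipelines");
    }
    for (const [term, ids] of index.postings) {
      if (!merged.postings.has(term)) merged.postings.set(term, new Set());
      for (const id of ids) merged.postings.get(term).add(id);
    }
  }
  return merged;
}

const a = makeIndex(["trimmer", "stemmer"], { 1: "green plant" });
const b = makeIndex(["trimmer", "stemmer"], { 2: "green tree" });
const combined = concatIndexes(a, b);
```

Returning a fresh object rather than mutating the first index keeps the behaviour close to `Array.prototype.concat`, at the cost of copying every posting list.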
If you'd like to put together an implementation of this, that'd be great. I'd also be interested in seeing a proof of concept of this working with WebWorkers or in CouchDB.
Yeah, I made the assumption that the indexes/fields were the same, you definitely would want sanity checks around that. I'm also assuming that the combined indexes are disjoint - you don't have two documents with different data and the same ID, the same document in multiple indexes, etc. It's possible that code would work in one of those cases, I just never tested it.
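As a rough illustration of those sanity checks, the sketch below verifies that a set of indexes share the same fields and have no document IDs in common before any merge is attempted. The `{ fields, docIds }` shape and the name `assertMergeable` are assumptions made for the example, not part of lunr.

```javascript
// Hypothetical index summary: { fields: string[], docIds: Set<string> }.
// Throws if the fields differ or if any document ID appears twice;
// otherwise returns the total number of documents across all indexes.
function assertMergeable(indexes) {
  const [first, ...rest] = indexes;
  const fieldKey = (index) => JSON.stringify([...index.fields].sort());
  const expected = fieldKey(first);
  const seen = new Set(first.docIds);

  for (const index of rest) {
    if (fieldKey(index) !== expected) {
      throw new Error("Indexes were built with different fields");
    }
    for (const id of index.docIds) {
      if (seen.has(id)) {
        throw new Error(`Document "${id}" appears in more than one index`);
      }
      seen.add(id);
    }
  }
  return seen.size;
}

const left = { fields: ["title", "body"], docIds: new Set(["a", "b"]) };
const right = { fields: ["body", "title"], docIds: new Set(["c"]) };
const total = assertMergeable([left, right]);
```

Field order is ignored here; whether it should matter depends on how the real index stores per-field data.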
@olivernn, a `lunr.Index.prototype.concat` sounds really good. How would I build it? I have a case where I upload my index to the server. It's easier to query smaller chunks of the indexes and then merge the results somehow.
@olivernn I just had a look at it, and it seems to me that merging two indexes could be impossible. The reason is that the BM25 score includes the number of documents that contain the term, `n(q)`, in the `idf` function, and this changes as soon as you add a new document, invalidating the score.
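To illustrate the dependency, here is a small sketch using one common form of the BM25 `idf`. The exact formula in any given code base may differ, and the shard sizes below are made-up numbers chosen only for the example. The point is that the value computed inside either shard is not the value the merged index needs.

```javascript
// One common BM25-style idf (an assumption; implementations vary).
// docCount is the total number of documents N,
// docsWithTerm is n(q), the number of documents containing the term.
function idf(docCount, docsWithTerm) {
  return Math.log(1 + (docCount - docsWithTerm + 0.5) / (docsWithTerm + 0.5));
}

// Illustrative counts for two disjoint shards.
const shardA = { docCount: 100, docsWithTerm: 10 };
const shardB = { docCount: 50, docsWithTerm: 40 };

// For disjoint shards, both N and n(q) simply add up.
const merged = {
  docCount: shardA.docCount + shardB.docCount,
  docsWithTerm: shardA.docsWithTerm + shardB.docsWithTerm,
};

const idfA = idf(shardA.docCount, shardA.docsWithTerm);
const idfB = idf(shardB.docCount, shardB.docsWithTerm);
const idfMerged = idf(merged.docCount, merged.docsWithTerm);
```

Here `idfMerged` falls between `idfA` and `idfB`, so any score stored with a shard's own `idf` baked in would be stale after a merge.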
The two options remaining that I see would be to either:
`n(q)` and score again, which means you would have to recalculate all the factors included in scoring, which might be inefficient on its own.

I don't know your code base as well as you do, so please correct me if I'm wrong.
http://garysieling.com/blog/building-a-full-text-index-in-javascript
This page describes a way of merging two indexes into one. It's a great way to split work over several processors or machines and then merge the results. My guess is that once this kind of merge is supported, it will be easier to implement background indexing with WebWorkers.
My use case is to implement indexing in a NoSQL environment with map/reduce functionality. Maybe this could solve the absence of a full text index in CouchDB.
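A minimal sketch of the split-then-merge idea, in plain JavaScript and without lunr: documents are cut into chunks, each chunk is indexed on its own (the "map" step, which could in principle run in a WebWorker or a CouchDB view), and the partial results are folded together (the "reduce" step). All names here (`chunk`, `buildPartial`, `mergePartials`) are hypothetical, and the partial index is only a term-to-IDs map with no scoring.

```javascript
// Split an array into consecutive chunks of at most `size` items.
function chunk(items, size) {
  const out = [];
  for (let i = 0; i < items.length; i += size) out.push(items.slice(i, i + size));
  return out;
}

// "Map" step: build a partial index (term -> Set of doc IDs) for one chunk.
function buildPartial(docs) {
  const postings = new Map();
  for (const doc of docs) {
    for (const term of doc.text.toLowerCase().split(/\s+/).filter(Boolean)) {
      if (!postings.has(term)) postings.set(term, new Set());
      postings.get(term).add(doc.id);
    }
  }
  return postings;
}

// "Reduce" step: combine two partial indexes into a new one.
function mergePartials(left, right) {
  const out = new Map();
  for (const part of [left, right]) {
    for (const [term, ids] of part) {
      if (!out.has(term)) out.set(term, new Set());
      for (const id of ids) out.get(term).add(id);
    }
  }
  return out;
}

const docs = [
  { id: "a", text: "full text index" },
  { id: "b", text: "map reduce index" },
  { id: "c", text: "full text search" },
];

const partials = chunk(docs, 2).map(buildPartial);
const full = partials.reduce(mergePartials, new Map());
```

Because the merge is a plain union of posting sets, the result does not depend on how the documents were chunked, which is the property that makes the work safe to distribute.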