hunt-framework / hunt

A flexible, lightweight search platform
59 stars 10 forks source link

insertList doesn not merge existing occurences with new ones #67

Closed UweSchmidt closed 10 years ago

UweSchmidt commented 10 years ago

In a search for the occurrences of a word, just the 1. document indexed containing this word is found. The reason is, that the Occurrence maps always contain a single document id. This happens because insertList does a simple union on the words in a StringMap and the 1. arg wins, e.g. for PrefixTree instance:

  insertList kvs (DmPT pt) =
    mkDmPT $ SM.union pt (SM.fromList kvs)

here a unionWith must be called with merge from Occurences.

insertList acts like unionWith in the Index Class. So it must get a merge op as parameter like unionWith. So all calls of insertList for inserting words must be changed.

UweSchmidt commented 10 years ago

Things get better. With the above sketched modification of insertList insertion works a little bit more. If we insert 3 packages with a single json command, only a single package is found when looking for type:package. But when these packages are added separately wtih 3 insert commands we get 3 results when searching for type:package.

Is there anything wrong with the new parallel insertion? with batchAddWordsM?

ulfsauer0815 commented 10 years ago

This bug was probably introduced with the recent change that moved the compressed indexes into a separate module. The implementation does not use the ComprOccPrefixTree anymore, which merges the Occurrences, but the old DmPrefixTree implementation.

An easy fix would be to either

I did not check batchAddWordsM but I might have a coding environment set up this weekend and have a look at it.

UweSchmidt commented 10 years ago

Changes for insertList are done (locally, not yet on github), also in hunt-compression. There is still an old batchAddWords, I will try that for localizing the error.

sebastian-philipp commented 10 years ago

A cause for this bug might be some code duplication between Index implementations for basic data structures and I think that I've introduced the same bug in the Rtree-Index again.

We have 5 instances:

and 3 of these 5 Instances are buggy regarding the merge of Occurrences in insertList.

UweSchmidt commented 10 years ago

The bug is located. It sits in the last line of this code part from Hunt.Index.PrefixTreeIndex

instance Index DmPrefixTree where
  ...
  insertList op kvs (DmPT pt) =
    mkDmPT $ SM.unionWith op pt (SM.fromList kvs)

insertList is called with a list of pairs (words, occurrences), but the words are not always different, e.g. this list [("package", occ1), ("package", occ2)] is inserted, and this leads to a single entry for package with occ2 as value due to SM.fromList where the 2. pair wins.

This bug will be crushed soon.

UweSchmidt commented 10 years ago

Bug with lost docIds removed, changes in insertList integrated with RTree impl. All stuff pushed on github.