Closed UweSchmidt closed 10 years ago
Things get better. With the above sketched modification of insertList insertion works a little bit more. If we insert 3 packages with a single json command, only a single package is found when looking for type:package. But when these packages are added separately wtih 3 insert commands we get 3 results when searching for type:package.
Is there anything wrong with the new parallel insertion? with batchAddWordsM?
This bug was probably introduced with the recent change that moved the compressed indexes into a separate module.
The implementation does not use the ComprOccPrefixTree
anymore, which merges the Occurrences
, but the old DmPrefixTree
implementation.
An easy fix would be to either
ComprOccPrefixTree
and Occurrences.Compression
back to the main package and use the OccCompression Occurrences
instance (identity) orComprOccPrefixTree
to DmPrefixTree
, remove the compressOcc
/decompressOcc
functions and replace the OccCompression v
constraint in ICon
with v ~ Occurrences
.I did not check batchAddWordsM
but I might have a coding environment set up this weekend and have a look at it.
Changes for insertList
are done (locally, not yet on github), also in hunt-compression.
There is still an old batchAddWords
, I will try that for localizing the error.
A cause for this bug might be some code duplication between Index implementations for basic data structures and I think that I've introduced the same bug in the Rtree-Index again.
We have 5 instances:
and 3 of these 5 Instances are buggy regarding the merge of Occurrences in insertList.
The bug is located.
It sits in the last line of this code part
from Hunt.Index.PrefixTreeIndex
instance Index DmPrefixTree where
...
insertList op kvs (DmPT pt) =
mkDmPT $ SM.unionWith op pt (SM.fromList kvs)
insertList
is called with a list of pairs (words, occurrences),
but the words are not always different, e.g. this list
[("package", occ1), ("package", occ2)]
is inserted, and this
leads to a single entry for package
with occ2 as value due to
SM.fromList
where the 2. pair wins.
This bug will be crushed soon.
Bug with lost docIds removed, changes in insertList integrated with RTree impl. All stuff pushed on github.
In a search for the occurrences of a word, just the 1. document indexed containing this word is found. The reason is, that the Occurrence maps always contain a single document id. This happens because insertList does a simple union on the words in a StringMap and the 1. arg wins, e.g. for PrefixTree instance:
here a unionWith must be called with merge from Occurences.
insertList acts like unionWith in the Index Class. So it must get a merge op as parameter like unionWith. So all calls of insertList for inserting words must be changed.