Closed UweSchmidt closed 10 years ago
this was introduced in 323af68f93070c6954314c7a2d92081a60e12315
Hunt.Common.DocIdSet.union
seems to be much much faster than mapToSet $ Occ.merge (setToMap s1) (setToMap s2)
.
mapToSet
and settToMap
are implemented with toList
+ fromList
conversions
@chrisreu: which bug did you fixed there?
I think this might have happened when i introduced the IndexValue
type.
The implementation is not ideal right now and i intended to refactor that, but run out of time then and wasn't aware of the performance issues i introduced with the temporary solution - sorry for that.
I am pretty sure the problem is either the deeseq
call there, which i just removed, or the IndexValue
instance for the DocIdSet
. This implementation basically converts every value from DocIdSet
to Occurrence
and performs the diff or merge on the resulting Occurrences
. The only reason for that really was, that i had no internet access and wasn't able to access hackage for documentation.
@sebastian-philipp
And yes you are right, if fromList
and toList
are used, that might be a problem as well. But that would only hold for the ContextType
s using DocIdSet
as values.
Edit: Haha, i just realized that i wrote the settToMap
and mapToSet
functions. So - yes - that was not a good idea.
I believe the current Hayoo Contexttype are not using this yet, right? I believe the missplaced deepseq was the problem. Can anyone verifiy that?
We looked into it again and it seems like the DocIdSet
instances actually are used in Hayoo for the Date ContextType. So the IndexType
implementation for DocIdSet
is the issue.
I'll fix that as soon as i'm out of the campgrounds.
The abstraction form the fixed Occurrences
type to a variable IndexValue
is a critical step. The performance of Hunt is based on very efficient operations on the Occurrences
type. For an abstraction form that type we have to analyze, what operations are currently performed on occurrences, and these have to be placed one-to-one as implementations into an instance of a class IndexValue
.
I'm not sure, whether the current design is the ideal one. The conversion functions seem to introduce some overhead, and IntermediateValue
is declared as data
, not as newtype
.
The current status is a real showstopper for Hayoo and Hunt. Therefore I would prefer to experiment with the abstraction from Occurrences to IndexValues in an extra git branch, not in the master branch.
For Hayoo we can live with the unflexible fixed Occurrence solution, in the long run the IndexValue is of cause the better approach (as long as there are no performance issues). And an extra branch for that gives us time to experiment with a clean and efficient IndexValue
design without breaking our running system.
I've added a branch no-IndexValue
, that starts with commit 30ecccf
and contains those commits not changing the index types of Hunt. With this branch indexing Hayoo is running again.
I developed those changes in an extra branch, but then merged it after it appeared to be good. Since i'm not currently able to build Hayoo on my machine i only checked the smaller jokes dataset. With that dataset the added insertion time did unfortunately not show.
I removed the IndexValue
instance for DocIdSet
for now, until i find a working implementation.
I still kept the IntermediateValue
abstraction, but turned it into a newtype declaration.
Maybe someone could check if this fix makes Hunt work again for Hayoo with the master-branch?
Your commit is working for a subset of Hayoo. It is as fast as the last commit without the performance issue.
I've added a small benchmark to verify the performance of the "textSimple" 'ContextType' and the 'IndexValue' / 'IntermediateValue' changes in general. I think we do not need the separate branch anymore to build hayoo.
Benchmarks show similar or better performance for insert (with unions) and deletes (with diffs). It uses less memory as well. The Benchmark is checked in with my latest commit. I uploaded the results here: http://huntsearch.org/bench/index.html
The date, geo and numeric indexes are using this implementation per default, since positions are not needed there anyway. I believe, if we use this context type in hayoo for certain contexts that do need phrase search, we could reduce the memory footprint.
I've tested the master branch with the indexer moved from Holumbus to Hunt, and it worked fine, no diff to the old running version, so the branch is obsolete.
Help! Performance bug in
ContextIndex
: When building the Hayoo index, insertion of docs has become so slow, that's impossible to build the index. When inserting a bunch of documents withinsertList
the server does not respond anymore. But it's not an infinite loop. Extra trace output shows, the time is spend inContextIndex.inserList
. After inserting a few Haddock pages, e.g. for 50 packages, insertion of a single function lasts about half a second, after a few packages more, it's longer than a second.It's not a memory problem or space leak, space doesn't grow.
There was a refactoring of the
ContextIndex
. Could that be a possible reason?