hunt-framework / hunt

A flexible, lightweight search platform
59 stars 10 forks source link

Performance Bug in ContextIndex #93

Closed UweSchmidt closed 10 years ago

UweSchmidt commented 10 years ago

Help! Performance bug in ContextIndex: When building the Hayoo index, insertion of docs has become so slow, that's impossible to build the index. When inserting a bunch of documents with insertList the server does not respond anymore. But it's not an infinite loop. Extra trace output shows, the time is spend in ContextIndex.inserList. After inserting a few Haddock pages, e.g. for 50 packages, insertion of a single function lasts about half a second, after a few packages more, it's longer than a second.

It's not a memory problem or space leak, space doesn't grow.

There was a refactoring of the ContextIndex. Could that be a possible reason?

sebastian-philipp commented 10 years ago

this was introduced in 323af68f93070c6954314c7a2d92081a60e12315

sebastian-philipp commented 10 years ago

Hunt.Common.DocIdSet.union seems to be much much faster than mapToSet $ Occ.merge (setToMap s1) (setToMap s2).

mapToSet and settToMap are implemented with toList+ fromList conversions

sebastian-philipp commented 10 years ago

@chrisreu: which bug did you fixed there?

chrisreu commented 10 years ago

I think this might have happened when i introduced the IndexValue type.

The implementation is not ideal right now and i intended to refactor that, but run out of time then and wasn't aware of the performance issues i introduced with the temporary solution - sorry for that.

I am pretty sure the problem is either the deeseq call there, which i just removed, or the IndexValue instance for the DocIdSet. This implementation basically converts every value from DocIdSet to Occurrence and performs the diff or merge on the resulting Occurrences. The only reason for that really was, that i had no internet access and wasn't able to access hackage for documentation.

chrisreu commented 10 years ago

@sebastian-philipp And yes you are right, if fromList and toList are used, that might be a problem as well. But that would only hold for the ContextTypes using DocIdSet as values. Edit: Haha, i just realized that i wrote the settToMap and mapToSet functions. So - yes - that was not a good idea.

I believe the current Hayoo Contexttype are not using this yet, right? I believe the missplaced deepseq was the problem. Can anyone verifiy that?

https://github.com/hunt-framework/hunt/blob/master/hunt-searchengine/src/Hunt/Common/IntermediateValue.hs#L63

chrisreu commented 10 years ago

We looked into it again and it seems like the DocIdSet instances actually are used in Hayoo for the Date ContextType. So the IndexType implementation for DocIdSet is the issue.

I'll fix that as soon as i'm out of the campgrounds.

UweSchmidt commented 10 years ago

The abstraction form the fixed Occurrences type to a variable IndexValue is a critical step. The performance of Hunt is based on very efficient operations on the Occurrences type. For an abstraction form that type we have to analyze, what operations are currently performed on occurrences, and these have to be placed one-to-one as implementations into an instance of a class IndexValue.

I'm not sure, whether the current design is the ideal one. The conversion functions seem to introduce some overhead, and IntermediateValue is declared as data, not as newtype.

The current status is a real showstopper for Hayoo and Hunt. Therefore I would prefer to experiment with the abstraction from Occurrences to IndexValues in an extra git branch, not in the master branch. For Hayoo we can live with the unflexible fixed Occurrence solution, in the long run the IndexValue is of cause the better approach (as long as there are no performance issues). And an extra branch for that gives us time to experiment with a clean and efficient IndexValue design without breaking our running system.

UweSchmidt commented 10 years ago

I've added a branch no-IndexValue, that starts with commit 30ecccf and contains those commits not changing the index types of Hunt. With this branch indexing Hayoo is running again.

chrisreu commented 10 years ago

I developed those changes in an extra branch, but then merged it after it appeared to be good. Since i'm not currently able to build Hayoo on my machine i only checked the smaller jokes dataset. With that dataset the added insertion time did unfortunately not show.

I removed the IndexValue instance for DocIdSet for now, until i find a working implementation.

I still kept the IntermediateValue abstraction, but turned it into a newtype declaration.

Maybe someone could check if this fix makes Hunt work again for Hayoo with the master-branch?

sebastian-philipp commented 10 years ago

Your commit is working for a subset of Hayoo. It is as fast as the last commit without the performance issue.

chrisreu commented 10 years ago

I've added a small benchmark to verify the performance of the "textSimple" 'ContextType' and the 'IndexValue' / 'IntermediateValue' changes in general. I think we do not need the separate branch anymore to build hayoo.

Benchmarks show similar or better performance for insert (with unions) and deletes (with diffs). It uses less memory as well. The Benchmark is checked in with my latest commit. I uploaded the results here: http://huntsearch.org/bench/index.html

The date, geo and numeric indexes are using this implementation per default, since positions are not needed there anyway. I believe, if we use this context type in hayoo for certain contexts that do need phrase search, we could reduce the memory footprint.

UweSchmidt commented 10 years ago

I've tested the master branch with the indexer moved from Holumbus to Hunt, and it worked fine, no diff to the old running version, so the branch is obsolete.