hunt-framework / hunt

A flexible, lightweight search platform

Memory profiling #44

Closed chrisreu closed 10 years ago

chrisreu commented 10 years ago

Memory profiling

This issue is about the question of whether ByteString serialization is a good or a bad thing. We really had a lot of trouble with it, and despite certain optimizations it did not seem to improve the overall index size.

Setup: I always used the same stack of indexes.

```haskell
newtype InvertedIndex _v = InvIx { invIx :: KeyProxyIndex Text ComprOccPrefixTree CompressedOccurrences }
```

I only exchanged the CompressedOccurrences type.
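
For orientation, here is a minimal sketch of what the three measured variants roughly look like. The type and function names are illustrative only, not the actual hunt definitions (the later runs use a ShortByteString-based variant, "bytestringShort"), and the Binary/bzlib round trip stands in for whatever serialization hunt actually uses:

```haskell
import qualified Codec.Compression.BZip as BZip
import           Data.Binary            (decode, encode)
import qualified Data.ByteString.Lazy   as BL
import qualified Data.IntMap.Strict     as IM
import qualified Data.IntSet            as IS

-- Plain occurrences: DocId -> set of word positions.
type Occurrences = IM.IntMap IS.IntSet

-- Variant "only bytestring": keep the value serialized.
newtype SerializedOcc = SerializedOcc BL.ByteString

mkSerialized :: Occurrences -> SerializedOcc
mkSerialized = SerializedOcc . encode

unSerialized :: SerializedOcc -> Occurrences
unSerialized (SerializedOcc bs) = decode bs

-- Variant "bzip bytestring": additionally compress the serialized value.
newtype BZippedOcc = BZippedOcc BL.ByteString

mkBZipped :: Occurrences -> BZippedOcc
mkBZipped = BZippedOcc . BZip.compress . encode

unBZipped :: BZippedOcc -> Occurrences
unBZipped (BZippedOcc bs) = decode (BZip.decompress bs)
```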

Test 1: 3000 documents with 200 words each (25 MB JSON)

I compared the serialized files on disk.

| Compression | Size on disk | Size in memory |
|---|---|---|
| no compression | 49 MB | 260 MB |
| only bytestring | 49.1 MB | 83 MB |
| bzip bytestring | 27.1 MB | 73 MB |

Conclusion

The tests were run with profiling, and the profiling results look okay. It seems the extra compression makes sense after all.

While executing this benchmark a strictness bug regarding lazy Occurrences was fixed.
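
The actual patch is not shown here; as a hypothetical illustration of the kind of strictness problem meant, switching from the lazy to the strict IntMap API when positions are added avoids thunks accumulating per DocId:

```haskell
import qualified Data.IntMap.Strict as IM  -- strict API: values are forced on insert
import qualified Data.IntSet        as IS

type Occurrences = IM.IntMap IS.IntSet

-- Hypothetical insert of one (DocId, Position) pair. With the lazy IntMap API
-- the combined value stays an unevaluated 'IS.union' thunk per DocId and the
-- heap profile grows; the strict API evaluates it immediately.
insertOccurrence :: Int -> Int -> Occurrences -> Occurrences
insertOccurrence docId pos =
  IM.insertWith IS.union docId (IS.singleton pos)
```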

The profiling results confirm the benchmark: the profiled values match the corresponding memory footprints shown above.

ByteString without further compression is only a bit bigger than the compressed variant. This makes sense, since we are dealing with small values here. The conversion to ByteString also takes more time than the compression applied afterwards.

The uncompressed result is as big as it is because of the hashed DocIds within the IntMap. The positions within the IntSet are not a factor.

chrisreu commented 10 years ago

Same benchmark with the complete Hayoo index:

| Compression | Size on disk | Size in memory |
|---|---|---|
| no compression without IntSet cache | 379 MB | 2 GB |
| no compression with cache | 379 MB | 1.7 GB |
| no compression with cache and BTree | 379 MB | 1.1 GB |
| only bytestringShort | 386 MB | 890 MB |
| bzip bytestringShort | 286 MB | 780 MB |
| bzip bytestringShort with BTree | 285 MB | 755 MB |

chrisreu commented 10 years ago

The previous results are not really suitable for use in the thesis, so this new benchmark was set up.

The Benchmark

The benchmark uses 2 contexts. This way the effect of the batch operations on multiple cores can be seen. The benchmark is run with 2 cores.

We are using a random data set in this benchmark, consisting of 3000 documents. Each document inserts 300 words into two contexts and 400 into another two. The words are randomly drawn from a dictionary of about 3500 words.

No values are stored within the document table; this way we measure only the real index size. ->show benchmark code<-
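
The benchmark code itself is linked above; as a rough sketch of how such a data set can be generated (helper names and context labels are made up for illustration):

```haskell
import           Control.Monad (replicateM)
import qualified Data.Text     as T
import           System.Random (randomRIO)

-- Pick n words uniformly at random from a fixed dictionary.
randomWords :: [T.Text] -> Int -> IO [T.Text]
randomWords dict n =
  replicateM n $ (dict !!) <$> randomRIO (0, length dict - 1)

-- One synthetic document: 300 words for two contexts, 400 for the other two.
-- The document table stays empty; only the index is filled.
randomDocument :: [T.Text] -> IO [(T.Text, [T.Text])]
randomDocument dict = do
  ws <- mapM (randomWords dict) [300, 300, 400, 400]
  return (zip ["context1", "context2", "context3", "context4"] ws)
```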

The Results

todo:

- another run on a non-empty index -> compare
- try to build a benchmark for the complex "phrase" query, since this is the only one really influenced by the compressed Occurrences (the only one using the positions); maybe good results could be achieved (see the position sketch below)
- do benchmarks for alternate data structures as well ... but ... not sure how to argue for their usage, since bytestring is ultra fast now ...
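
To illustrate the phrase-query point from the list above: it is the only query type that actually needs the positions inside the Occurrences, roughly like this (a simplified sketch, not the hunt implementation):

```haskell
import qualified Data.IntSet as IS

-- A phrase match of two adjacent query words inside one document: some
-- position p of the first word must be followed by p+1 in the second
-- word's position set. All other query types only need the DocIds.
adjacentIn :: IS.IntSet -> IS.IntSet -> Bool
adjacentIn ps1 ps2 = any (\p -> IS.member (p + 1) ps2) (IS.toList ps1)
```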

Simple Occurrences

1 core

| Index | Memory | Insert | BatchInsert | Delete | BatchDelete |
|---|---|---|---|---|---|
| no compression | 285 MB | 45.5s | 28.5s | 2.3s | 0.16s |
| only bytestringShort | 80 MB | 405s | 30s | 128s | 2.3s |
| bzip bytestringShort | 42 MB | 3163s | 48s | 1178s | 24s |

4 cores

| Index | Memory | Insert | BatchInsert | Delete | BatchDelete |
|---|---|---|---|---|---|
| no compression | 285 MB | 11.7s | 9s | 0.7s | 0.075s |
| only bytestringShort | 80 MB | 132s | 9.6s | 37.2s | 0.7s |
| bzip bytestringShort | 42 MB | 1182.6s | 16.7s | 545.8s | 11.2s |
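
The gap between the Insert and BatchInsert columns for the ByteString variants is plausible given the round trip a single insert has to take on a serialized value; a sketch under the same assumptions as the first comment (names are illustrative, not the hunt code):

```haskell
import           Data.Binary          (decode, encode)
import qualified Data.ByteString.Lazy as BL
import qualified Data.IntMap.Strict   as IM
import qualified Data.IntSet          as IS

-- Inserting one (DocId, Position) pair into a serialized value means decoding
-- the whole Occurrences, modifying it and re-encoding it; batch operations
-- amortise that round trip over many pairs.
insertSerialized :: Int -> Int -> BL.ByteString -> BL.ByteString
insertSerialized docId pos bs =
  encode (IM.insertWith IS.union docId (IS.singleton pos) occ)
  where
    occ :: IM.IntMap IS.IntSet
    occ = decode bs
```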

Optimizing the uncompressed

| Index | Memory | Insert | BatchInsert | Delete | BatchDelete |
|---|---|---|---|---|---|
| IntSet cache | - | - | - | - | - |
| cache with bintree | - | - | - | - | - |