hunt-framework / hunt

A flexible, lightweight search platform

Memory profiling #44

Closed chrisreu closed 10 years ago

chrisreu commented 10 years ago

Memory profiling

This issue is about the question of whether ByteString serialization is a good or a bad thing. We really had a lot of trouble with it, and despite certain optimizations it did not seem to improve the overall index size.

Setup: I always used the same stack of indexes.

```haskell
newtype InvertedIndex _v = InvIx { invIx :: KeyProxyIndex Text ComprOccPrefixTree CompressedOccurrences }
```

I only exchanged the CompressedOccurrences type.
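
For orientation, here is a minimal sketch of what the three measured variants roughly look like. The type and function names are illustrative only, not the actual hunt definitions (the later runs use a ShortByteString-based variant, "bytestringShort"), and the Binary/bzlib round trip stands in for whatever serialization hunt actually uses:

```haskell
import qualified Codec.Compression.BZip as BZip
import           Data.Binary            (decode, encode)
import qualified Data.ByteString.Lazy   as BL
import qualified Data.IntMap.Strict     as IM
import qualified Data.IntSet            as IS

-- Plain occurrences: DocId -> set of word positions.
type Occurrences = IM.IntMap IS.IntSet

-- Variant "only bytestring": keep the value serialized.
newtype SerializedOcc = SerializedOcc BL.ByteString

mkSerialized :: Occurrences -> SerializedOcc
mkSerialized = SerializedOcc . encode

unSerialized :: SerializedOcc -> Occurrences
unSerialized (SerializedOcc bs) = decode bs

-- Variant "bzip bytestring": additionally compress the serialized value.
newtype BZippedOcc = BZippedOcc BL.ByteString

mkBZipped :: Occurrences -> BZippedOcc
mkBZipped = BZippedOcc . BZip.compress . encode

unBZipped :: BZippedOcc -> Occurrences
unBZipped (BZippedOcc bs) = decode (BZip.decompress bs)
```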

Test 1: 3000 documents with 200 words each (25 MB JSON)

I compared the serialized files on disk.

| Compression | Size on disk | Size in memory |
|---|---|---|
| no compression | 49 MB | 260 MB |
| only bytestring | 49.1 MB | 83 MB |
| bzip bytestring | 27.1 MB | 73 MB |

Conclusion

The tests were run with profiling, and the profiling results look okay. It seems the extra compression makes sense after all.

While executing this benchmark a strictness bug regarding lazy Occurrences was fixed.
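
The actual patch is not shown here; as a hypothetical illustration of the kind of strictness problem meant, switching from the lazy to the strict IntMap API when positions are added avoids thunks accumulating per DocId:

```haskell
import qualified Data.IntMap.Strict as IM  -- strict API: values are forced on insert
import qualified Data.IntSet        as IS

type Occurrences = IM.IntMap IS.IntSet

-- Hypothetical insert of one (DocId, Position) pair. With the lazy IntMap API
-- the combined value stays an unevaluated 'IS.union' thunk per DocId and the
-- heap profile grows; the strict API evaluates it immediately.
insertOccurrence :: Int -> Int -> Occurrences -> Occurrences
insertOccurrence docId pos =
  IM.insertWith IS.union docId (IS.singleton pos)
```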

The profiling results confirm the benchmark: the profiled values match the corresponding memory footprints shown above.

ByteString without further compression is only a bit bigger than the compressed variant. This makes sense, since we are dealing with small values here. The conversion to ByteString also takes more time than the compression applied afterwards.

The uncompressed result is as big as it is because of the hashed DocIds within the IntMap. The positions within the IntSet are not a factor.

chrisreu commented 10 years ago

Same benchmark with the complete Hayoo index:

| Compression | Size on disk | Size in memory |
|---|---|---|
| no compression without IntSet cache | 379 MB | 2 GB |
| no compression with cache | 379 MB | 1.7 GB |
| no compression with cache and BTree | 379 MB | 1.1 GB |
| only bytestringShort | 386 MB | 890 MB |
| bzip bytestringShort | 286 MB | 780 MB |
| bzip bytestringShort with BTree | 285 MB | 755 MB |

chrisreu commented 10 years ago

The previous results are not really suitable for use in the thesis, so this new benchmark was set up.

The Benchmark

The benchmark uses 2 contexts. This way the effect of the batch operations on multiple cores can be seen. The benchmark is run with 2 cores.

We are using a random data set in this benchmark, consisting of 3000 documents. Each document inserts 300 words into two contexts and 400 into another two. The words are randomly drawn from a dictionary of about 3500 words.

No values are stored within the document table; this way we measure only the real index size. ->show benchmark code<-
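
The benchmark code itself is linked above; as a rough sketch of how such a data set can be generated (helper names and context labels are made up for illustration):

```haskell
import           Control.Monad (replicateM)
import qualified Data.Text     as T
import           System.Random (randomRIO)

-- Pick n words uniformly at random from a fixed dictionary.
randomWords :: [T.Text] -> Int -> IO [T.Text]
randomWords dict n =
  replicateM n $ (dict !!) <$> randomRIO (0, length dict - 1)

-- One synthetic document: 300 words for two contexts, 400 for the other two.
-- The document table stays empty; only the index is filled.
randomDocument :: [T.Text] -> IO [(T.Text, [T.Text])]
randomDocument dict = do
  ws <- mapM (randomWords dict) [300, 300, 400, 400]
  return (zip ["context1", "context2", "context3", "context4"] ws)
```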

The Results

todo:

- another run on a non-empty index -> compare
- try to build a benchmark for the complex "phrase" query, since this is the only one really influenced by the compressed Occurrences (the only one using the positions); maybe good results could be achieved (see the position sketch below)
- do benchmarks for alternate data structures as well ... but ... not sure how to argue for their usage, since bytestring is ultra fast now ...
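
To illustrate the phrase-query point from the list above: it is the only query type that actually needs the positions inside the Occurrences, roughly like this (a simplified sketch, not the hunt implementation):

```haskell
import qualified Data.IntSet as IS

-- A phrase match of two adjacent query words inside one document: some
-- position p of the first word must be followed by p+1 in the second
-- word's position set. All other query types only need the DocIds.
adjacentIn :: IS.IntSet -> IS.IntSet -> Bool
adjacentIn ps1 ps2 = any (\p -> IS.member (p + 1) ps2) (IS.toList ps1)
```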

Simple Occurrences

1 core

| Index | Memory | Insert | BatchInsert | Delete | BatchDelete |
|---|---|---|---|---|---|
| no compression | 285 MB | 45.5s | 28.5s | 2.3s | 0.16s |
| only bytestringShort | 80 MB | 405s | 30s | 128s | 2.3s |
| bzip bytestringShort | 42 MB | 3163s | 48s | 1178s | 24s |

4 cores

| Index | Memory | Insert | BatchInsert | Delete | BatchDelete |
|---|---|---|---|---|---|
| no compression | 285 MB | 11.7s | 9s | 0.7s | 0.075s |
| only bytestringShort | 80 MB | 132s | 9.6s | 37.2s | 0.7s |
| bzip bytestringShort | 42 MB | 1182.6s | 16.7s | 545.8s | 11.2s |
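
The gap between the Insert and BatchInsert columns for the ByteString variants is plausible given the round trip a single insert has to take on a serialized value; a sketch under the same assumptions as the first comment (names are illustrative, not the hunt code):

```haskell
import           Data.Binary          (decode, encode)
import qualified Data.ByteString.Lazy as BL
import qualified Data.IntMap.Strict   as IM
import qualified Data.IntSet          as IS

-- Inserting one (DocId, Position) pair into a serialized value means decoding
-- the whole Occurrences, modifying it and re-encoding it; batch operations
-- amortise that round trip over many pairs.
insertSerialized :: Int -> Int -> BL.ByteString -> BL.ByteString
insertSerialized docId pos bs =
  encode (IM.insertWith IS.union docId (IS.singleton pos) occ)
  where
    occ :: IM.IntMap IS.IntSet
    occ = decode bs
```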

Optimizing the uncompressed

| Index | Memory | Insert | BatchInsert | Delete | BatchDelete |
|---|---|---|---|---|---|
| IntSet cache | - | - | - | - | - |
| cache with bintree | - | - | - | - | - |