chrisreu closed this issue 10 years ago
Same benchmark with complete Hayoo index:
Compression | Size on disk | Size in memory |
---|---|---|
no compression, without intset cache | 379 MB | 2 GB |
no compression, with cache | 379 MB | 1.7 GB |
no compression, with cache and BTree | 379 MB | 1.1 GB |
only bytestringShort | 386 MB | 890 MB |
bzip bytestringShort | 286 MB | 780 MB |
bzip bytestringShort with BTree | 285 MB | 755 MB |
The previous results are not really suitable for use in the thesis; that's why this new benchmark was done.

The benchmark uses 2 contexts; this way the effect of the batch operations on multiple cores can be seen. The benchmark is run with 2 cores.

We use a random data set of 3000 documents in this benchmark. Each document inserts 300 words into two contexts and 400 into another two. The words are randomly generated from a dictionary of about 3500 words.

No values are stored within the document table; this way we measure only the real index size (see the benchmark code).
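The data generation described above can be sketched roughly as follows. This is a minimal standalone sketch with illustrative names and a toy LCG random source, not the actual benchmark code:

```haskell
-- Sketch of the benchmark data set: 3000 documents, each inserting
-- 300 words into two contexts and 400 into another two, drawn from a
-- dictionary of 3500 words. Names and the LCG are illustrative.

-- simple deterministic linear congruential generator as a stand-in
-- for a real random source
lcg :: Int -> Int
lcg s = (s * 1103515245 + 12345) `mod` 2147483648

-- draw n words from a dictionary of dictSize entries
randomWords :: Int -> Int -> Int -> [String]
randomWords seed dictSize n =
  take n [ "word" ++ show (s `mod` dictSize)
         | s <- iterate lcg (lcg seed) ]

-- one document: 300 words for two contexts, 400 for the other two
mkDocument :: Int -> [(String, [String])]
mkDocument docId =
  [ ("context1", randomWords (docId * 4)     3500 300)
  , ("context2", randomWords (docId * 4 + 1) 3500 300)
  , ("context3", randomWords (docId * 4 + 2) 3500 400)
  , ("context4", randomWords (docId * 4 + 3) 3500 400)
  ]

-- the full data set: 3000 such documents
documents :: [[(String, [String])]]
documents = map mkDocument [1 .. 3000]
```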
todo:
- another run on a non-empty index, for comparison
- try to build a benchmark for a complex "phrase" query, since this is the only query really influenced by the compressed Occurrences (the only one using the positions); maybe good results could be achieved there
- do benchmarks for alternative data structures as well ... but ... not sure how to argue for their usage, since ByteString is ultra fast now ...
1 core

Index | Memory | Insert | BatchInsert | Delete | BatchDelete |
---|---|---|---|---|---|
no compression | 285 MB | 45.5s | 28.5s | 2.3s | 0.16s |
only bytestringShort | 80 MB | 405s | 30s | 128s | 2.3s |
bzip bytestringShort | 42 MB | 3163s | 48s | 1178s | 24s |
4 cores

Index | Memory | Insert | BatchInsert | Delete | BatchDelete |
---|---|---|---|---|---|
no compression | 285 MB | 11.7s | 9s | 0.7s | 0.075s |
only bytestringShort | 80 MB | 132s | 9.6s | 37.2s | 0.7s |
bzip bytestringShort | 42 MB | 1182.6s | 16.7s | 545.8s | 11.2s |
Index | Memory | Insert | BatchInsert | Delete | BatchDelete |
---|---|---|---|---|---|
intset cache | - | - | - | - | - |
cache with BTree | - | - | - | - | - |
Memory profiling
This issue is about the question whether ByteString serialization is a good or a bad thing. We had a lot of trouble with it, and despite certain optimizations it does not seem to improve the overall index size.
Setup: I always used the same stack of indexes.
```haskell
newtype InvertedIndex _v = InvIx { invIx :: KeyProxyIndex Text ComprOccPrefixTree CompressedOccurrences }
```
I only exchanged the CompressedOccurrences type.
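The idea behind the bytestring variants can be sketched like this. It is a minimal standalone sketch using `binary` and `containers`; the actual Hunt types carry more structure, and the type names below are illustrative:

```haskell
import           Data.Binary          (decode, encode)
import qualified Data.ByteString.Lazy as BL
import qualified Data.IntMap.Strict   as IM
import qualified Data.IntSet          as IS

-- occurrences map a DocId to the word's positions in that document
type Occurrences = IM.IntMap IS.IntSet

-- the "bytestringShort" idea in spirit: keep the occurrences in their
-- serialized form and only deserialize them on access
newtype ComprOccBs = ComprOccBs BL.ByteString

compressOcc :: Occurrences -> ComprOccBs
compressOcc = ComprOccBs . encode

decompressOcc :: ComprOccBs -> Occurrences
decompressOcc (ComprOccBs bs) = decode bs
```

The bzip variant additionally compresses the serialized ByteString before storing it, trading the extra CPU time visible in the tables above for a smaller footprint.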
Test 1: 3000 documents with 200 words each (25 MB JSON)
I compared the serialized files on disk.
Conclusion

The tests were run with profiling, and the profiling results look okay. It seems the extra compression makes sense after all.
While executing this benchmark a strictness bug regarding lazy Occurrences was fixed.
The profiling results verify the benchmark: the profiled values match the corresponding memory footprints.
ByteString without further compression is only a bit bigger than the compressed variants. This makes sense, since we deal with small values here. The conversion to ByteString also takes more time than the compression applied afterwards.
The uncompressed result is as big as it is because of the hashed DocIds within the IntMap. The positions within the IntSet are not a factor.
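The point can be illustrated with a small serialization experiment. This is only a sketch assuming `binary`'s default encoding (the in-memory layout differs, but the proportions point the same way): a map with many DocIds and few positions comes out much larger than one with few DocIds and many positions, because every IntMap entry pays for its key plus per-set overhead:

```haskell
import           Data.Binary          (Binary, encode)
import qualified Data.ByteString.Lazy as BL
import qualified Data.IntMap.Strict   as IM
import qualified Data.IntSet          as IS
import           Data.Int             (Int64)

-- size of a value in its binary-serialized form
serializedSize :: Binary a => a -> Int64
serializedSize = BL.length . encode

-- 100 documents with one position each: 100 IntMap keys
manyDocs :: IM.IntMap IS.IntSet
manyDocs = IM.fromList [ (d, IS.singleton 1) | d <- [1 .. 100] ]

-- 1 document with 100 positions: one IntMap key, one large IntSet
manyPositions :: IM.IntMap IS.IntSet
manyPositions = IM.singleton 1 (IS.fromList [1 .. 100])
```

Both values hold 100 (DocId, position) pairs, yet `manyDocs` serializes to noticeably more bytes than `manyPositions`, matching the observation that the DocIds in the IntMap dominate while the positions are not a factor.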