uzh / triplerush

A distributed in-memory graph store.
Apache License 2.0

Review usage of MapDB in HashDictionary #35

Open pstutz opened 9 years ago

pstutz commented 9 years ago

This is the class that's being described here: https://github.com/uzh/triplerush/blob/master/src/main/scala/com/signalcollect/triplerush/dictionary/HashDictionary.scala

The purpose of this class is to offer a mapping of RDF IRIs/literals (think strings with lots of shared prefixes, usually between 20 and 80 characters long) to integers, and from integers back to those same strings. For now we only add to this dictionary and never delete an existing mapping.

We use MapDB to store an in-memory mapping from Int->String. For the String->Int mapping we use a hash on the UTF-8 encoded byte array. We only explicitly store the String->Int mapping when there is a collision of this hash (also in a much smaller MapDB B-tree).
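
To make the scheme concrete, here is a minimal in-memory sketch of the idea (not the actual HashDictionary code; the hash function and the fresh-id allocation are placeholders):

import java.nio.charset.StandardCharsets
import scala.collection.mutable

// Simplified sketch of the hash-based scheme described above.
// int2String stands in for the MapDB B-tree; collisions only holds
// explicit String -> Int entries whose hash clashed with a different entry.
object HashDictionarySketch {
  private val int2String = mutable.Map.empty[Int, Array[Byte]]
  private val collisions = mutable.Map.empty[String, Int]
  private var nextFreeId = 0

  private def hash(bytes: Array[Byte]): Int =
    java.util.Arrays.hashCode(bytes) // placeholder for the real hash function

  def addEntry(s: String): Int = {
    val bytes = s.getBytes(StandardCharsets.UTF_8)
    val id = hash(bytes)
    int2String.get(id) match {
      case None =>
        int2String(id) = bytes // common case: the hash itself is the id
        id
      case Some(existing) if java.util.Arrays.equals(existing, bytes) =>
        id // entry was already added before
      case Some(_) =>
        // hash collision: fall back to an explicitly stored String -> Int mapping
        collisions.getOrElseUpdate(s, {
          // placeholder allocation; real ids must stay disjoint from hash-derived ids
          nextFreeId += 1
          int2String(nextFreeId) = bytes
          nextFreeId
        })
    }
  }

  def getEntry(id: Int): Option[String] =
    int2String.get(id).map(b => new String(b, StandardCharsets.UTF_8))
}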

For this reason the Int->String mapping is the crucial part of our usage. For this mapping we have configured the MapDB B-tree like this:

val db = DBMaker
  .memoryUnsafeDB
  .closeOnJvmShutdown
  .transactionDisable
  .asyncWriteEnable
  .asyncWriteQueueSize(32768)
  .storeExecutorEnable(Executors.newScheduledThreadPool(math.min(16, Runtime.getRuntime.availableProcessors)))
  .compressionEnable
  .make

val int2String = db.treeMapCreate("int2String")
  .keySerializer(BTreeKeySerializer.INTEGER)
  .valueSerializer(Serializer.BYTE_ARRAY)
  .nodeSize(32)
  .makeOrGet[Int, Array[Byte]]()

Compression is enabled so that the strings, which usually share a lot of their prefixes, take up less space. A node size of 32 was chosen as a trade-off between read/write performance and memory usage; it also kept GC churn lower than larger values did.

We are considering using the G1 GC and I wonder if it is a good fit for this kind of use case. It should result in less fragmentation than other parallel collectors. I noticed that when adding entries in parallel at full speed it can be hard for the GC to keep up.
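
For reference, switching to G1 only requires JVM flags; in an sbt build this could look roughly like the following (the pause-time goal is an illustrative assumption, not a tuned value):

// build.sbt (sketch): run the application in a forked JVM with G1 enabled.
fork := true
javaOptions ++= Seq(
  "-XX:+UseG1GC",
  "-XX:MaxGCPauseMillis=200" // illustrative pause-time goal, not tuned
)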

Is there anything else we can do to reduce the GC pressure created by the intermediate objects?

I noticed that async writes and a pretty large queue size increased the write performance by a lot. What's the trade-off and what are the optimal values?

Do we need to compact the tree to keep the direct memory usage low over time?

Are there any other potential improvements?

Existing ideas for improvements:

jankotek commented 9 years ago

1) enable values outside nodes (see the combined sketch below)

2) disable compression globally and use it just on the id2string values:

val int2String = db.treeMapCreate("int2String")
  .keySerializer(BTreeKeySerializer.INTEGER)
  .valueSerializer(new Serializer.CompressionDeflateWrapper(Serializer.BYTE_ARRAY))
  .nodeSize(32)
  .makeOrGet[Int, Array[Byte]]()

3) enable a dictionary on top of the previous example:

new Serializer.CompressionDeflateWrapper(Serializer.BYTE_ARRAY, LEVEL, DICTIONARY)
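
Taken together, suggestions 1) to 3) might translate into a tree map configured roughly like this; the compression level and the dictionary contents are illustrative, valuesOutsideNodesEnable is assumed to be the MapDB 2.x call for suggestion 1, and .compressionEnable would be dropped from the DBMaker configuration:

import java.util.zip.Deflater
import org.mapdb.{BTreeKeySerializer, Serializer}

// Sketch: values stored outside the B-tree nodes, compression applied only to
// the id2string values, with a deflate dictionary seeded with common IRI prefixes.
val prefixDictionary: Array[Byte] =
  "http://www.w3.org/1999/02/22-rdf-syntax-ns#http://xmlns.com/foaf/0.1/"
    .getBytes("UTF-8") // illustrative prefixes, not a curated dictionary

val int2String = db.treeMapCreate("int2String")
  .keySerializer(BTreeKeySerializer.INTEGER)
  .valueSerializer(new Serializer.CompressionDeflateWrapper(
    Serializer.BYTE_ARRAY, Deflater.BEST_SPEED, prefixDictionary))
  .valuesOutsideNodesEnable // suggestion 1: keep values out of the nodes (assumed MapDB 2.x API)
  .nodeSize(32)
  .makeOrGet[Int, Array[Byte]]()
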
jankotek commented 9 years ago

Decrease the write queue size set by .asyncWriteQueueSize(32768).

jankotek commented 9 years ago

Try a Koloboke LongLongMap for the mapping. The key is your ID, the value is the recid from the MapDB Engine.

To insert a new value into MapDB and get its recid:

long recid = db.getEngine().put(value, Serializer.BYTE_ARRAY)

To get the value back from its recid:

byte[] val = db.getEngine().get(recid, Serializer.BYTE_ARRAY)
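
A rough Scala sketch of this approach, assuming the Engine API shown above and Koloboke's HashLongLongMaps factory (the package name differs between Koloboke releases):

import net.openhft.koloboke.collect.map.hash.HashLongLongMaps // com.koloboke in newer releases
import org.mapdb.Serializer

// Sketch: a primitive long -> long map from dictionary ids to MapDB recids,
// with the raw byte arrays stored directly through MapDB's Engine.
val id2Recid = HashLongLongMaps.newMutableMap()

def addEntry(id: Long, value: Array[Byte]): Unit = {
  val recid = db.getEngine.put(value, Serializer.BYTE_ARRAY)
  id2Recid.put(id, recid)
}

def getEntry(id: Long): Array[Byte] = {
  val recid = id2Recid.get(id)
  db.getEngine.get(recid, Serializer.BYTE_ARRAY)
}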