Closed: pstutz closed this issue 9 years ago
@pstutz: let me know if I can help you somehow on this.
Easy win: switch the representation of the strings from UTF-16 (the Java default) to UTF-8. UTF-8 needs one byte per ASCII character instead of two, so this should save almost 50% on our strings. http://stackoverflow.com/questions/88838/how-to-convert-strings-to-and-from-utf8-byte-arrays-in-java
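A minimal sketch of the round trip, along the lines of the linked answer (standalone demo class, not project code):

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo: UTF-16 vs UTF-8 footprint for an ASCII string.
public class Utf8RoundTrip {
    public static void main(String[] args) {
        String s = "triplerush";
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8); // 10 bytes, 1 per ASCII char
        // A java.lang.String is backed by UTF-16 code units: 2 bytes per char.
        System.out.println("UTF-16: " + (s.length() * 2) + " bytes, UTF-8: " + utf8.length + " bytes");
        String decoded = new String(utf8, StandardCharsets.UTF_8); // decode only when needed
        System.out.println(decoded.equals(s)); // true
    }
}
```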
@jahangirmohammed i think we could move the dictionary into its own project. then build some benchmarks for different approaches, where we test performance and memory-efficiency on strings similar to the ones we need to store.
sounds like a good thing to do. under uzh?
We should explore other things such as string interning, -XX:+UseCompressedStrings, etc., in order to ensure we do what makes sense when handling a ton of strings on the JVM.
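A quick illustration of what interning buys (hypothetical demo class): interned copies of equal strings collapse to one canonical instance, so repeated strings cost a reference instead of a full copy.

```java
// Hypothetical demo of string interning.
public class InternDemo {
    public static void main(String[] args) {
        String a = new String("http://example.org/predicate"); // forces a heap copy
        String b = new String("http://example.org/predicate"); // second heap copy
        System.out.println(a == b);                   // false: two instances
        System.out.println(a.intern() == b.intern()); // true: one pooled instance
    }
}
```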
let's create it under iht, but open it up and Apache-license it from the start so we can start using it in triplerush immediately.
@jahangirmohammed i would appreciate your help on this. it means i could focus in the short term on determining whether we really measured what we think we measured, whilst in a week or two we might have better dictionaries to address any issues we discover.
I was trying out Huffman coding on examples to see how it performs on short strings; it seems to be not that bad.
an example: encode("triplerush") leads to this:
0111010000111001000011010011000111000000001101100110001100101001000011101010000011100111110001101000
= 100 bits = 12.5 bytes.
whereas that string in a hashmap would take at least 2 * 10 = 20 bytes in UTF-16.
Not sure if my maths is off, but it seems like savings.
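For the benchmark project, something like this minimal Huffman coder could serve as a baseline (class name and training corpus are hypothetical; a real dictionary would build the code table once from a representative corpus and share it across all strings):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical sketch: build a Huffman code from character frequencies
// and encode a string to a bit string, as in the example above.
public class HuffmanSketch {

    static class Node implements Comparable<Node> {
        final char ch;            // only meaningful for leaves
        final int freq;
        final Node left, right;   // null for leaves

        Node(char ch, int freq, Node left, Node right) {
            this.ch = ch; this.freq = freq; this.left = left; this.right = right;
        }
        boolean isLeaf() { return left == null && right == null; }
        public int compareTo(Node o) { return Integer.compare(freq, o.freq); }
    }

    // Build the Huffman tree from the character frequencies of the training text.
    static Node buildTree(String text) {
        Map<Character, Integer> freq = new HashMap<>();
        for (char c : text.toCharArray()) freq.merge(c, 1, Integer::sum);
        PriorityQueue<Node> pq = new PriorityQueue<>();
        for (Map.Entry<Character, Integer> e : freq.entrySet())
            pq.add(new Node(e.getKey(), e.getValue(), null, null));
        while (pq.size() > 1) { // repeatedly merge the two rarest subtrees
            Node a = pq.poll(), b = pq.poll();
            pq.add(new Node('\0', a.freq + b.freq, a, b));
        }
        return pq.poll();
    }

    // Collect the bit string for each character by walking the tree.
    static void collectCodes(Node n, String prefix, Map<Character, String> codes) {
        if (n.isLeaf()) { codes.put(n.ch, prefix); return; }
        collectCodes(n.left, prefix + '0', codes);
        collectCodes(n.right, prefix + '1', codes);
    }

    public static void main(String[] args) {
        // Hypothetical training corpus; in practice, train on our real strings.
        Node root = buildTree("the quick brown fox jumps over the lazy dog triplerush");
        Map<Character, String> codes = new HashMap<>();
        collectCodes(root, "", codes);

        StringBuilder bits = new StringBuilder();
        for (char c : "triplerush".toCharArray()) bits.append(codes.get(c));
        System.out.println(bits + " = " + bits.length() + " bits");
    }
}
```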
let's find a way to compare different approaches on the kinds of strings we care about.
Done with MapDB.
Probably the best approach: https://github.com/uzh/triplerush/issues/22, but it will take some time to implement.