Closed: pstutz closed this issue 9 years ago
@pstutz: let me know if I can help you somehow on this.
Easy win: switch the representation of the strings from UTF-16 (the Java default) to UTF-8. UTF-8 needs one byte per ASCII character instead of two, so this should save almost 50% on our strings. http://stackoverflow.com/questions/88838/how-to-convert-strings-to-and-from-utf8-byte-arrays-in-java
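A minimal sketch of the round trip, along the lines of the linked answer (standalone demo class, not project code):

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo: UTF-16 vs UTF-8 footprint for an ASCII string.
public class Utf8RoundTrip {
    public static void main(String[] args) {
        String s = "triplerush";
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8); // 10 bytes, 1 per ASCII char
        // A java.lang.String is backed by UTF-16 code units: 2 bytes per char.
        System.out.println("UTF-16: " + (s.length() * 2) + " bytes, UTF-8: " + utf8.length + " bytes");
        String decoded = new String(utf8, StandardCharsets.UTF_8); // decode only when needed
        System.out.println(decoded.equals(s)); // true
    }
}
```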
@jahangirmohammed i think we could move the dictionary into its own project. then build some benchmarks for different approaches, where we test performance and memory-efficiency on strings similar to the ones we need to store.
sounds like a good thing to do. under uzh?
We should explore other things such as string interning, -XX:+UseCompressedStrings, etc., in order to ensure we do what makes sense when handling a ton of strings on the JVM.
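A quick illustration of what interning buys (hypothetical demo class): interned copies of equal strings collapse to one canonical instance, so repeated strings cost a reference instead of a full copy.

```java
// Hypothetical demo of string interning.
public class InternDemo {
    public static void main(String[] args) {
        String a = new String("http://example.org/predicate"); // forces a heap copy
        String b = new String("http://example.org/predicate"); // second heap copy
        System.out.println(a == b);                   // false: two instances
        System.out.println(a.intern() == b.intern()); // true: one pooled instance
    }
}
```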
let's create it under iht, but open it up and Apache-license it from the start so we can start using it in triplerush immediately.
@jahangirmohammed i would appreciate your help on this. it means i could focus in the short term on determining whether we really measured what we think we measured, whilst in a week or two we might have better dictionaries to address any issues we discover.
I was trying out Huffman coding on examples to see how it performs on short strings; it seems to be not that bad.
an example: encode("triplerush") leads to this:
0111010000111001000011010011000111000000001101100110001100101001000011101010000011100111110001101000
= 100 bits = 12.5 bytes.
whereas that string in a hashmap would take at least 2 * 10 = 20 bytes in UTF-16.
Not sure if my maths is off, but it seems like savings.
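For the benchmark project, something like this minimal Huffman coder could serve as a baseline (class name and training corpus are hypothetical; a real dictionary would build the code table once from a representative corpus and share it across all strings):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical sketch: build a Huffman code from character frequencies
// and encode a string to a bit string, as in the example above.
public class HuffmanSketch {

    static class Node implements Comparable<Node> {
        final char ch;            // only meaningful for leaves
        final int freq;
        final Node left, right;   // null for leaves

        Node(char ch, int freq, Node left, Node right) {
            this.ch = ch; this.freq = freq; this.left = left; this.right = right;
        }
        boolean isLeaf() { return left == null && right == null; }
        public int compareTo(Node o) { return Integer.compare(freq, o.freq); }
    }

    // Build the Huffman tree from the character frequencies of the training text.
    static Node buildTree(String text) {
        Map<Character, Integer> freq = new HashMap<>();
        for (char c : text.toCharArray()) freq.merge(c, 1, Integer::sum);
        PriorityQueue<Node> pq = new PriorityQueue<>();
        for (Map.Entry<Character, Integer> e : freq.entrySet())
            pq.add(new Node(e.getKey(), e.getValue(), null, null));
        while (pq.size() > 1) { // repeatedly merge the two rarest subtrees
            Node a = pq.poll(), b = pq.poll();
            pq.add(new Node('\0', a.freq + b.freq, a, b));
        }
        return pq.poll();
    }

    // Collect the bit string for each character by walking the tree.
    static void collectCodes(Node n, String prefix, Map<Character, String> codes) {
        if (n.isLeaf()) { codes.put(n.ch, prefix); return; }
        collectCodes(n.left, prefix + '0', codes);
        collectCodes(n.right, prefix + '1', codes);
    }

    public static void main(String[] args) {
        // Hypothetical training corpus; in practice, train on our real strings.
        Node root = buildTree("the quick brown fox jumps over the lazy dog triplerush");
        Map<Character, String> codes = new HashMap<>();
        collectCodes(root, "", codes);

        StringBuilder bits = new StringBuilder();
        for (char c : "triplerush".toCharArray()) bits.append(codes.get(c));
        System.out.println(bits + " = " + bits.length() + " bits");
    }
}
```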
let's find a way to compare different approaches on the kinds of strings we care about.
Done with MapDB.
Probably the best approach: https://github.com/uzh/triplerush/issues/22, but it will take some time to implement.