timdream opened 8 years ago
Let's focus on the string part of the data first.
Here are some ideas after reading some articles about punycode and delta compression: around 50% of the characters following the first character have a delta smaller than 0xff, so if we store the delta rather than the actual UCS-2 code, we could reduce the size of the string part of the data by 75%.
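As a sanity check, here is a minimal sketch of how to measure that ratio, assuming the data set is just an array of plain JavaScript strings (the `smallDeltaRatio` helper is my own illustration, not part of the existing code):

```js
// Count how many code units, after the first one in each string, differ
// from the previous code unit by less than 0x100 (i.e. fit in one byte).
function smallDeltaRatio(strings) {
  let total = 0;
  let small = 0;
  for (const s of strings) {
    for (let i = 1; i < s.length; i++) {
      total++;
      if (Math.abs(s.charCodeAt(i) - s.charCodeAt(i - 1)) < 0x100) {
        small++;
      }
    }
  }
  return total ? small / total : 0;
}
```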
We can't compress the float32 numbers of the scores unless we want to implement float16 ourselves in JavaScript, nor can we (or, at least, I) compress the index tables with my limited knowledge of data structures, etc.
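For reference, implementing float16 by hand is not a huge amount of code. A rough sketch in plain JavaScript (truncating the mantissa instead of rounding, and only an illustration of what it would take, not an actual plan) could look like this:

```js
// Convert a JS number (treated as float32) to a 16-bit half-precision pattern.
function float32ToFloat16Bits(val) {
  const f32 = new Float32Array(1);
  const u32 = new Uint32Array(f32.buffer);
  f32[0] = val;
  const x = u32[0];
  const sign = (x >>> 16) & 0x8000;          // sign bit
  let exp = (x >>> 23) & 0xff;               // 8-bit float32 exponent
  let mant = x & 0x7fffff;                   // 23-bit float32 mantissa

  if (exp === 0xff) {                        // Inf / NaN
    return sign | 0x7c00 | (mant ? 0x200 : 0);
  }
  exp = exp - 127 + 15;                      // re-bias exponent for float16
  if (exp >= 0x1f) {                         // overflow -> Inf
    return sign | 0x7c00;
  }
  if (exp <= 0) {                            // subnormal or underflow to zero
    if (exp < -10) return sign;
    mant = (mant | 0x800000) >>> (1 - exp);
    return sign | (mant >>> 13);
  }
  return sign | (exp << 10) | (mant >>> 13); // normal number (truncated)
}

// Convert a 16-bit half-precision pattern back to a JS number.
function float16BitsToNumber(h) {
  const sign = (h & 0x8000) ? -1 : 1;
  const exp = (h >>> 10) & 0x1f;
  const mant = h & 0x3ff;
  if (exp === 0) return sign * mant * Math.pow(2, -24);     // zero / subnormal
  if (exp === 0x1f) return mant ? NaN : sign * Infinity;    // Inf / NaN
  return sign * (1 + mant / 1024) * Math.pow(2, exp - 15);  // normal
}
```

Whether the precision loss is acceptable for the scores is a separate question from the size savings.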
One problem now is how to fit the variable-length string(s) into the data pack while keeping the implicit information we get by leveraging the neat double-byte alignment we have today. We don't keep the byte length of the data ourselves (the number kept is actually # of bytes / 2), and we use it to figure out the # of items in the data pack. There are wasted bits -- 12 bits, to be exact, in the first 16 bits of the "control bytes" we keep. A quick check shows the longest list has 132 items -- that's something that can be stored in 8 bits.
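To make that concrete, here is one hypothetical way to lay out the 16-bit control word: 8 bits for the item count (enough for the observed maximum of 132) and the remaining 8 bits free for flags, such as "the string payload is delta-encoded". The field names and layout below are my own illustration, not the existing format:

```js
// Hypothetical control-word layout (not the current format):
//   bits 0-7  : item count (0..255, enough for the observed max of 132)
//   bits 8-15 : free for flags, e.g. bit 8 = "string payload is delta-encoded"
function packControlWord(itemCount, flags) {
  if (itemCount > 0xff) throw new Error('item count does not fit in 8 bits');
  return ((flags & 0xff) << 8) | (itemCount & 0xff);
}

function unpackControlWord(word) {
  return { itemCount: word & 0xff, flags: (word >>> 8) & 0xff };
}
```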
The other problem is that we need some way to tell whether the next byte is a delta or not. Alternatively, if all of the bytes that follow are deltas, we would need a way to tell how many bits are used for the next delta -- they can't all be 16 bits, because then we would end up with a simple delta encoding rather than delta compression.
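One simple (hypothetical) answer to the "is the next byte a delta?" question is to reserve an escape byte: an ordinary byte encodes a small signed delta from the previous code unit, and the escape byte is followed by a full 16-bit code unit. A sketch of that idea, not a proposal for the actual format:

```js
// 0x00..0xFE : signed delta (byte - 127) from the previous code unit
// 0xFF       : escape; the next two bytes are a full 16-bit code unit (LE)
function encodeString(str) {
  const out = [];
  let prev = 0;
  for (let i = 0; i < str.length; i++) {
    const code = str.charCodeAt(i);
    const delta = code - prev;
    if (i > 0 && delta >= -127 && delta <= 127) {
      out.push(delta + 127);                    // 1-byte delta
    } else {
      out.push(0xff, code & 0xff, code >>> 8);  // escape + full code unit
    }
    prev = code;
  }
  return Uint8Array.from(out);
}

function decodeString(bytes) {
  const codes = [];
  let prev = 0;
  for (let i = 0; i < bytes.length; ) {
    let code;
    if (bytes[i] === 0xff) {
      code = bytes[i + 1] | (bytes[i + 2] << 8);
      i += 3;
    } else {
      code = prev + (bytes[i] - 127);
      i += 1;
    }
    codes.push(code);
    prev = code;
  }
  return String.fromCharCode(...codes);
}
```

The round trip `decodeString(encodeString(s))` returns the original string; the savings depend entirely on how many code units actually land in the one-byte delta range.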
Just for record keeping of my current investigation:
Around 50% of the characters following the first character have a delta smaller than 0xff, so if we store the delta rather than the actual UCS-2 code, we could reduce the size of the string part of the data by 75%.
This is wrong. My math was wrong. Only 6.7% of the characters fall into that range, so this might not work...
Investigate whether there are compression opportunities for the binary store.