mitiko / BWDPerf

BWD stands for Best Word Dictionary as it has the ability to be an optimal dictionary coder.
https://mitiko.github.io/BWDPerf
GNU General Public License v3.0
0 stars 1 forks source link

Decompression doesn't work #6

Closed mitiko closed 3 years ago

mitiko commented 3 years ago

Decompression of the compressed book111 file (which is the first 7687 bytes of book1 from the Calgary corpus, because compression is otherwise too slow to test on bigger files (see #5)) is incomplete. The bits read don't cover the whole file and the decompressor ends up trying to decompress a second dictionary, whose dictionary size, though, is way too big to fit in memory, so an out of memory exception is thrown.

Upon further investigation, it seems the encoder correctly computes a dictionary but the stream is then incorrectly split into the chosen words and stoken. I identified that because the encoded stream, as decompressed by myself by hand contains the word " ing", which does not appear at the very beginning of book1.

So maybe check the EncodeStream method in BWDEncoder.cs?

mitiko commented 3 years ago

This line of code seemed suspicious: https://github.com/Mitiko/BWDPerf/blob/6a65d609a9b05bbf038b356da516abcbdd4a7c22/src/Common/Algorithms/BWD/BWDEncoder.cs#L293-L294 The !match is actually not needed, but what this does is ensure we don't fall into an infinite loop. I suspect I added it, for when I was testing early stopping of the dictionary calculation - forcing the word to be an stoken if the ranks start going negative.

Anyway, I removed it and changed some other stuff too, not sure if it was a cause, but the encoder now pushes the correct indices and the output contains the correct stoken data.

Therefore, my best guess now is that the DictionaryToBytes transform writes the bits incorrectly. And judging from what the decoder reads from the compressed file, it also doesn't read bits correctly. That makes sense because I mostly copied the code from the DictionaryToBytes transform.

mitiko commented 3 years ago

So... the DictionaryToBytes correctly transforms the integers to bits in the bit queue and it correctly writes them in packs of 8 to the output. This makes the bit manipulation part of the decoder the prime suspect.

mitiko commented 3 years ago

Turns out I wasn't serializing correctly for an IAsyncEnumerable<byte>. It wasn't the problem, but I fixed it for now. Currently, it doesn't buffer the data and that probably slows it down, but I'll fix it when it becomes an issue.

The bit manipulation code in the decoder was a problem because I was starting my loop from 8, not 7: https://github.com/Mitiko/BWDPerf/blob/93fcab312f38bfc2c130d6cb5e1a0a22bf864b1d/src/Common/Algorithms/BWD/BWDDecoder.cs#L123-L124 Anyway, I fixed that. I also fixed reading from the s token. Or at least, it's not an obvious cause of the decompression issues: https://github.com/Mitiko/BWDPerf/blob/93fcab312f38bfc2c130d6cb5e1a0a22bf864b1d/src/Common/Algorithms/BWD/BWDDecoder.cs#L101-L113 Currently, the first 2 lines of the decompressed file and the original are identical, but then data from words get all scrambled up. From the looks of it, the s token is written out correctly and in order, and in place (at least for the first 2-3 lines after the identical ones). But then I can see 'outh' as part of 'mouth' repeated again in between words, even though the original doesn't have a second 'outh'. I was thinking it could be that we're writing words multiple times, not incrementing some count.

mitiko commented 3 years ago

Ok, we're basically done now. The decompressor handles everything perfectly, no problems in the splitting or choosing words either. The issue is with the greediness of the dictionary. We have to eliminate all words from the buffer before we move on to the next word. Or said simply, even though a word might match the string, we can't emit its index because a word, before it in the dictionary can exist forward in the buffer and eliminate it. And then when the s token gets mixed up, the whole file gets screwed.

mitiko commented 3 years ago

Decompression works!