mitiko / BWDPerf

BWD stands for Best Word Dictionary as it has the ability to be an optimal dictionary coder.
https://mitiko.github.io/BWDPerf
GNU General Public License v3.0

Finding matches is too slow. Memory efficient but slow. #5

Closed mitiko closed 3 years ago

mitiko commented 3 years ago

Compared to my previous solution, memory usage dropped by a factor of 10 on enwik6. However, since finding a mistake in the way we count words (we weren't taking translational symmetry into account), speed has dropped by a lot. We need to find a way to make it faster.

One idea is to make the default values of the WordsRef table be the index itself.

Maybe we can spend more memory on the selection of words and find a good balance in that tradeoff.

mitiko commented 3 years ago

There are some ideas for using a better string searching algorithm than the naive approach: Knuth-Morris-Pratt, Boyer-Moore, and others.

There's also the possibility of computing the suffix tree of the buffer and using an FM-index for backward search. The advantage is that we can fall back to BWT for near-incompressible files.
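For reference, here's a textbook KMP sketch in C# (illustrative only, not BWDPerf code; it assumes a non-empty word): after an O(m) preprocessing pass over the word, every occurrence in the buffer is found in a single O(n) scan, instead of backing up and re-comparing the way the naive search does.

```csharp
using System.Collections.Generic;

static class Kmp
{
    // Yields every position in `buffer` where `word` occurs, in one left-to-right pass.
    // Usage: foreach (int pos in Kmp.Find(buffer, word)) { ... }
    public static IEnumerable<int> Find(byte[] buffer, byte[] word)
    {
        if (word.Length == 0) yield break;

        // failure[i] = length of the longest proper prefix of word[0..i] that is also its suffix
        int[] failure = new int[word.Length];
        for (int i = 1, k = 0; i < word.Length; i++)
        {
            while (k > 0 && word[i] != word[k]) k = failure[k - 1];
            if (word[i] == word[k]) k++;
            failure[i] = k;
        }

        for (int i = 0, k = 0; i < buffer.Length; i++)
        {
            while (k > 0 && buffer[i] != word[k]) k = failure[k - 1];
            if (buffer[i] == word[k]) k++;
            if (k == word.Length)              // full match ending at position i
            {
                yield return i - word.Length + 1;
                k = failure[k - 1];            // keep going; overlapping matches are allowed
            }
        }
    }
}
```

The catch is that this only speeds up the search for a single word; searching the buffer for every candidate word still keeps the overall work quadratic, which is what the next comment runs into.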

mitiko commented 3 years ago

I read a brief overview of KMP and similar string search optimizations. None of them will lower the computational complexity below O(n^2).

Only the FM-index does, with O(n) construction and O(m) search for a pattern of size m. Note that an FM-index doesn't really deal with patterns in the regex sense; by "pattern" we mean the specific word being searched for.

I've heard that some O(n log n) suffix array constructions can actually be faster than the O(n) algorithm for most data sources. I'll start with the one that's easier to implement and might try the better one later.

I'm aiming at an FM-index 2 implementation, using the very efficient wavelet trees, optimized for bigger alphabets.
Some links: FM-index explanation, FM-index wiki page, Wavelet tree wiki page
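To make the backward search concrete, here's a rough, self-contained sketch (an illustration of the idea, not the planned implementation). The rank queries are answered with a naive linear scan over the BWT; the wavelet tree's job would be to answer exactly those queries in logarithmic time in the alphabet size, with compact storage.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

class FmIndexSketch
{
    static void Main()
    {
        byte[] text = Encoding.ASCII.GetBytes("banana");
        // Append a 0 sentinel (smaller than every other byte) and sort the suffixes.
        byte[] t = text.Append((byte)0).ToArray();
        int[] sa = Enumerable.Range(0, t.Length)
            .OrderBy(i => i, Comparer<int>.Create((a, b) => CompareSuffixes(t, a, b)))
            .ToArray();

        // BWT[i] = the character just before the i-th smallest suffix.
        byte[] bwt = sa.Select(p => t[(p + t.Length - 1) % t.Length]).ToArray();

        // C[c] = how many characters in the text are strictly smaller than c.
        int[] C = new int[257];
        foreach (byte b in t) C[b + 1]++;
        for (int i = 1; i < 257; i++) C[i] += C[i - 1];

        Console.WriteLine(Count(bwt, C, Encoding.ASCII.GetBytes("ana"))); // 2
    }

    // Backward search: O(m) steps for a pattern of size m, two rank queries per step.
    static int Count(byte[] bwt, int[] C, byte[] pattern)
    {
        int lo = 0, hi = bwt.Length;
        for (int i = pattern.Length - 1; i >= 0 && lo < hi; i--)
        {
            byte c = pattern[i];
            lo = C[c] + Rank(bwt, c, lo);
            hi = C[c] + Rank(bwt, c, hi);
        }
        return hi - lo; // number of occurrences of the pattern
    }

    // Naive rank: occurrences of c in bwt[0..end). This is what a wavelet tree replaces.
    static int Rank(byte[] bwt, byte c, int end) => bwt.Take(end).Count(b => b == c);

    static int CompareSuffixes(byte[] t, int a, int b)
    {
        while (a < t.Length && b < t.Length)
        {
            if (t[a] != t[b]) return t[a].CompareTo(t[b]);
            a++; b++;
        }
        return (t.Length - a).CompareTo(t.Length - b); // the exhausted suffix sorts first
    }
}
```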

mitiko commented 3 years ago

Thinking about the wordRef array in general, it seems it'll be too big to fit into memory for bigger files if we encode the whole file as one block. Imagine we're working on enwik9, as one does: the wordRef array stores one integer per byte for every word length, so memory usage is (32/8)*m bytes per input byte, and for a good dictionary m might be around 24.
This means we need at least 40x the memory of the file itself, which for enwik9 is 40GB.

My machine only has 8GB and the Hutter Prize requirement is 10GB, so we need to do something about that.

If we want to preserve this structure, we can write it to a file, since it's mostly cold data - we only read and write at specific locations, where a word was chosen. But converting integers to bytes and back every time we need to read from it seems expensive.
The other way is to use a different method for counting words. But even then, during encoding we're still looking at the data globally - unless we look ahead in the data to find better words while still using the old linear parsing method. For example, when we match a word, we don't immediately emit it; we check whether a better word (one that comes before the current word in the ordered dictionary) starts anywhere in the current selection and would cover characters in it, making the initial guess useless. This repeats recursively until no better word starts in the selection (guaranteed to terminate, since we have a limited number of words). A rough sketch of this look-ahead idea is below.
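This is how I'd read that look-ahead rule - hypothetical code, not the current parser; the dictionary is assumed to be ordered best-to-worst, and the handling of skipped-over bytes is glossed over.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

class LookAheadParse
{
    // Emits (position, word index) pairs. When a word matches, we don't emit it right away:
    // if a better word (lower index in the ordered dictionary) starts inside the current
    // selection, we jump to it and repeat until nothing better starts inside the selection.
    // This terminates because the word index strictly decreases on every improvement.
    static IEnumerable<(int Pos, int WordIndex)> Parse(byte[] buffer, byte[][] dict)
    {
        int pos = 0;
        while (pos < buffer.Length)
        {
            int word = BestMatchAt(buffer, pos, dict);
            if (word < 0) { pos++; continue; }       // a real coder would emit a literal here

            int start = pos;
            bool improved = true;
            while (improved)
            {
                improved = false;
                for (int p = start + 1; p < start + dict[word].Length; p++)
                {
                    int cand = BestMatchAt(buffer, p, dict);
                    if (cand >= 0 && cand < word)    // a better word starts inside the selection
                    {
                        start = p; word = cand; improved = true;
                        break;
                    }
                }
            }
            // NOTE: bytes in [pos, start) are skipped here; a real coder would emit them
            // as literals or re-parse them.
            yield return (start, word);
            pos = start + dict[word].Length;
        }
    }

    // Index of the best (lowest-index) dictionary word that matches at `pos`, or -1.
    static int BestMatchAt(byte[] buffer, int pos, byte[][] dict)
    {
        for (int i = 0; i < dict.Length; i++)
            if (pos + dict[i].Length <= buffer.Length &&
                buffer.AsSpan(pos, dict[i].Length).SequenceEqual(dict[i]))
                return i;
        return -1;
    }

    static void Main()
    {
        // "comp" matches first, but the better-ranked "ompress" starts inside that selection.
        byte[][] dict = { Encoding.ASCII.GetBytes("ompress"), Encoding.ASCII.GetBytes("comp") };
        byte[] buffer = Encoding.ASCII.GetBytes("xcompress");
        foreach (var (p, w) in Parse(buffer, dict))
            Console.WriteLine($"{p}: {Encoding.ASCII.GetString(dict[w])}"); // 2: ompress
    }
}
```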

mitiko commented 3 years ago

How does wordRef actually help us? What is its purpose?
It basically gives us a state of which words have been used and a basis for counting occurrences, since the count is an important metric in the ranking. So it achieves two things: faster state updates (faster than keeping a list of contiguous regions) and fast, simple re-counting of words on update.

Now that I think about it, we can keep a list of contiguous regions like before, but have it be a list of Span<byte> slices, so we aren't actually copying any bytes - just references, managed pointers.
The re-counting seems tricky though, especially for words that have the same byte content but sit at different offsets - those would be counted as different words. Even if we somehow find a good way to do that, the counts would be stored in a hash table, and there are still at least about fileSize/2 distinct words, times 24 for each m, making it 12*fileSize. Even if my rough calculations are too broad, we'll still be pretty close to the memory limit.
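One wrinkle with the Span<byte> idea: a List<Span<byte>> won't compile, because Span<byte> is a ref struct, but ReadOnlyMemory<byte> gives the same copy-free slicing and can sit in a list. A minimal sketch (my own illustration, not the actual BWDPerf code) of keeping the uncovered regions as slices:

```csharp
using System;
using System.Collections.Generic;
using System.Text;

class RegionList
{
    // (absolute start offset in the buffer, slice of the buffer) - no bytes are copied,
    // each entry is just a reference to the original array plus an offset and a length.
    private readonly List<(int Start, ReadOnlyMemory<byte> Bytes)> regions = new();

    public RegionList(byte[] buffer) => regions.Add((0, buffer));

    public IEnumerable<(int Start, ReadOnlyMemory<byte> Bytes)> Regions => regions;

    // Mark `length` bytes at absolute offset `start` as covered by a chosen word,
    // splitting the region that contains them into the parts before and after the word.
    public void Cover(int start, int length)
    {
        for (int i = 0; i < regions.Count; i++)
        {
            var (rStart, bytes) = regions[i];
            if (start < rStart || start + length > rStart + bytes.Length) continue;

            regions.RemoveAt(i);
            int local = start - rStart;
            if (local + length < bytes.Length)                   // part after the word
                regions.Insert(i, (start + length, bytes.Slice(local + length)));
            if (local > 0)                                       // part before the word
                regions.Insert(i, (rStart, bytes.Slice(0, local)));
            return;
        }
    }

    static void Main()
    {
        byte[] buffer = Encoding.ASCII.GetBytes("to be or not to be");
        var list = new RegionList(buffer);
        list.Cover(3, 2); // cover the "be" at offset 3
        foreach (var (s, bytes) in list.Regions)
            Console.WriteLine($"{s}: \"{Encoding.ASCII.GetString(bytes.Span)}\"");
        // 0: "to "   5: " or not to be"
    }
}
```

That only covers the "faster state update" half of what wordRef does; the re-counting question above stays open.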

The other solution is to drop ranking altogether and make good approximations of the vertex cover of our graph.

mitiko commented 3 years ago

We've now implemented the suffix array but we don't want to throw away our original buffer, as we need to do parsing on it later.

We can use binary search on the SA for a PoC and then implement the whole FM-index by storing the BWT and answering rank queries with a wavelet tree.
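A minimal, self-contained sketch of that PoC (illustrative only, with a deliberately naive O(n^2 log n) SA construction): two binary searches give the range of SA entries whose suffixes start with the word, and the match positions are read straight off the suffix array while the original buffer stays untouched.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

class SuffixArrayPoC
{
    // Naive O(n^2 log n) construction - good enough for a proof of concept.
    static int[] BuildSuffixArray(byte[] buf) =>
        Enumerable.Range(0, buf.Length)
                  .OrderBy(i => i, Comparer<int>.Create((a, b) => CompareSuffixes(buf, a, b)))
                  .ToArray();

    static int CompareSuffixes(byte[] buf, int a, int b)
    {
        while (a < buf.Length && b < buf.Length)
        {
            if (buf[a] != buf[b]) return buf[a].CompareTo(buf[b]);
            a++; b++;
        }
        return (buf.Length - a).CompareTo(buf.Length - b); // the exhausted suffix sorts first
    }

    // <0 if the suffix at `start` sorts before `word`, 0 if it starts with `word`, >0 otherwise.
    static int CompareToWord(byte[] buf, int start, byte[] word)
    {
        for (int i = 0; i < word.Length; i++)
        {
            if (start + i >= buf.Length) return -1;   // the suffix is shorter than the word
            if (buf[start + i] != word[i]) return buf[start + i].CompareTo(word[i]);
        }
        return 0;                                     // the suffix starts with the word
    }

    // All positions where `word` occurs: a lower-bound and an upper-bound binary search
    // give the SA range of suffixes starting with `word`.
    static List<int> FindOccurrences(byte[] buf, int[] sa, byte[] word)
    {
        int lo = 0, hi = sa.Length;
        while (lo < hi)                               // lower bound
        {
            int mid = (lo + hi) / 2;
            if (CompareToWord(buf, sa[mid], word) < 0) lo = mid + 1; else hi = mid;
        }
        int first = lo;
        hi = sa.Length;
        while (lo < hi)                               // upper bound
        {
            int mid = (lo + hi) / 2;
            if (CompareToWord(buf, sa[mid], word) <= 0) lo = mid + 1; else hi = mid;
        }
        var hits = sa.Skip(first).Take(lo - first).ToList();
        hits.Sort();                                  // SA order is not text order
        return hits;
    }

    static void Main()
    {
        byte[] buf = Encoding.ASCII.GetBytes("the theory of everything");
        int[] sa = BuildSuffixArray(buf);
        byte[] word = Encoding.ASCII.GetBytes("the");
        Console.WriteLine(string.Join(", ", FindOccurrences(buf, sa, word))); // 0, 4
    }
}
```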

mitiko commented 3 years ago

So... I've done the binary search on the SA and hooked BWD up to use it, but the tests aren't passing.
There seems to be a problem with how the decoder decodes the <s> token index.
First of all, I've written incredibly bad code for decoding that index; it's pretty inconsistent.

But if that's what's causing the problems, why didn't previous tests catch it? Either this isn't the issue, or it is and the parsing has changed - which would mean either I hadn't been doing the parsing correctly before, or the SA search doesn't work.

It's also pretty strange that the compressed file gets bigger. I should check the difference between wordRef in the previous parsing implementation and the current one.

mitiko commented 3 years ago

I was doing the binary search wrong, and there was also a problem with how the search results were used. We obviously need to sort them first, and we can't assume the last search was a match - the last actual match needs to be saved.

Anyway, fixed that. Time for a benchmark

mitiko commented 3 years ago

Uh uh, baby, we have to fix the SA creation first. Let's fill the overflow of the shift with zeroes instead of cycling back.

mitiko commented 3 years ago

Here are the benchmark results of the version with SA:


BenchmarkDotNet=v0.12.1, OS=ubuntu 18.04
Intel Core i5-7200U CPU 2.50GHz (Kaby Lake), 1 CPU, 4 logical and 2 physical cores
.NET Core SDK=5.0.103
  [Host]   : .NET Core 5.0.3 (CoreCLR 5.0.321.7203, CoreFX 5.0.321.7203), X64 RyuJIT
  ShortRun : .NET Core 5.0.3 (CoreCLR 5.0.321.7203, CoreFX 5.0.321.7203), X64 RyuJIT

Job=ShortRun  IterationCount=3  LaunchCount=1  
WarmupCount=3  
| Method     |         Mean |         Error |      StdDev |        Gen 0 |     Gen 1 |     Gen 2 |    Allocated |
|----------- |-------------:|--------------:|------------:|-------------:|----------:|----------:|-------------:|
| Compress   | 2,803.483 ms | 4,026.7930 ms | 220.7221 ms | 1226000.0000 | 2000.0000 | 1000.0000 | 1908300.7 KB |
| Decompress |     2.151 ms |     0.3771 ms |   0.0207 ms |      82.0313 |         - |         - |    126.84 KB |

mitiko commented 3 years ago

Here are the results before SA:


BenchmarkDotNet=v0.12.1, OS=ubuntu 18.04
Intel Core i5-7200U CPU 2.50GHz (Kaby Lake), 1 CPU, 4 logical and 2 physical cores
.NET Core SDK=5.0.103
  [Host]   : .NET Core 5.0.3 (CoreCLR 5.0.321.7203, CoreFX 5.0.321.7203), X64 RyuJIT
  ShortRun : .NET Core 5.0.3 (CoreCLR 5.0.321.7203, CoreFX 5.0.321.7203), X64 RyuJIT

Job=ShortRun  IterationCount=3  LaunchCount=1  
WarmupCount=3  
| Method     |         Mean |          Error |      StdDev |        Gen 0 |     Gen 1 |     Gen 2 |     Allocated |
|----------- |-------------:|---------------:|------------:|-------------:|----------:|----------:|--------------:|
| Compress   | 4,387.584 ms | 17,130.3067 ms | 938.9699 ms | 1224000.0000 | 2000.0000 | 1000.0000 | 1904063.12 KB |
| Decompress |     2.094 ms |      0.5860 ms |   0.0321 ms |      82.0313 |         - |         - |     126.84 KB |

mitiko commented 3 years ago

Those are for enwik4 with r=8 and m=16. Let's see enwik5 with the same parameters. With SA:


BenchmarkDotNet=v0.12.1, OS=ubuntu 18.04
Intel Core i5-7200U CPU 2.50GHz (Kaby Lake), 1 CPU, 4 logical and 2 physical cores
.NET Core SDK=5.0.103
  [Host]   : .NET Core 5.0.3 (CoreCLR 5.0.321.7203, CoreFX 5.0.321.7203), X64 RyuJIT
  ShortRun : .NET Core 5.0.3 (CoreCLR 5.0.321.7203, CoreFX 5.0.321.7203), X64 RyuJIT

Job=ShortRun  IterationCount=3  LaunchCount=1  
WarmupCount=3  
| Method     |         Mean |         Error |        StdDev |       Median |        Gen 0 |     Gen 1 |     Gen 2 |   Allocated |
|----------- |-------------:|--------------:|--------------:|-------------:|-------------:|----------:|----------:|------------:|
| Compress   | 30,454.31 ms | 274,956.56 ms | 15,071.297 ms | 23,713.83 ms | 7863000.0000 | 4000.0000 | 2000.0000 | 11848.52 MB |
| Decompress |     23.38 ms |      69.88 ms |      3.830 ms |     21.77 ms |     968.7500 |         - |         - |     1.46 MB |

Without SA: it takes more than 160 s/op, so I'm not going to run all the tests.