siara-cc / Unishox2

Compression for Unicode short strings (works on arduino)
Apache License 2.0
179 stars 26 forks source link

Optimize wordlist.h for Unishox3 #52

Open eos175 opened 1 year ago

eos175 commented 1 year ago

The current implementation of Unishox3 uses wordlist.h which is a header file containing arrays of strings that are used for encoding and decoding text. However, the current implementation has some inefficiencies that could be optimized.

Firstly, the current implementation calls strlen to get the length of each string in the arrays. One potential solution to this is to use std::string_view, which gets the length of the string at compile time. However, this approach can increase memory usage since there are many strings in the array.

To address this issue, the wordlist.h file can be flattened into a single string, wordlist.bin, which can be compiled into the binary. This significantly reduces the amount of memory required since there are no longer any pointers to the individual strings, and the string length is already known at compile time. To know where each string is located in the flattened string, an array of indices is used wordlist_index.h.

Additionally, this optimization can improve CPU performance since the strings are now located in a contiguous block of memory, which improves spatial locality and cache performance.

To implement this optimization, we propose the following changes:

  1. Flatten the wordlist.h file into a single string, wordlist.bin, which can be compiled into the binary.
  2. Create an array of indices, wordlist_index, to keep track of where each string starts and ends in the flattened string.
  3. Modify the getDictWord() function to use the wordlist_index array to retrieve the correct string from the flattened wordlist.bin string.
and are but for haveingjustlikenot thatthe theythiswas withyou ", "": ": "":""},.com.org</="aboutall alsobeencan coulddon'tevenfromget goodhad has her his how i'm in theit'sknowmakemoremuchof theone onlyotherout she somestillthantheirthemthentherethinktimeto ...
const int wordlist_index[] = {0, 4, 8, 12, 16, 20, 23, 27, 31, 35, 39, 43, 47, 51, 55, 59, 63, 67, 70, ...
int getOffset(int lvl)
{
    switch (lvl) {
    case 0:
        return 0;
    case 1:
        return 16;
    case 2:
        return 16 + 64;
    case 3:
        return 16 + 64 + 256;
    case 4:
        return 16 + 64 + 256 + 2048;
    case 5:
        return 16 + 64 + 256 + 2048 + 32768;
    case 6:
        return 16 + 64 + 256 + 2048 + 32768 + 131072;
    default:
        return std::size(wordlist_index); // invalid
    }
}

extern char _binary_wordlist_bin_start[];

const char *getDictWord(int lvl, int pos, size_t *size)
{
    constexpr auto wordlist_bin = (const char *)_binary_wordlist_bin_start;

    size_t index = getOffset(lvl) + pos;
    if (index < std::size(wordlist_index) - 1) {
        auto start = wordlist_index[index];
        auto end = wordlist_index[index + 1];
        *size = end - start;
        return wordlist_bin + start;
    }

    *size = 0;
    return nullptr;
}

This optimization can also reduce the time required for compilation since the wordlist.h file is no longer required, and the compilation of the binary is faster.

We have tested this implementation and found that it reduces both memory usage and execution time compared to the original implementation.

ld -r -b binary -o wordlist_bin.o wordlist.bin
g++ main.cpp wordlist_bin.o && ./a.out
siara-cc commented 1 year ago

Hi, thank you for sharing this. I appreciate the amount of thought that went into this.

If I understand your solution correctly, it seeks to save the space occupied by the pointer, which is 8 bytes on a 64-bit machine and also the null terminator.

So assuming we only use a 4 byte integer array, the total savings per entry would be 8+1-4 = 5 bytes, which translates to 51.2m = around 6mb. Considering the size of data is 14mb, so the percentage of savings achieved is 6/(9+14)100 = ~28%. This is of course quite significant for a 64-bit machine.

I had given this a lot of thought too. I also wanted to cram the dictionary into flash memories on embedded devices which don't have fancy amount of RAM (at max 512kb). These devices don't even 14mb kind of flash memories. Most popular ones have 256kb or at max 4mb.

The solution I came up initially was similar to yours, but I planned to use a block based approach where the pointer size is only 2 bytes and I can avoid the null terminator if I use an array in each block so length of string is obtained from the position of the next string. Then I can use binary search first on the blocks and then do another binary search within the block. So the savings per entry would be 8+1-2 = 7 bytes, not considering overheads for each block. The total savings would be 7*1.2m = 8.5mb roughly.

However, this still means the dictionary size has to be at least (14+2.4+etc) = ~17mb, which is quite high, considering I plan to add more entries to the dictionary, if possible get the entire Wiktionary in, which is around 8 million entries.

I was still happy with this and was even planning a separate lib, until I searched and came across Marisa developed by Susumu Yata, which could shrink 14mb to 3.4mb, including pointer and all. However this lib is not endian compatible and can work only on esp32 at best, so I am planning to make my own with improvements hopefully bringing size down to 3mb and also make it work on other micro controllers.

If you see the present code, there is also some trouble locating the best match in the dictionary because of using binary search based design. The better approach is to use trie based design, where such feature has been already developed by the scientist in the form of "Common Prefix Search".

I am also fine tuning the dictionary here: https://github.com/siara-cc/Text_frequency_research, so this is the status of this so far.

It is nice to know that you have developed a better one. If you create a PR I will merge it, but otherwise I am attempting to save even more using the above plan.

eos175 commented 1 year ago

Regarding the idea of using int16 indices instead of pointers, it is possible to implement it. However, the dictionary for level 6 has 131,072 elements, for this reason use int and not uint16, which means that limiting the number of elements per level could lead to even more savings.

In terms of binary search vs. trie-based design, binary search can be faster on modern architectures due to caching and spatial locality. On the other hand, a tree-based design may result in more cache misses due to the abundance of pointers. However, this may not apply to MCUs where the CPU is not as fast.

siara-cc commented 1 year ago

Thank you for the PR !! I have merged it.

About int16 (uint16), I meant to divide the storing of entries in blocks of say 64k bytes, each having the cumulative number of words in the header followed by an array of uint16 pointers to entries in the block followed by actual entries. To search for an entry, a binary search is applied on the blocks to arrive at the block in which the entry is present. Then a second binary search is to be applied on the block found. This is analogous to the implementation here: https://github.com/siara-cc/sqlite_micro_logger_c/blob/401be204ce1069f33d6b19bf36435ed7a2b2f657/src/ulog_sqlite.c#LL1431C17-L1431C17 In this case, it will be slower than your implementation, but will save more bytes.

Marisa is a succinct LOUDS based Patricia trie. I think the Patricia trie based approach will have lesser cache misses than the lg2(n) worst case probability of binary search, which in this case is 20. However Marisa trie is a succinct trie with no pointers so it will have to depend on lookup tables for range() and select() queries. On the contrary, a LOUDS based trie has the property of exiting early for non-existing entries as opposed to binary search it is late-exit. Dictionary lookup for Unishox would have more fail rate than hit rate. So which approach will be faster remains to be seen. I think this change will be helpful for comparing the performance!

siara-cc commented 1 year ago

@eos175 FYI I integrated a Bloom filter and Marisa and got mixed results: