PredictorString is a heavyweight std::string wrapper that compresses its content, decompresses it when used, and accelerates repeated access with caching.
It uses single-character transition-dictionary prediction to compress data, and can optionally enable Huffman Encoding and Run Length Encoding to improve the compression ratio. Decompression is fast.
std::string s1;
std::getline(std::ifstream("opencl.hpp"), s1, '\0'); // 326kB (a big header file from an api)
using UnsignedIntegerPrefixType = size_t;
CompressedStringLib::PredictorString<UnsignedIntegerPrefixType> pstr(s1); // 246kB
std::string output = pstr.string(); // 326 kB again
// 1MB of 'a' compresses to ~126kB (and 2x decoding speed because of the predictable data)
CompressedStringLib::PredictorString<UnsignedIntegerPrefixType> pstr2(std::string(1024*1024,'a'));
// ~5 nanoseconds sampling latency with caching
// 14 milliseconds without caching (or with cache-miss), not optimized yet
unsigned char c = pstr2[1024*512];
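A minimal sketch of how such a latency could be measured with std::chrono; the index, iteration count, and averaging here are illustrative and not part of the library:
#include "CompressStringLib.h"
#include <chrono>
#include <iostream>
#include <string>

int main()
{
    using UnsignedIntegerPrefixType = size_t;
    CompressedStringLib::PredictorString<UnsignedIntegerPrefixType> pstr2(std::string(1024*1024,'a'));

    volatile unsigned char sink = pstr2[1024*512]; // first access pays the decompression cost and fills the cache

    auto t1 = std::chrono::high_resolution_clock::now();
    for(int i=0;i<1000000;i++)
        sink = pstr2[1024*512];                    // repeated accesses to the same region should hit the cache
    auto t2 = std::chrono::high_resolution_clock::now();

    std::cout << std::chrono::duration_cast<std::chrono::nanoseconds>(t2-t1).count() / 1000000.0
              << " ns per cached access on average" << std::endl;
    return 0;
}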
Tests compiled with g++ using flags -march=native -mavx -m64 -O3 -std=c++1y:
Encoding: 80 MB/s single core
Decoding: 160 MB/s single core
Encoding: 246 MB/s single core
Decoding: 252 MB/s single core
Encoding: 280 MB/s
Decoding: 270 MB/s
Encoding: 516 MB/s
Decoding: 426 MB/s
Encoding: 280 MB/s
Decoding: 290 MB/s
Encoding: 340 MB/s
Decoding: 240 MB/s
Encoding: 212 MB/s (if each thread does a different string encoding, then 12GB/s total)
Decoding: 194 MB/s
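The multi-threaded total above assumes each thread encodes its own, independent string; a minimal sketch of that pattern with std::thread, where the thread count is illustrative and the instances share no state:
#include "CompressStringLib.h"
#include <fstream>
#include <string>
#include <thread>
#include <vector>

int main()
{
    using UnsignedIntegerPrefixType = size_t;
    std::string s1;
    std::getline(std::ifstream("opencl.hpp"), s1, '\0');

    const int numThreads = 8; // illustrative thread count
    std::vector<CompressedStringLib::PredictorString<UnsignedIntegerPrefixType>> results(numThreads);
    std::vector<std::thread> threads;
    for(int i=0;i<numThreads;i++)
    {
        // each thread encodes its own PredictorString instance, so scaling depends only on cores and memory bandwidth
        threads.emplace_back([&results, &s1, i](){
            results[i] = CompressedStringLib::PredictorString<UnsignedIntegerPrefixType>(s1);
        });
    }
    for(auto & t : threads) t.join();
    return 0;
}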
HuffmanString uses Huffman Encoding to compress data and offers fast indexing into the compressed data.
std::string stri;
std::getline(std::ifstream("opencl.hpp"), stri, '\0'); // 326kB (a big header file from an api)
// keeps Huffman Tree to decode later
CompressedStringLib::HuffmanString test(stri,false,256); // 212kB
// makes internal Huffman tree data serialized to save on disk or other storage
auto ser = test.serialize();
// later loaded from another instance
CompressedStringLib::HuffmanString test2;
test2.deserializeFrom(ser);
std::cout<<test2.string().size()<<std::endl; // 326 kB again
std::cout<<ser.size()<<std::endl; // 212 kB
// compresses to ~640 kB and decodes 4x faster because the input is almost entirely "a" with just a single "b"
// 256 chars cached
CompressedStringLib::HuffmanString test3(std::string(1024*1024*4,'a') + std::string("b"),false,256);
// bigFile: 4MB of "a"
std::string bigFile(1024*1024*4,'a');
CompressedStringLib::HuffmanString test5(bigFile,false,256); // 540kB
// ~5 nanoseconds sampling latency with caching
// 35 microseconds without caching (or with cache-miss), optimized for indexing
unsigned char c = test5[1024*1024*2];
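Since serialize() produces the Huffman Tree plus payload as a byte buffer, it can be persisted to disk and reloaded later. A minimal sketch, assuming serialize() returns a std::string (as its use with size() above suggests); the file name and the test4 instance are illustrative:
// write the serialized Huffman data to disk (assumes ser is a std::string of bytes)
{
    std::ofstream out("huffman.bin", std::ios::binary);
    out.write(ser.data(), ser.size());
}

// read it back and rebuild the string in a fresh instance
std::ifstream in("huffman.bin", std::ios::binary | std::ios::ate);
std::string loaded(static_cast<size_t>(in.tellg()), '\0');
in.seekg(0);
in.read(&loaded[0], loaded.size());
CompressedStringLib::HuffmanString test4;
test4.deserializeFrom(loaded);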
Encoding: 65 MB/s single core (150 MB/s for 1MB of "a")
Decoding: 30 MB/s single core (140 MB/s for 1MB of "a")
(FX8150 3.6GHz CPU + single channel 1333MHz RAM + Ubuntu 18.04LTS + turbo disabled)
(Tested on compiler g++ with flags -march=native -mavx -m64 -O3 -std=c++1y)
By default, PredictorString does not enable Huffman Encoding, for speed. When the CompressedStringLib::PredictorString<UnsignedIntegerPrefixType>::OPTIMIZE_WITH_HUFFMAN_ENCODING
parameter is given to its constructor, it further compresses the internal char-literal data to save some more space (at the cost of allocating some memory for the Huffman Tree and its lookup tables).
std::string str;
std::getline(std::ifstream("opencl.hpp"), str, '\0'); // 326kB
// internal data (only compressed char-literals, not prefixes) is compressed further by Huffman Encoding
// 173 kB
// 32 MB/s decoding bandwidth (only slightly higher than Huffman Encoding alone, because the Huffman stage works on data already compressed by the predictor)
using UnsignedIntegerPrefixType = size_t;
CompressedStringLib::PredictorString<UnsignedIntegerPrefixType> pstr(
str,
2048,
CompressedStringLib::PredictorString<size_t>::OPTIMIZE_WITH_HUFFMAN_ENCODING
);
auto extracted = pstr.string(); // 326 kB
Some compression ratios for other test data:
htop memory measurement:
#include "CompressStringLib.h"
#include<fstream>
int main()
{
using UnsignedIntegerPrefixType = size_t;
CompressedStringLib::PredictorString<UnsignedIntegerPrefixType> pstrArr[2000];
{
std::string s1; //str holds the content of the file
std::getline(std::ifstream("opencl.hpp"), s1, '\0'); // 326kB
for(int i=0;i<2000;i++)
{
// disabling caching (0) as it increases memory footprint per instance!
pstrArr[i]=CompressedStringLib::PredictorString<UnsignedIntegerPrefixType>(s1,0,CompressedStringLib::PredictorString<UnsignedIntegerPrefixType>::OPTIMIZE_WITH_HUFFMAN_ENCODING);
}
}
int ii;
std::cin>>ii;
return 0;
}
According to htop, the app consumes 11.6% of 4GB RAM, which is about 464 megabytes. 464 MB / 2000 instances ≈ 232 kB per instance.
In PredictorString, the algorithm uses a dynamically built boolean dictionary to guess the next character of the input. The dictionary is edited according to whether each prediction succeeds or fails: on every failed prediction the character is carried to the output, while on every successful prediction only the dictionary is updated. The sequence of successes and failures is packed into a bit mask (1 = success, 0 = fail) that is written to a prefix output. Decompression does the exact opposite, rebuilding the same dictionary dynamically, so the extra memory footprint exists only during encoding and decoding.
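A simplified, self-contained sketch of this idea (not the library's actual code, and not necessarily its exact dictionary update rule): a 256-entry table predicts the next character from the previous one; hits cost one mask bit, misses cost one mask bit plus a literal.
#include <string>
#include <vector>

// Simplified single-char transition predictor: for every previous character,
// remember the character that followed it last time and predict it again.
void encodeSketch(const std::string & in, std::vector<bool> & mask, std::string & literals)
{
    unsigned char dict[256] = {0}; // prediction dictionary: previous char -> predicted next char
    unsigned char prev = 0;
    for(unsigned char c : in)
    {
        if(dict[prev] == c)
        {
            mask.push_back(true);    // successful prediction: costs only 1 bit in the prefix/mask stream
        }
        else
        {
            mask.push_back(false);   // failed prediction: 1 bit + the literal character
            literals.push_back((char)c);
            dict[prev] = c;          // learn the transition so it is predicted next time
        }
        prev = c;
    }
}

// Decoding rebuilds the identical dictionary from mask + literals, so nothing else needs to be stored.
std::string decodeSketch(const std::vector<bool> & mask, const std::string & literals)
{
    unsigned char dict[256] = {0};
    unsigned char prev = 0;
    std::string out;
    size_t li = 0;
    for(bool hit : mask)
    {
        unsigned char c = hit ? dict[prev] : (unsigned char)literals[li++];
        if(!hit) dict[prev] = c;
        out.push_back((char)c);
        prev = c;
    }
    return out;
}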
In HuffmanString, there are three phases: building the Huffman Tree, encoding the string, and decoding the string. The Huffman Tree is built from the frequencies of the character codes: the most frequent characters are placed in nodes near the root, while less frequent characters are placed in deeper nodes. The tree is then traversed top-down to build a code for each character: every "left" branch taken appends a "0" to the code and every "right" branch appends a "1", until the target character's node is reached. This effectively encodes plain text with non-random content into much shorter codes, as low as a single bit per character.
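A compact sketch of the code-assignment step described above, using the textbook priority-queue construction (this illustrates the idea, not necessarily the library's exact internals):
#include <cstdint>
#include <map>
#include <queue>
#include <string>
#include <utility>
#include <vector>

struct Node { uint64_t freq; int ch; Node* left; Node* right; };

// Build a bit-string code per character: "0" for each left branch, "1" for each right branch.
std::map<char, std::string> buildCodes(const std::string & text)
{
    uint64_t freq[256] = {0};
    for(unsigned char c : text) freq[c]++;

    auto cmp = [](Node* a, Node* b){ return a->freq > b->freq; };
    std::priority_queue<Node*, std::vector<Node*>, decltype(cmp)> pq(cmp);
    for(int i=0;i<256;i++)
        if(freq[i]) pq.push(new Node{freq[i], i, nullptr, nullptr});

    std::map<char, std::string> codes;
    if(pq.empty()) return codes;

    // repeatedly merge the two least frequent nodes; rare characters end up deeper in the tree
    while(pq.size() > 1)
    {
        Node* a = pq.top(); pq.pop();
        Node* b = pq.top(); pq.pop();
        pq.push(new Node{a->freq + b->freq, -1, a, b});
    }

    // traverse top-down, building each leaf's code from the path taken
    std::vector<std::pair<Node*, std::string>> stack;
    stack.push_back(std::make_pair(pq.top(), std::string("")));
    while(!stack.empty())
    {
        Node* node = stack.back().first;
        std::string code = stack.back().second;
        stack.pop_back();
        if(node->ch >= 0) codes[(char)node->ch] = code.empty() ? "0" : code; // single-distinct-char edge case
        if(node->left)  stack.push_back(std::make_pair(node->left,  code + "0"));
        if(node->right) stack.push_back(std::make_pair(node->right, code + "1"));
    }
    return codes; // tree nodes are intentionally leaked here for brevity
}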