kaitz / paq8pxd

GNU General Public License v2.0
compression filter

Paq8pxd

To install and use in Windows:

A .paq8pxd extension is added for compression, removed for decompression. The output will go in the same folder as the input.

While paq8pxd is working, a command window will appear and report progress. When it is done you can close the window by pressing ENTER or clicking [X].

COMMAND LINE INTERFACE

The compressed output file is named by adding a ".paq8pxd" extension to the first named file (file1.paq8pxd). Each file that exists is added to the archive and its name is stored without a path. The option -N specifies a compression level ranging from -0 (fastest) to -15 (smallest); the default is -5. If no option is given and there is only one file, the program will pause when finished until you press the ENTER key (to support drag and drop). If file1.paq8pxd exists then it is overwritten. Level -0 only transforms or decompresses data.

If the first named file ends in ".paq8pxd" then it is assumed to be an archive and the files within are extracted to the same directory as the archive unless a different directory (dir2) is specified. The -d option forces extraction even if there is not a ".paq8pxd" extension. If any output file already exists, then it is compared with the archive content and the first byte that differs is reported. No files are overwritten or deleted. If there is only one argument (no -d or dir2) then the program will pause when finished until you press ENTER.

For compression, if any named file is actually a directory, then all files and subdirectories are compressed, preserving the directory structure, except that empty directories are not stored, and file attributes (timestamps, permissions, etc.) are not preserved. During extraction, directories are created as needed. For example:

paq8pxd -4 c:\tmp\foo bar

compresses foo and bar (if they exist) to c:\tmp\foo.paq8pxd at level 4.

paq8pxd -d c:\tmp\foo.paq8pxd .

extracts foo and compares bar in the current directory. If foo and bar are directories then their contents are extracted/compared.

There are no commands to update an existing archive or to extract part of an archive. Files and archives larger than 2GB are not supported (but might work on 64-bit machines, not tested). File names with nonprintable characters are not supported (spaces are OK).

TO COMPILE

Make sure bzip2 libraries are present

MINGW g++: pacman -S mingw-w64-x86_64-bzip2

UNIX/Linux: sudo apt-get install libbz2-dev

There are 2 files: paq8pxd.cpp (C++) and wrtpre.cpp (C++). paq8pxd.cpp recognizes the following compiler options:

-DWINDOWS (to compile in Windows)
-DUNIX (to compile in Unix, Linux, etc.)
-DMT (to compile with multithreading support)
-DDEFAULT_OPTION=N (to change the default compression level from 5 to N)

If you compile without -DWINDOWS or -DUNIX, you can still compress files, but you cannot compress directories or create them during extraction. You can extract directories if you manually create the empty directories first.

Use -DDEFAULT_OPTION=N to change the default compression level, for example to support drag and drop on machines with less than 256 MB of memory. Use -DDEFAULT_OPTION=4 for 128 MB, 3 for 64 MB, 2 for 32 MB, etc.

Recommended compiler commands and optimizations:

MINGW g++ (x86,x64):
with multithreading:
g++ paq8pxd.cpp -DWINDOWS -DMT -msse2 -O3 -s -static -lz -lbz2 -o paq8pxd.exe
without multithreading:
g++ paq8pxd.cpp -DWINDOWS -msse2 -O3 -s -static -lz -lbz2 -o paq8pxd.exe

UNIX/Linux (PC x86,x64):
with multithreading:
g++ paq8pxd.cpp -DUNIX -DMT -msse2 -O3 -s -static -lpthread -lz -lbz2 -o paq8pxd
without multithreading:
g++ paq8pxd.cpp -DUNIX -msse2 -O3 -s -static -lpthread -lz -lbz2 -o paq8pxd

Non PC (e.g. PowerPC under MacOS X):
g++ paq8pxd.cpp -O2 -DUNIX -s -lz -lbz2 -o paq8pxd

Alternatively, you can use CMake to build paq8pxd.

CMake recognizes the following options for paq8pxd:
-DUNIX: whether to build for Unix (otherwise, build for Windows).
-DNATIVECPU: whether to build for your CPU (vs. the general public). Default is OFF.
-DMT: whether to enable multithreading. Default is OFF.
-DDISABLE_SM: whether to disable faster statemaps. Default is OFF.

To build for Windows in MinGW with Multithreading and build a native executable for your CPU:
cmake . -G "MSYS Makefiles" -DMT=ON -DNATIVECPU=ON

To build for Unix systems with Multithreading and build a native executable for your CPU:
cmake . -DUNIX=ON -DMT=ON -DNATIVECPU=ON

Then build with make:
make

ARCHIVE FILE FORMAT

An archive has the following format.

paq8pxd -N
segment size
compressed segment size
segment offset
\0 file list size
compressed file list (size TAB filename CR LF size TAB filename CR LF ...)
compressed binary data
file segmentation data
stream data sizes[11]

-N is the option (-0 to -15) and mode, even if a default was used. The byte has the form 00LMNNNN: bit M is set for fast mode, bit L is set for quick mode; if neither L nor M is set, the default is slow mode.

segment size is the total size of the file(s). compressed segment size is the size in bytes of the compressed segmentation data, stored at segment offset after the compressed binary data.

file segmentation data is the full list of detected blocks: type size [info] type size [info] type size ...

info is present only when the block type needs extra information, as for image or audio blocks.

Plain file names are stored without a path. Files in compressed directories are stored with path relative to the compressed directory (using UNIX style forward slashes "/"). For example, given these files:

123 C:\dir1\file1.txt
456 C:\dir2\file2.txt

Then

paq8pxd archive \dir1\file1.txt \dir2

will create archive.paq8pxd

The command:

paq8pxd archive.paq8pxd C:\dir3

will create the files:

C:\dir3\file1.txt
C:\dir3\dir2\file2.txt

Decompression will fail if the first 10 bytes are not "paq8pxd -". Sizes are stored as decimal numbers. CR, LF, TAB are ASCII codes 13, 10, 9 respectively.
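As an illustration, the magic check and level parse might look like the following sketch (the helper names are hypothetical, not paq8pxd's actual code):

```cpp
#include <cctype>
#include <string>

// Hypothetical helpers illustrating the header check: an archive starts
// with "paq8pxd -" followed by the decimal level, so the first 10 bytes
// include at least one level digit.
bool isPaq8pxdArchive(const std::string& header) {
    return header.size() >= 10 &&
           header.compare(0, 9, "paq8pxd -") == 0 &&
           std::isdigit((unsigned char)header[9]);
}

int archiveLevel(const std::string& header) {
    // std::stoi stops at the first non-digit, so trailing archive data
    // after the level does not affect the parse.
    return isPaq8pxdArchive(header) ? std::stoi(header.substr(9)) : -1;
}
```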

ARITHMETIC CODING

The binary data is arithmetic coded as the shortest base 256 fixed point number x = SUM_i x_i 256^(-i-1) such that p(<y) <= x < p(<=y), where y is the input string, x_i is the i'th coded byte, p(<y) (and p(<=y)) means the probability that a string is lexicographically less than (less than or equal to) y according to the model, _ denotes subscript, and ^ denotes exponentiation.

The model p(y) for y is a conditional bit stream, p(y) = PROD_j p(y_j | y_0..j-1) where y_0..j-1 denotes the first j bits of y, and y_j is the next bit. Compression depends almost entirely on the ability to predict the next bit accurately.
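Because p(y) factors into per-bit predictions, the ideal compressed size is -log2 p(y) = SUM_j -log2 p(y_j | y_0..j-1) bits. A small hypothetical helper (not part of paq8pxd) makes this concrete:

```cpp
#include <cmath>
#include <vector>

// Ideal code length in bits for a bit string under a sequential model.
// probs1[j] is the model's p(y_j = 1) before bit j; bits[j] is the
// actual bit. Each bit costs -log2 of the probability assigned to it.
double idealCodeLengthBits(const std::vector<double>& probs1,
                           const std::vector<int>& bits) {
    double total = 0;
    for (size_t j = 0; j < bits.size(); ++j) {
        double p = bits[j] ? probs1[j] : 1.0 - probs1[j];
        total += -std::log2(p);
    }
    return total;
}
```

With p = 0.5 every bit costs exactly 1 bit; the better the predictions, the smaller the total, which is why compression depends almost entirely on prediction accuracy.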

MODEL MIXING

paq8pxd uses a neural network to combine a large number of models. The i'th model independently predicts p1_i = p(y_j = 1 | y_0..j-1) and p0_i = 1 - p1_i. The network computes the next bit probability

p1 = squash(SUM_i w_i t_i), p0 = 1 - p1 (1)

where t_i = stretch(p1_i) is the i'th input, p1_i is the prediction of the i'th model, p1 is the output prediction, stretch(p) = ln(p/(1-p)), and squash(s) = 1/(1+exp(-s)). Note that squash() and stretch() are inverses of each other.
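A floating-point sketch of these two functions (the real program uses fixed-point table approximations):

```cpp
#include <cmath>

// stretch(p) = ln(p/(1-p)) maps a probability to the logistic domain;
// squash(s) = 1/(1+exp(-s)) maps it back. They are exact inverses.
double stretch(double p) { return std::log(p / (1.0 - p)); }
double squash(double s) { return 1.0 / (1.0 + std::exp(-s)); }
```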

After bit y_j (0 or 1) is received, the network is trained:

w_i := w_i + eta t_i (y_j - p1) (2)

where eta is an ad-hoc learning rate, t_i is the i'th input, (y_j - p1) is the prediction error for the j'th bit, and w_i is the i'th weight. Note that this differs from back propagation:

w_i := w_i + eta t_i (y_j - p1) p0 p1 (3)

which is a gradient descent in weight space to minimize root mean square error. Rather, the goal in compression is to minimize coding cost, which is -log(p1) if y_j = 1 or -log(p0) if y_j = 0. Taking the partial derivative of cost with respect to w_i yields (2).
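The prediction rule (1) and training rule (2) can be sketched in floating point as follows (paq8pxd itself uses fixed-point arithmetic; the names are illustrative):

```cpp
#include <cmath>
#include <vector>

// Minimal mixer sketch: predict() computes squash(SUM_i w_i t_i) per
// rule (1); update() applies rule (2), w_i += eta * t_i * (y - p1),
// which is gradient descent on coding cost rather than squared error.
struct Mixer {
    std::vector<double> w;   // one weight per model
    double eta;              // learning rate
    std::vector<double> t;   // stretched inputs for the last bit
    double p1 = 0.5;         // last output prediction

    Mixer(size_t n, double rate) : w(n, 0.0), eta(rate) {}

    double predict(const std::vector<double>& inputs) {
        t = inputs;
        double s = 0;
        for (size_t i = 0; i < w.size(); ++i) s += w[i] * t[i];
        p1 = 1.0 / (1.0 + std::exp(-s));   // squash
        return p1;
    }
    void update(int y) {     // y is the actual bit, 0 or 1
        for (size_t i = 0; i < w.size(); ++i)
            w[i] += eta * t[i] * (y - p1);
    }
};
```

With zero weights the mixer outputs 0.5; training on a bit moves subsequent predictions toward inputs that agreed with that bit.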

MODELS

Most models are context models. A function of the context (last few bytes) is mapped by a lookup table or hash table to a state which depends on the bit history (the prior sequence of 0 and 1 bits seen in this context). The bit history is then mapped to p1_i by a fixed or adaptive function. Several types of bit history states are used.

CONTEXTS

High compression is achieved by combining a large number of contexts. Most (not all) contexts start on a byte boundary and end on the bit immediately preceding the predicted bit. Contexts are modeled with both a run map and a nonstationary map unless indicated otherwise.

ARCHITECTURE

The context models are mixed by several of several hundred neural networks selected by a low-order context. The outputs of these networks are combined using a second neural network, then fed through several stages of adaptive probability maps (APM) before arithmetic coding.

For images, only one neural network is used and its context is fixed.

An APM is a stationary map combining a context and an input probability. The input probability is stretched and divided into 32 segments to combine with other contexts. The output is interpolated between two adjacent quantized values of stretch(p1). There are 2 APM stages in series:

p1 := (p1 + 3 APM(order 0, p1)) / 4
p1 := (APM(order 1, p1) + 2 APM(order 2, p1) + APM(order 3, p1)) / 4
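A floating-point sketch of one such APM stage (the table size, grid range, and learning rate here are illustrative; the real code uses fixed point):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Each context owns 33 probabilities at equally spaced stretch values
// in [-4, 4]. pp() interpolates between the two entries bracketing
// stretch(p1); update() nudges those entries toward the observed bit.
struct APM {
    std::vector<double> t;   // 33 entries per context
    int idx = 0;             // low entry used by the last pp() call
    double w = 0;            // interpolation weight of the high entry

    explicit APM(int contexts) : t(contexts * 33) {
        for (size_t i = 0; i < t.size(); ++i) {
            double s = (double(i % 33) - 16.0) / 4.0;  // stretch grid
            t[i] = 1.0 / (1.0 + std::exp(-s));         // squash(s)
        }
    }
    double pp(double p1, int cx) {       // refine p1 given context cx
        double s = std::log(p1 / (1.0 - p1));          // stretch(p1)
        s = std::max(-4.0, std::min(4.0, s));
        double pos = (s + 4.0) * 4.0;                  // in [0, 32]
        int lo = std::min(31, (int)pos);
        w = pos - lo;
        idx = cx * 33 + lo;
        return t[idx] * (1.0 - w) + t[idx + 1] * w;
    }
    void update(int y, double rate = 0.02) {
        t[idx]     += (y - t[idx])     * (1.0 - w) * rate;
        t[idx + 1] += (y - t[idx + 1]) * w * rate;
    }
};
```

The two stages above would then chain pp() calls, each stage with its own context order.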

PREPROCESSING

paq8pxd uses preprocessing transforms on certain data types to improve compression. To improve reliability, the decoding transform is tested during compression to ensure that the input file can be restored. If the decoder output is not identical to the input file due to a bug, then the transform is abandoned and the data is compressed without a transform so that it will still decompress correctly.

The input is split into blocks with the format <type> <decoded size> <data>, where <type> is 1 byte (0 = no transform) and <decoded size> is the size of the data after decoding, which may be different than the size of <data>. Data is stored uncompressed after compressed data ends. The preprocessor has 3 parts.

IMPLEMENTATION

Hash tables are designed to minimize cache misses, which consume most of the CPU time.

Most of the memory is used by the nonstationary context models. Contexts are represented by 32 bits, possibly a hash. These are mapped to a bit history, represented by 1 byte. The hash table is organized into 64-byte buckets on cache line boundaries. Each bucket contains 7 x 7 bit histories, 7 16-bit checksums, and a 2 element LRU queue packed into one byte. Each 7 byte element represents 7 histories for a context ending on a 3-bit boundary plus 0-2 more bits. One element (for bits 0-1, which have 4 unused bytes) also contains a run model consisting of the last byte seen and a count (as 1 byte each).
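The 64-byte bucket can be pictured as the following struct (the field order and names are illustrative; the real layout is hand-packed):

```cpp
#include <cstdint>

// One cache-line bucket: 7 slots of 7 bit-history bytes, one 16-bit
// checksum per slot, and a 2-element LRU queue packed into one byte.
// 14 + 1 + 49 = 64 bytes.
struct Bucket {
    uint16_t checksums[7];   // 14 bytes: detect hash collisions
    uint8_t  lru;            //  1 byte: 2-element LRU queue
    uint8_t  slots[7][7];    // 49 bytes: 7 bit histories per slot
};
static_assert(sizeof(Bucket) == 64, "bucket fills one cache line");
```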

Run models use 4 byte hash elements consisting of a 2 byte checksum, a repeat count (0-255) and the byte value. The count also serves as a priority.
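A run-model element as described might be laid out like this (names are illustrative):

```cpp
#include <cstdint>

// 4-byte run-model element: a 2-byte checksum of the context, a repeat
// count (0-255, which also serves as the priority), and the last byte
// value seen in this context.
struct RunElement {
    uint16_t checksum;  // detects hash collisions
    uint8_t  count;     // how many times 'value' has repeated
    uint8_t  value;     // the byte predicted to repeat

    void update(uint8_t b) {   // record byte b seen in this context
        if (b == value) { if (count < 255) ++count; }
        else { value = b; count = 1; }
    }
};
static_assert(sizeof(RunElement) == 4, "packs into 4 bytes");
```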

Stationary models are most appropriate for small contexts, so the context is used as a direct table lookup without hashing.

The match model maintains a pointer to the last match until a mismatching bit is found. At the start of the next byte, the hash table is referenced to find another match. The hash table of pointers is updated after each whole byte. There is no checksum. Collisions are detected by comparing the current and matched context in a rotating buffer.

The inner loops of the neural network prediction (1) and training (2) algorithms are implemented in MMX assembler, which computes 4 elements at a time. Using assembler is 8 times faster than C++ for this code and 1/3 faster overall. (However I found that SSE2 code on an AMD-64, which computes 8 elements at a time, is not any faster).

SEE ALSO

paq8px https://github.com/hxim/paq8px