kodejuice / arithmetic-compressor

Arithmetic coding algorithm in Python, along with statistical data compression models such as PPM and context mixing.

Arithmetic Coding Library


This library is an implementation of the Arithmetic Coding algorithm in Python, along with adaptive statistical data compression models like PPM (Prediction by Partial Matching), Context Mixing and Simple Adaptive models.

Installation

To install the library, you can use pip:

pip install arithmetic_compressor

Usage

from arithmetic_compressor import AECompressor
from arithmetic_compressor.models import StaticModel

# create the model
model = StaticModel({'A': 0.5, 'B': 0.25, 'C': 0.25})

# create an arithmetic coder
coder = AECompressor(model)

# encode some data
data = "AAAAAABBBCCC"
N = len(data)
compressed = coder.compress(data)

# print the compressed data
print(compressed) # => [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1]

And here's an example of how to decode the encoded data:

decoded = coder.decompress(compressed, N)

print(decoded) # => ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C']
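
As a rough check on the size of the compressed output, the ideal code length of a message under a fixed model is the sum of -log2(p) over its symbols. The sketch below uses only the standard library, not this package; `ideal_code_length` is a hypothetical helper name introduced for illustration.

```python
import math

# Same probabilities and input as the StaticModel example above
probabilities = {'A': 0.5, 'B': 0.25, 'C': 0.25}
data = "AAAAAABBBCCC"

def ideal_code_length(data, probabilities):
    """Sum of -log2(p) over the symbols: the information content in bits."""
    return sum(-math.log2(probabilities[symbol]) for symbol in data)

bits = ideal_code_length(data, probabilities)
print(bits)  # 18.0
```

The 19-bit output shown above is one bit more than this bound, which is consistent with the small constant overhead an arithmetic coder typically needs to terminate the stream.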

API Reference

Arithmetic Compressing

Models

In addition to the arithmetic coding algorithm, this library also includes adaptive statistical models that can be used to improve compression.

All models implement these common methods:

Models Usage

A closer look at all the models.

Simple Models

The Simple Adaptive models adapt the probability of each symbol based on how frequently that symbol occurs in the data.

Here's an example of how to use the Simple Adaptive models included in the library:

from arithmetic_compressor import AECompressor
from arithmetic_compressor.models import\
   BaseFrequencyTable,\
   SimpleAdaptiveModel

# create the model
# model = SimpleAdaptiveModel({'A': 0.5, 'B': 0.25, 'C': 0.25})
model = BaseFrequencyTable({'A': 0.5, 'B': 0.25, 'C': 0.25})

# create an arithmetic coder
coder = AECompressor(model)

# encode some data
data = "AAAAAABBBCCC"
compressed = coder.compress(data)

# print the compressed data
print(compressed) # => [0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1]

BaseFrequencyTable adapts incrementally to the statistics of the input data, whereas SimpleAdaptiveModel is essentially an exponential moving average whose speed of adaptation is set by its adaptation_rate.
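
The difference between the two update rules can be sketched with two small stand-in classes. These are not the library's implementations: `CountingModel`, `MovingAverageModel`, and the `rate` parameter are hypothetical names, and the real classes may differ in initialisation and details.

```python
class CountingModel:
    """Hypothetical stand-in: probability = count / total, one count added per symbol."""
    def __init__(self, symbols):
        # Start every count at 1 so that no symbol ever has zero probability
        self.counts = {s: 1 for s in symbols}

    def update(self, symbol):
        self.counts[symbol] += 1

    def probability(self, symbol):
        return self.counts[symbol] / sum(self.counts.values())


class MovingAverageModel:
    """Hypothetical stand-in: exponential moving average with a fixed rate."""
    def __init__(self, symbols, rate=0.5):
        self.rate = rate
        self.probs = {s: 1 / len(symbols) for s in symbols}

    def update(self, symbol):
        # Move each probability a fraction `rate` of the way towards 1 (seen) or 0 (not seen)
        for s in self.probs:
            target = 1.0 if s == symbol else 0.0
            self.probs[s] += self.rate * (target - self.probs[s])

    def probability(self, symbol):
        return self.probs[symbol]


counting = CountingModel(['A', 'B'])
averaging = MovingAverageModel(['A', 'B'], rate=0.5)
for symbol in "AAA":
    counting.update(symbol)
    averaging.update(symbol)

print(counting.probability('A'))   # 0.8
print(averaging.probability('A'))  # 0.9375
```

With a high rate the moving average reacts to recent symbols much faster than the counts do, and it also forgets older data faster.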

PPM models

https://en.wikipedia.org/wiki/Prediction_by_partial_matching

PPM (Prediction by Partial Matching) models are a type of context modeling that uses a set of previous symbols to predict the probability of the next symbol. Here's an example of how to use the PPM models included in the library:

from arithmetic_compressor import AECompressor
from arithmetic_compressor.models import\
   PPMModel,\
   MultiPPM

# create the model
model = PPMModel(['A', 'B', 'C'], k = 3) # no need to pass in probabilities, only symbols

# create an arithmetic coder
coder = AECompressor(model)

# encode some data
data = "AAAAAABBBCCC"
compressed = coder.compress(data)

# print the compressed data
print(compressed) # => [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1]
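
To illustrate what "using previous symbols as context" means, here is a minimal sketch of order-k context counting. It is not the library's PPMModel: `context_counts` is a hypothetical helper, and the sketch leaves out the escape mechanism that a real PPM model uses to fall back to shorter contexts.

```python
from collections import Counter, defaultdict

def context_counts(data, k):
    """Map each length-k context to a Counter of the symbols that followed it."""
    table = defaultdict(Counter)
    for i in range(k, len(data)):
        context = data[i - k:i]      # the k symbols preceding position i
        table[context][data[i]] += 1
    return table

table = context_counts("AAAAAABBBCCC", k=2)
print(dict(table["AA"]))  # {'A': 4, 'B': 1}
```

After the context "AA", the symbol 'A' followed four times and 'B' once, so a model built on these counts would assign 'A' a high probability whenever the last two symbols are "AA".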

The MultiPPM model uses weighted averaging to combine the predictions of several PPM models. It gives better compression when the input is large.

# create the model
model = MultiPPM(['A', 'B', 'C'], models = 4) # will combine PPM models with context sizes of 0 to 4

# create an arithmetic coder
coder = AECompressor(model)

# encode some data
data = "AAAAAABBBCCC"
compressed = coder.compress(data)

# print the compressed data
print(compressed) # => [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1]

Binary version

The binary PPM models, BinaryPPM and MultiBinaryPPM, behave just like the PPM models demonstrated above, except that they only work with the binary symbols 0 and 1.

from arithmetic_compressor import AECompressor
from arithmetic_compressor.models import\
   BinaryPPM,\
   MultiBinaryPPM

# create the model
model = BinaryPPM(k = 3)

# create an arithmetic coder
coder = AECompressor(model)

# encode some data
data = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1]
compressed = coder.compress(data)

# print the compressed data
print(compressed) # => [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1]

Likewise, MultiBinaryPPM combines several binary PPM models using weighted averaging to make a prediction.

Context Mixing models

Context mixing is a type of data compression algorithm in which the next-symbol predictions of two or more statistical models are combined to yield a prediction that is often more accurate than any of the individual predictions.

Two general approaches have been used: linear and logistic mixing. Linear mixing uses a weighted average of the predictions, weighted by evidence, while logistic (or neural network) mixing first transforms the predictions into the logistic domain, log(p/(1-p)), before averaging.
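
The two approaches can be compared on a pair of example predictions. This is a minimal sketch using only the standard library; the function names (`stretch`, `squash`, `mix_linear`, `mix_logistic`) and the fixed weights are assumptions for illustration and are not taken from this package.

```python
import math

def stretch(p):
    """Map a probability to the logistic domain: ln(p / (1 - p))."""
    return math.log(p / (1 - p))

def squash(x):
    """Inverse of stretch: map a logistic-domain value back to a probability."""
    return 1 / (1 + math.exp(-x))

def mix_linear(predictions, weights):
    """Weighted average of the probabilities themselves."""
    return sum(w * p for w, p in zip(weights, predictions)) / sum(weights)

def mix_logistic(predictions, weights):
    """Weighted sum in the logistic domain, then squash back to a probability."""
    return squash(sum(w * stretch(p) for w, p in zip(weights, predictions)))

predictions = [0.9, 0.6]   # two models' estimates that the next bit is 1
weights = [0.5, 0.5]

linear = mix_linear(predictions, weights)
logistic = mix_logistic(predictions, weights)
print(linear, logistic)
```

With these inputs the linear mix is 0.75 and the logistic mix is roughly 0.786: in the logistic domain, a confident prediction such as 0.9 pulls the result further than it does in a plain average.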

The library contains a minimal implementation of the algorithm. Only the core algorithm is implemented, and it doesn't include as many contexts / models as PAQ.

Note: They only work for binary symbols (0 and 1).

Linear Mixing

https://en.wikipedia.org/wiki/Context_mixing#Linear_Mixing

The mixer computes a probability as a weighted sum of the predictions of the N models.

from arithmetic_compressor import AECompressor
from arithmetic_compressor.models import ContextMix_Linear

# create the model
model = ContextMix_Linear()

# create an arithmetic coder
coder = AECompressor(model)

# encode some data
data = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
compressed = coder.compress(data)

# print the compressed data
print(compressed) # => [0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1]

The Linear Mixing model lets you combine other models:

from arithmetic_compressor import AECompressor
from arithmetic_compressor.models import ContextMix_Linear,\
   SimpleAdaptiveModel,\
   PPMModel,\
   BaseFrequencyTable

# create the model
model = ContextMix_Linear([
  SimpleAdaptiveModel({0: 0.5, 1: 0.5}),
  BaseFrequencyTable({0: 0.5, 1: 0.5}),
  PPMModel([0, 1], k = 10)
])

# create an arithmetic coder
coder = AECompressor(model)

# encode some data
data = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
compressed = coder.compress(data)

# print the compressed data
print(compressed) # => [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

Logistic (Neural Network) Mixing

https://en.wikipedia.org/wiki/PAQ#Neural-network_mixing

A neural network is used to combine models.

from arithmetic_compressor import AECompressor
from arithmetic_compressor.models import ContextMix_Logistic

# create the model
model = ContextMix_Logistic()

# create an arithmetic coder
coder = AECompressor(model)

# encode some data
data = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
compressed = coder.compress(data)

# print the compressed data
print(compressed) # => [0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1]
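
For intuition about how such a network learns, the sketch below follows the single-layer update described for PAQ-style mixers: after each bit, every weight moves in proportion to the prediction error times that model's stretched input. `LogisticMixer` and its `learning_rate` parameter are hypothetical names; this is not the code behind ContextMix_Logistic.

```python
import math

def stretch(p):
    """ln(p / (1 - p)): probability to logistic domain."""
    return math.log(p / (1 - p))

def squash(x):
    """Inverse of stretch."""
    return 1 / (1 + math.exp(-x))

class LogisticMixer:
    """Hypothetical single-layer mixer with one weight per input model."""
    def __init__(self, n_inputs, learning_rate=0.1):
        self.weights = [1.0 / n_inputs] * n_inputs
        self.learning_rate = learning_rate
        self.inputs = [0.0] * n_inputs
        self.output = 0.5

    def predict(self, predictions):
        # Remember the stretched inputs and the output for the next update
        self.inputs = [stretch(p) for p in predictions]
        self.output = squash(sum(w * x for w, x in zip(self.weights, self.inputs)))
        return self.output

    def update(self, bit):
        # Gradient step that reduces the coding cost of the bit just seen
        error = bit - self.output
        self.weights = [w + self.learning_rate * error * x
                        for w, x in zip(self.weights, self.inputs)]

mixer = LogisticMixer(2)
first = mixer.predict([0.9, 0.2])
for _ in range(20):
    mixer.predict([0.9, 0.2])  # first model leans towards 1, second towards 0
    mixer.update(1)            # the actual bit is always 1
last = mixer.predict([0.9, 0.2])
print(mixer.weights, first, last)
```

Because the bit is always 1 here, the weight of the model that predicted 0.9 grows, the weight of the model that predicted 0.2 shrinks, and the mixed prediction moves closer to 1 over time.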

Note: This library is intended for learning and educational purposes only. The implementation may not be optimized for performance or memory usage and may not be suitable for use in production environments. Please consider the performance and security issues before using it in production. Please also note that you should thoroughly test the library and its models with real-world data and use cases before deploying it in production.

More Examples

You can find more detailed examples in the /examples folder in the repository. These examples demonstrate the capabilities of the library and show how to use the different models.

Contribution

Contributions to the library are very welcome. If you have an idea for a new feature or have found a bug, please submit an issue or a pull request.

License

This library is distributed under the MIT License. See the LICENSE file for more information.