kodejuice / arithmetic-compressor

Arithmetic coding algorithm in python along with statistical data compression models like ppm, context mixing, etc.
MIT License

AssertionError: Low or high out of range #1

Open shitpoet opened 1 year ago

shitpoet commented 1 year ago

I'm trying to use this module on enwik5 data (10 000 bytes). But I encounter this error:

AssertionError: Low or high out of range

Are there any additional limitations in the implementation? Or am I doing something wrong?

The script below works ok with enwik4 data (1 000 bytes).

I count the statistics myself and then use StaticModel, but I get either this "Low or high out of range" error or "ValueError: Symbol has zero frequency".

enwik5.zip

fn = 'enwik5'

print(fn)

def read_bytes(path):
    with open(path, 'rb') as f:
        return list(f.read())

data = read_bytes(fn)
nsyms = 256
stats = [0] * nsyms
for c in data:
    stats[c] += 1

from arithmetic_compressor import AECompressor

from arithmetic_compressor.models.base_adaptive_model import BaseFrequencyTable
from arithmetic_compressor.util import *

SCALE_FACTOR = 4096

class StaticModel:
  """A static model, which does not adapt to input data or statistics."""

  def __init__(self, counts_dict):
    #vals = (v for k, v in counts_dict.items())
    #counts_sum = sum(vals)
    #probability = {k: v / counts_sum for k, v in counts_dict.items()}
    #print(probability)
    probability = counts_dict

    symbols = list(probability.keys())

    self.name = "Static"
    self.symbols = symbols
    self.__prob = dict(probability)

    # compute cdf from given probability
    cdf = {}
    prev_freq = 0
    self.freq = freq = {sym: round(SCALE_FACTOR * prob)
                        for sym, prob in probability.items()}
    for sym, freq in freq.items():
      cdf[sym] = Range(prev_freq, prev_freq + freq)
      prev_freq += freq
    self.cdf_object = cdf

  def cdf(self):
    return self.cdf_object

  def probability(self):
    return self.__prob

  def predict(self, symbol):
    assert symbol in self.symbols
    return self.probability()[symbol]

  def update(self, symbol):
    pass

  def test_model(self, gen_random=True, N=10000, custom_data=None):
    self.name = "Static Model"
    return BaseFrequencyTable.test_model(self, gen_random, N, custom_data)

freq_map = {
    sym: freq for sym, freq in enumerate(stats)
    if freq > 0
}

model = StaticModel(freq_map)
coder = AECompressor(model)

N = len(data)
compressed = coder.compress(data)
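
A possible explanation, offered as an assumption rather than a confirmed diagnosis: the script above assigns `probability = counts_dict`, so `round(SCALE_FACTOR * prob)` multiplies raw counts, and the frequencies sum to SCALE_FACTOR times the input length (4096 × 10 000 for the size reported above, ten times the enwik4 total). If the coder bounds the total frequency, that could trip the range assertion. Dividing by the total instead lets a symbol seen once round to 0 (4096 / 10 000 ≈ 0.41), which would be consistent with the zero-frequency error. The sketch below scales counts to integers that sum to a fixed total while keeping every symbol at 1 or more. `scale_counts` and its `total` parameter are hypothetical names, not part of arithmetic_compressor.

```python
def scale_counts(counts, total=4096):
    """Scale raw counts to positive integer frequencies that sum to `total`.

    `counts` maps symbol -> raw count (each > 0). `total` is a hypothetical
    frequency budget; what the coder actually accepts is not known here.
    """
    if not counts:
        raise ValueError("counts must not be empty")
    if len(counts) > total:
        raise ValueError("total too small to give every symbol frequency >= 1")

    n = sum(counts.values())
    # Floor-scale each count, but never let a symbol drop to zero.
    freqs = {sym: max(1, c * total // n) for sym, c in counts.items()}

    # Correct the rounding drift, visiting the most frequent symbols first.
    diff = total - sum(freqs.values())
    order = sorted(freqs, key=lambda s: counts[s], reverse=True)
    i = 0
    while diff != 0:
        sym = order[i % len(order)]
        step = 1 if diff > 0 else -1
        if freqs[sym] + step >= 1:  # keep every frequency positive
            freqs[sym] += step
            diff -= step
        i += 1
    return freqs


example = scale_counts({97: 9000, 98: 999, 99: 1}, total=4096)
print(example, sum(example.values()))
```

The resulting integers could feed the cumulative ranges directly (in place of `round(SCALE_FACTOR * prob)`), so the total stays at `total` regardless of input length.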

kodejuice commented 1 year ago

Oh, sorry, I am just seeing this. I have no idea why I wasn't notified 🤔

About the issue: this usually happens when there's a symbol in the data that is not present in your frequency map, most likely a Unicode symbol. So setting nsyms = 256 is not enough; you probably need to go through the data to collect every symbol that occurs and build a frequency map from that.

Something like this:

data = read_bytes(fn)
freq_map = {}
for c in data:
    freq_map[c] = freq_map.get(c, 0) + 1

Again, sorry for the late reply.
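
To test the missing-symbol theory directly, a small check along these lines could list any symbol that occurs in the data but has no entry in the frequency map. `missing_symbols` is a hypothetical helper written for illustration, not part of the library, and the sample data is made up.

```python
def missing_symbols(data, freq_map):
    """Return the sorted symbols that occur in `data` but are absent from `freq_map`."""
    return sorted(set(data) - set(freq_map))


# Illustrative data only; substitute the bytes read from the real file.
data = list(b"abracadabra")
freq_map = {}
for c in data:
    freq_map[c] = freq_map.get(c, 0) + 1

# An empty list means every symbol in the data has a frequency entry.
print(missing_symbols(data, freq_map))
```

If the list comes back empty for the real input, a missing symbol is unlikely to be the cause and the frequencies themselves are the next thing to inspect.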

shitpoet commented 1 year ago

Hmm. I checked the maximum symbol value in enwik5.dat after the read_bytes function and it is 237. So it seems that there are no Unicode symbols after open(..., "rb").

From Python 3.11 docs:

Files opened in binary mode (including 'b' in the mode argument) return contents as bytes objects without any decoding.
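
The quoted behavior can be confirmed with a short, self-contained sketch: text containing non-ASCII characters is written as UTF-8 and read back in binary mode, and every value in the resulting list is an integer between 0 and 255. The sample string and the temporary file are illustrative only.

```python
import os
import tempfile

# Sample text with multi-byte characters (illustrative only).
text = "naïve café 🤔"

with tempfile.NamedTemporaryFile("wb", delete=False) as f:
    f.write(text.encode("utf-8"))
    path = f.name

try:
    # Same pattern as read_bytes(): binary mode, then list() of the bytes object.
    with open(path, "rb") as f:
        values = list(f.read())
finally:
    os.remove(path)

# Each element is a single byte value, never a Unicode code point.
assert all(isinstance(v, int) and 0 <= v <= 255 for v in values)
print(len(text), len(values), max(values))
```

So a 256-entry table indexed by byte value covers everything read_bytes can return, whatever characters the file encodes.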

kodejuice commented 1 year ago

I just tested this code and it works fine for me? 🤔

fn = 'enwik5.zip'

print(fn)

def read_bytes(path):
    with open(path, 'rb') as f:
        return list(f.read())

data = read_bytes(fn)
nsyms = 256
stats = [0] * nsyms
for c in data:
    stats[c] += 1

from arithmetic_compressor import AECompressor

from arithmetic_compressor.models.base_adaptive_model import BaseFrequencyTable
from arithmetic_compressor.models import StaticModel
from arithmetic_compressor.util import *

SCALE_FACTOR = 4096

freq_map = {
    sym: freq for sym, freq in enumerate(stats)
    if freq > 0
}

model = StaticModel(freq_map)
coder = AECompressor(model)

N = len(data)
compressed = coder.compress(data)

print(len(data))
print(len(compressed) // 8)

Output:

enwik5.zip
34850
34823
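
The two sizes above differ by less than 0.1%. As a rough sanity check, and only as a sketch, the order-0 entropy of the input gives the size an ideal static byte model would approach (ignoring any coder overhead). Input whose byte distribution is close to uniform, such as already-compressed data, yields a value close to the input length. `order0_entropy_bytes` is a hypothetical helper, not part of the library.

```python
import math
from collections import Counter


def order0_entropy_bytes(data):
    """Estimate the order-0 entropy of `data`, expressed in bytes."""
    n = len(data)
    if n == 0:
        return 0.0
    counts = Counter(data)
    # Sum over symbols of count * -log2(p), where p = count / n.
    bits = -sum(c * math.log2(c / n) for c in counts.values())
    return bits / 8


# Illustrative inputs: uniform bytes are incompressible, a constant run is not.
print(order0_entropy_bytes(bytes(range(256))))
print(order0_entropy_bytes(b"aaaa"))
```

Comparing this estimate with `len(compressed) // 8` for a given file shows how close the coder gets to what a static model allows.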

kodejuice commented 1 year ago

I am also on Python 3.11.