Open shitpoet opened 1 year ago
Oh, sorry, I am just seeing this. I have no idea why I wasn't notified 🤔
About the issue: this usually happens when there's a symbol in the data that is not present in your frequency map, most likely a Unicode symbol. So setting nsyms = 256 is not enough; you probably need to go through the data to collect every symbol that actually occurs and build the frequency map from that.
Something like this:
data = read_bytes(fn)
freq_map = {}
for c in data:
    freq_map[c] = freq_map.get(c, 0) + 1
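A quick way to test the "symbol missing from the frequency map" theory is to list the symbols that occur in the data but have no positive count in the map. The sketch below uses only the standard library; `build_freq_map` and `missing_symbols` are hypothetical helper names for illustration and are not part of arithmetic_compressor.

```python
from collections import Counter


def build_freq_map(data):
    """Count how often each symbol occurs in `data` (an iterable of ints)."""
    return dict(Counter(data))


def missing_symbols(data, freq_map):
    """Return the sorted symbols in `data` that have no positive count in `freq_map`."""
    return sorted({s for s in data if freq_map.get(s, 0) <= 0})


sample = list(b"abracadabra")
freq_map = build_freq_map(sample)

print(missing_symbols(sample, freq_map))        # [] : every symbol is covered
print(missing_symbols(list(b"abz"), freq_map))  # [122] : 'z' is not in the map
```

If the second call returns a non-empty list for the real data, the model would be asked to encode a symbol it has no frequency for.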
Again, sorry for the late reply
Hmm. I checked the maximum symbol value in enwik5.dat after the read_bytes function and it is 237. So it seems that there are no Unicode symbols after fopen(..., "rb").
From Python 3.11 docs:
Files opened in binary mode (including 'b' in the mode argument) return contents as bytes objects without any decoding.
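The behaviour described in the docs can be checked with a small, self-contained sketch: a non-ASCII character written as UTF-8 comes back from a binary-mode read as separate byte values, each in the range 0..255. The temporary file here is only for illustration and stands in for the real data file.

```python
import os
import tempfile

text = "é"  # one Unicode character, encoded as two bytes in UTF-8

# Write the encoded bytes to a temporary file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(text.encode("utf-8"))
    path = tmp.name

# Binary mode returns raw bytes; list() turns them into ints in 0..255.
with open(path, "rb") as f:
    symbols = list(f.read())
os.remove(path)

print(symbols)             # [195, 169]
print(max(symbols) < 256)  # True
```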
I just tested this code and it works fine for me? 🤔
fn = 'enwik5.zip'
print(fn)

def read_bytes(path):
    with open(path, 'rb') as f:
        return list(f.read())

data = read_bytes(fn)
nsyms = 256
stats = [0] * nsyms
for c in data:
    stats[c] += 1

from arithmetic_compressor import AECompressor
from arithmetic_compressor.models.base_adaptive_model import BaseFrequencyTable
from arithmetic_compressor.models import StaticModel
from arithmetic_compressor.util import *

SCALE_FACTOR = 4096
freq_map = {
    sym: freq for sym, freq in enumerate(stats)
    if freq > 0
}
model = StaticModel(freq_map)
coder = AECompressor(model)
N = len(data)
compressed = coder.compress(data)
print(len(data))
print(len(compressed) // 8)
Output:
enwik5.zip
34850
34823
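The two sizes above are almost identical, which is roughly what one would expect when a static order-0 model is fed already-compressed input such as a .zip file, since its byte values tend to be close to uniformly distributed. As a sanity check, the sketch below estimates the order-0 Shannon bound in bytes from the symbol counts alone; `order0_bound_bytes` is a hypothetical helper, not part of arithmetic_compressor.

```python
import math
from collections import Counter


def order0_bound_bytes(data):
    """Estimate the order-0 entropy bound for `data`, in bytes.

    Each symbol with count c out of n symbols costs log2(n / c) bits.
    """
    n = len(data)
    if n == 0:
        return 0.0
    counts = Counter(data)
    bits = sum(c * math.log2(n / c) for c in counts.values())
    return bits / 8


# Uniform data: every byte value once, so 8 bits per symbol and no gain.
print(order0_bound_bytes(bytes(range(256))))  # 256.0

# Skewed data: two equally likely symbols, so 1 bit per symbol.
print(order0_bound_bytes(b"aabb"))            # 0.5
```

Running this on the same `data` list that is passed to the coder gives a figure to compare against `len(compressed) // 8`.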
I am also on version 3.11
I'm trying to use this module on enwik5 data (10 000 bytes). But I encounter this error:
Are there any additional limitations in the implementation? Or do I do something wrong?
The script below works ok with enwik4 data (1 000 bytes).
I count statistics myself and then use StaticModel, but I encounter either the "Low or high out of range" error or the "ValueError: Symbol has zero frequency" error.
enwik5.zip