Thanks for your great work! I have a small question: can you give me a demo showing how to use a language model to compress a text and then decompress it? Thanks!
You can use train.py to train a small language model, and then compress/decompress as follows. (The import path below is an assumption based on the compressors/language_model.py module mentioned further down; adjust it to your checkout.)

```python
import random

from compressors import language_model  # assumed import path; see note above

length = 32
data = random.randbytes(length)  # 32 random bytes to round-trip (Python 3.9+)

# Compress, returning the number of padding bits so decompression is exact.
compressed_data, num_padded_bits = language_model.compress(
data,
return_num_padded_bits=True,
use_slow_lossless_compression=True,
)
# Invert the compression; the padding-bit count and original length are required.
decompressed_data = language_model.decompress(
compressed_data,
num_padded_bits=num_padded_bits,
uncompressed_length=length,
)
assert data == decompressed_data  # the round trip must be lossless
print('compression rate', len(compressed_data) / len(data))
```
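Since the question asks about text specifically, a minimal variation (reusing the same compress/decompress API and imports as in the demo above) is to encode the string to bytes first:

```python
# Hypothetical text round trip; the API calls mirror the byte demo above.
text = 'hello, arithmetic coding!'
data = text.encode('utf-8')

compressed_data, num_padded_bits = language_model.compress(
    data,
    return_num_padded_bits=True,
    use_slow_lossless_compression=True,
)
decompressed_data = language_model.decompress(
    compressed_data,
    num_padded_bits=num_padded_bits,
    uncompressed_length=len(data),
)
assert decompressed_data.decode('utf-8') == text
```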
Note that, due to numerical issues, when trying to compress and decompress, one needs to compute the tokens' pdfs separately for every proper subsequence of the input sequence. This has a time complexity of O(n^2), whereas computing the pdfs in a single pass is O(n). See compressors/language_model.py for the implementation details.
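To make the trade-off concrete, here is a minimal sketch of the two evaluation modes; model_pdf is a hypothetical stand-in for a model forward pass that returns one next-token pdf per input position:

```python
# Sketch only: `model_pdf` is a hypothetical forward pass returning one
# next-token pdf per position of its input sequence.

def pdfs_single_pass(tokens):
    # O(n): a single forward pass yields every pdf at once, but tiny
    # numerical differences can break bit-exact decompression.
    return model_pdf(tokens)

def pdfs_per_prefix(tokens):
    # O(n^2): run the model on each prefix separately and keep only the
    # last position's pdf, matching what the decompressor can recompute.
    return [model_pdf(tokens[: i + 1])[-1] for i in range(len(tokens))]
```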
thank you