Thanks for your great work! I have a small question: can you give me a demo showing how to use a language model to compress a text and then decompress it? Thanks!
You can use train.py to train a small language model, and then compress/decompress as follows. (The import path below is an assumption based on the compressors/language_model.py module mentioned further down; adjust it to your checkout.)

```python
import random

from compressors import language_model  # assumed import path; see note above

length = 32
data = random.randbytes(length)  # 32 random bytes to round-trip (Python 3.9+)

# Compress, returning the number of padding bits so decompression is exact.
compressed_data, num_padded_bits = language_model.compress(
data,
return_num_padded_bits=True,
use_slow_lossless_compression=True,
)
# Invert the compression; the padding-bit count and original length are required.
decompressed_data = language_model.decompress(
compressed_data,
num_padded_bits=num_padded_bits,
uncompressed_length=length,
)
assert data == decompressed_data  # the round trip must be lossless
print('compression rate', len(compressed_data) / len(data))
```
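Since the question asks about text specifically, a minimal variation (reusing the same compress/decompress API and imports as in the demo above) is to encode the string to bytes first:

```python
# Hypothetical text round trip; the API calls mirror the byte demo above.
text = 'hello, arithmetic coding!'
data = text.encode('utf-8')

compressed_data, num_padded_bits = language_model.compress(
    data,
    return_num_padded_bits=True,
    use_slow_lossless_compression=True,
)
decompressed_data = language_model.decompress(
    compressed_data,
    num_padded_bits=num_padded_bits,
    uncompressed_length=len(data),
)
assert decompressed_data.decode('utf-8') == text
```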
Note that, due to numerical issues, when trying to compress and decompress, one needs to compute the tokens' pdfs separately for every proper subsequence of the input sequence. This has a time complexity of O(n^2), whereas computing the pdfs in a single pass is O(n). See compressors/language_model.py for the implementation details.
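To make the trade-off concrete, here is a minimal sketch of the two evaluation modes; model_pdf is a hypothetical stand-in for a model forward pass that returns one next-token pdf per input position:

```python
# Sketch only: `model_pdf` is a hypothetical forward pass returning one
# next-token pdf per position of its input sequence.

def pdfs_single_pass(tokens):
    # O(n): a single forward pass yields every pdf at once, but tiny
    # numerical differences can break bit-exact decompression.
    return model_pdf(tokens)

def pdfs_per_prefix(tokens):
    # O(n^2): run the model on each prefix separately and keep only the
    # last position's pdf, matching what the decompressor can recompute.
    return [model_pdf(tokens[: i + 1])[-1] for i in range(len(tokens))]
```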
thank you