MaxHalford opened 11 months ago
Hi, I would like to work on this issue.
Great! There's a lot of good stuff to glean from reading the Hacker News thread and the comments on the blog post.
Hi, I'm working on this issue with @ayameira-loop and we wanted to check some results.
The training algorithm we've implemented so far uses the Zstandard library. Basically, the steps are:
Predicting works pretty much as described here (given an unlabeled text, concatenate it with the training documents for each label, compress the result, and measure the size).
(The implementation can be found here)
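The prediction step described above can be sketched roughly as follows. This is a hypothetical illustration, not the actual implementation: it uses Python's built-in zlib in place of Zstandard, and the function names are made up.

```python
import zlib


def compressed_size(text: str) -> int:
    # Size in bytes of the zlib-compressed text.
    return len(zlib.compress(text.encode()))


def predict(text: str, docs_by_label: dict) -> str:
    # For each label, measure how many extra bytes the unlabeled text adds
    # to the compressed size of that label's training corpus. The label
    # whose corpus compresses the new text best (smallest increase) wins.
    scores = {}
    for label, docs in docs_by_label.items():
        corpus = " ".join(docs)
        base = compressed_size(corpus)
        combined = compressed_size(corpus + " " + text)
        scores[label] = combined - base
    return min(scores, key=scores.get)
```

The intuition is that a compressor exploits redundancy, so text that shares vocabulary with a label's corpus costs fewer additional bytes under that label.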
**Testing**

We tested the performance of the text compression classifier on the 20 Newsgroups dataset, using the same 4 categories (alt.atheism, talk.religion.misc, comp.graphics, sci.space). For each sample, the steps are: predict_one, update the metric (accuracy), learn_one.
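That per-sample loop is the usual test-then-train (progressive validation) protocol. A minimal sketch of it, using a toy majority-class model as a stand-in for the compression classifier (the method names mirror River's API, but `MajorityClass` and `progressive_accuracy` are made-up names here):

```python
class MajorityClass:
    """Toy stand-in model: predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = {}

    def predict_one(self, x):
        return max(self.counts, key=self.counts.get) if self.counts else None

    def learn_one(self, x, y):
        self.counts[y] = self.counts.get(y, 0) + 1


def progressive_accuracy(model, stream):
    # Test-then-train: predict on each sample first, update the accuracy,
    # and only then let the model learn from the true label.
    correct = total = 0
    for x, y in stream:
        y_pred = model.predict_one(x)
        if y_pred is not None:  # skip samples seen before any training
            correct += y_pred == y
            total += 1
        model.learn_one(x, y)
    return correct / total if total else 0.0
```

Because every prediction is made before the model sees the true label, the resulting accuracy is an honest estimate of streaming performance.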
The results we obtained can be observed in the table below:
We were wondering if the results look good so far.
Hey there! That's a really good start, and to be honest we could integrate it as a first version in River.
The downside I see is that the compression state is rebuilt each time a new sample is seen. I would like to see an implementation where the compressor is streaming, in a sense. Do you see what I mean?
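To sketch what such a streaming compressor could look like: zlib's compression objects support `copy()`, which lets you score a candidate text against an incrementally built state without rebuilding it. This is only an illustration of the idea, using zlib instead of Zstandard; the class and method names are hypothetical.

```python
import zlib


class StreamingLabelModel:
    """Hypothetical sketch: one incremental compressor per label, extended
    as documents arrive instead of being rebuilt on every prediction."""

    def __init__(self):
        self.states = {}  # label -> zlib compression object

    def learn(self, text, label):
        if label not in self.states:
            self.states[label] = zlib.compressobj()
        # Feed the document into the label's streaming compressor; the
        # output bytes are discarded, only the internal state matters.
        self.states[label].compress(text.encode() + b" ")

    def score(self, text, label):
        # Copy the state twice: the difference between flushing with and
        # without the text is the text's marginal compressed cost.
        base = self.states[label].copy()
        with_text = self.states[label].copy()
        base_len = len(base.flush())
        full_len = len(with_text.compress(text.encode())) + len(with_text.flush())
        return full_len - base_len

    def predict(self, text):
        return min(self.states, key=lambda label: self.score(text, label))
```

A nice side effect of zlib's bounded sliding window is that the state naturally stops referencing very old data, which is close in spirit to the "forgetting" behaviour discussed in this issue.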
Hi! That's really nice!
I see, we couldn't find a more optimized way to handle the compression state with the Zstandard library (i.e. updating the Compressor instead of rebuilding it).
Compression algorithms can be used as learning examples. For instance see here for text classification. It would be fun to do this with an incremental compression algorithm that has limited memory and "forgets" old data.
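One simple way to approximate the "limited memory" behaviour is to bound the number of stored documents per label, so old data is forgotten as new data rolls in. A hypothetical sketch (all names made up, using zlib-based scoring for illustration):

```python
import zlib
from collections import deque


class ForgetfulCompressionClassifier:
    """Hypothetical sketch: keeps only the last `maxlen` documents per
    label, so the model gradually forgets old data."""

    def __init__(self, maxlen=50):
        self.docs = {}  # label -> deque of recent documents
        self.maxlen = maxlen

    def learn_one(self, text, label):
        # deque(maxlen=...) silently drops the oldest document when full.
        self.docs.setdefault(label, deque(maxlen=self.maxlen)).append(text)

    def predict_one(self, text):
        if not self.docs:
            return None

        def cost(label):
            # Marginal compressed cost of the text under this label's
            # recent corpus only.
            corpus = " ".join(self.docs[label])
            return len(zlib.compress((corpus + " " + text).encode())) - len(
                zlib.compress(corpus.encode())
            )

        return min(self.docs, key=cost)
```

Recompressing the bounded corpus on each prediction is still wasteful, but the memory footprint is fixed and the model adapts to drift as the deques roll over.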