MaxHalford opened 11 months ago
Hi, I would like to work on this issue.
Great! There's a lot of good stuff to glean from reading the Hacker News thread and the comments on the blog post.
Hi, I'm working on this issue with @ayameira-loop and we wanted to check some results.
The training algorithm we've implemented so far uses the Zstandard library. Basically, the steps are:
Predicting works pretty much as described here (given an unlabeled text, concatenate it with the training documents for each label, compress the result, and measure the size).
(The implementation can be found here)
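The prediction step described above can be sketched roughly as follows. This is a hypothetical illustration, not the actual implementation: it uses Python's built-in zlib in place of Zstandard, and the function names are made up.

```python
import zlib


def compressed_size(text: str) -> int:
    # Size in bytes of the zlib-compressed text.
    return len(zlib.compress(text.encode()))


def predict(text: str, docs_by_label: dict) -> str:
    # For each label, measure how many extra bytes the unlabeled text adds
    # to the compressed size of that label's training corpus. The label
    # whose corpus compresses the new text best (smallest increase) wins.
    scores = {}
    for label, docs in docs_by_label.items():
        corpus = " ".join(docs)
        base = compressed_size(corpus)
        combined = compressed_size(corpus + " " + text)
        scores[label] = combined - base
    return min(scores, key=scores.get)
```

The intuition is that a compressor exploits redundancy, so text that shares vocabulary with a label's corpus costs fewer additional bytes under that label.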
**Testing**

We tested the performance of the text compression classifier on the 20 Newsgroups dataset, using the same 4 categories (alt.atheism, talk.religion.misc, comp.graphics, sci.space). For each sample, the steps are: predict_one, update the metric (accuracy), learn_one.
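That per-sample loop is the usual test-then-train (progressive validation) protocol. A minimal sketch of it, using a toy majority-class model as a stand-in for the compression classifier (the method names mirror River's API, but `MajorityClass` and `progressive_accuracy` are made-up names here):

```python
class MajorityClass:
    """Toy stand-in model: predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = {}

    def predict_one(self, x):
        return max(self.counts, key=self.counts.get) if self.counts else None

    def learn_one(self, x, y):
        self.counts[y] = self.counts.get(y, 0) + 1


def progressive_accuracy(model, stream):
    # Test-then-train: predict on each sample first, update the accuracy,
    # and only then let the model learn from the true label.
    correct = total = 0
    for x, y in stream:
        y_pred = model.predict_one(x)
        if y_pred is not None:  # skip samples seen before any training
            correct += y_pred == y
            total += 1
        model.learn_one(x, y)
    return correct / total if total else 0.0
```

Because every prediction is made before the model sees the true label, the resulting accuracy is an honest estimate of streaming performance.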
The results we obtained can be observed in the table below:
We were wondering if the results look good so far.
Hey there! That's a really good start, and to be honest we could integrate it as a first version in River.
The downside I see is that the compression state is rebuilt each time a new sample is seen. I would like to see an implementation where the compressor is streaming, in a sense. Do you see what I mean?
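To sketch what such a streaming compressor could look like: zlib's compression objects support `copy()`, which lets you score a candidate text against an incrementally built state without rebuilding it. This is only an illustration of the idea, using zlib instead of Zstandard; the class and method names are hypothetical.

```python
import zlib


class StreamingLabelModel:
    """Hypothetical sketch: one incremental compressor per label, extended
    as documents arrive instead of being rebuilt on every prediction."""

    def __init__(self):
        self.states = {}  # label -> zlib compression object

    def learn(self, text, label):
        if label not in self.states:
            self.states[label] = zlib.compressobj()
        # Feed the document into the label's streaming compressor; the
        # output bytes are discarded, only the internal state matters.
        self.states[label].compress(text.encode() + b" ")

    def score(self, text, label):
        # Copy the state twice: the difference between flushing with and
        # without the text is the text's marginal compressed cost.
        base = self.states[label].copy()
        with_text = self.states[label].copy()
        base_len = len(base.flush())
        full_len = len(with_text.compress(text.encode())) + len(with_text.flush())
        return full_len - base_len

    def predict(self, text):
        return min(self.states, key=lambda label: self.score(text, label))
```

A nice side effect of zlib's bounded sliding window is that the state naturally stops referencing very old data, which is close in spirit to the "forgetting" behaviour discussed in this issue.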
Hi! That's really nice!
I see, we couldn't find a more optimized way to handle the compression state with the Zstandard library (i.e. updating the Compressor instead of rebuilding it).
Compression algorithms can be used as learning examples. For instance see here for text classification. It would be fun to do this with an incremental compression algorithm that has limited memory and "forgets" old data.
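One simple way to approximate the "limited memory" behaviour is to bound the number of stored documents per label, so old data is forgotten as new data rolls in. A hypothetical sketch (all names made up, using zlib-based scoring for illustration):

```python
import zlib
from collections import deque


class ForgetfulCompressionClassifier:
    """Hypothetical sketch: keeps only the last `maxlen` documents per
    label, so the model gradually forgets old data."""

    def __init__(self, maxlen=50):
        self.docs = {}  # label -> deque of recent documents
        self.maxlen = maxlen

    def learn_one(self, text, label):
        # deque(maxlen=...) silently drops the oldest document when full.
        self.docs.setdefault(label, deque(maxlen=self.maxlen)).append(text)

    def predict_one(self, text):
        if not self.docs:
            return None

        def cost(label):
            # Marginal compressed cost of the text under this label's
            # recent corpus only.
            corpus = " ".join(self.docs[label])
            return len(zlib.compress((corpus + " " + text).encode())) - len(
                zlib.compress(corpus.encode())
            )

        return min(self.docs, key=cost)
```

Recompressing the bounded corpus on each prediction is still wasteful, but the memory footprint is fixed and the model adapts to drift as the deques roll over.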