online-ml / river

🌊 Online machine learning in Python
https://riverml.xyz
BSD 3-Clause "New" or "Revised" License

Support mini-batches in the naive_bayes module #398

Closed · MaxHalford closed this issue 3 years ago

MaxHalford commented 3 years ago

I believe that it shouldn't be too difficult to upgrade the models in the naive_bayes module by allowing them to process mini-batches (i.e. pandas DataFrames), just like we do in the linear_model module and in preprocessing.StandardScaler. Indeed, Naive Bayes models essentially boil down to "counting". The one model that might need a bit more refactoring is GaussianNB.
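
To give a rough idea, here is a minimal sketch of what this could boil down to for a multinomial-style model. It is illustrative only, not the actual implementation, and it follows the learn_many naming we use for mini-batches elsewhere:

import collections

import pandas as pd

class MultinomialNBSketch:
    """Illustrative only: running per-class counts updated from a DataFrame."""

    def __init__(self):
        self.class_counts = collections.Counter()
        self.feature_counts = collections.defaultdict(collections.Counter)

    def learn_many(self, X: pd.DataFrame, y: pd.Series):
        # Count how many times each class appears in the batch.
        self.class_counts.update(y.value_counts().to_dict())
        # Sum the term frequencies per class and fold them into the running counters.
        for label, group in X.groupby(y):
            self.feature_counts[label].update(group.sum().to_dict())
        return self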

Of course, scikit-learn's naive_bayes module already supports mini-batches. However, we can bring some value to the table, just like we did for the models in linear_model.

Ideally, I would prefer not to be the person to implement this, in order to reduce the bus factor. I think it would be great if someone else could have a go at processing mini-batches.

I don't have the time to detail things right now, but I would be glad to answer questions and review pull requests.

raphaelsty commented 3 years ago

I'll take care of it.

I'm wondering about the input format for Naive Bayes in mini-batch mode. I'll share my intuition here while waiting for confirmation.

I think a term-frequency DataFrame is the right input for the Naive Bayes models in mini-batch mode. It seems to me to be the data structure most consistent with the input we chose for the incremental version: collections.Counter.

>>> docs = [
...     ('Chinese Beijing Chinese', 'yes'),
...     ('Chinese Chinese Shanghai', 'yes'),
...     ('Chinese Macao', 'yes'),
...     ('Tokyo Japan Chinese', 'no')
... ]
The corresponding term-frequency DataFrame:

   Chinese  Beijing  Shanghai  Macao  Tokyo  Japan
0        2        1         0      0      0      0
1        2        0         1      0      0      0
2        1        0         0      1      0      0
3        1        0         0      0      1      1

If we go with the term-frequency DataFrame, then it is the role of feature_extraction.BagOfWords to convert a pandas.Series of text into a term-frequency DataFrame. 🙂

class BagOfWords:

    def transform_many(self, X: pd.Series) -> pd.DataFrame:
        """Convert a series of documents into a term-frequency DataFrame."""
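
For instance, transform_many could behave along these lines (hypothetical output, assuming tokens are lowercased):

>>> X = pd.Series(['Chinese Beijing Chinese', 'Tokyo Japan Chinese'])
>>> BagOfWords().transform_many(X)
   chinese  beijing  tokyo  japan
0        2        1      0      0
1        1        0      1      1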

Finally, if we want to vectorize NaiveBayes and take full advantage of mini-batches, we should also vectorize feature_extraction.VectorizerMixin, which is responsible for pre-processing the text.

class VectorizerMixin:

    def process_text_many(self, X: pd.Series) -> pd.Series:
        """Pre-process a series of raw documents (tokenizing, normalizing, etc.)."""

I have adapted BernoulliNB to follow this format and it looks good to me. To be confirmed :)

Raphaël

MaxHalford commented 3 years ago

I think it's fine to start with having process_text_many just loop through the documents and call process_text. People can use scikit-learn's CountVectorizer and TfidfVectorizer in addition to our BagOfWords and TFIDF.
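
Something along these lines would do as a first version (just a sketch, assuming process_text already exists):

import pandas as pd

class VectorizerMixin:

    def process_text(self, text: str) -> str:
        # Existing single-document pre-processing (lowercasing, stripping, etc.).
        return text.lower()

    def process_text_many(self, X: pd.Series) -> pd.Series:
        # First version: apply the single-document method row by row.
        # The result keeps the same index as the input series.
        return X.apply(self.process_text)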

MaxHalford commented 3 years ago

@raphaelsty could you look into using a sparse DataFrame? Naive Bayes just requires "counting", so you should be able to work with a sparse DataFrame :)
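
For illustration, scikit-learn's CountVectorizer can produce such a sparse structure; this is just a sketch of the idea, not a suggestion to depend on scikit-learn:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

docs = pd.Series([
    'Chinese Beijing Chinese',
    'Chinese Chinese Shanghai',
    'Chinese Macao',
    'Tokyo Japan Chinese',
])

# fit_transform returns a scipy.sparse matrix; wrapping it in a sparse
# DataFrame lets us count term frequencies without densifying anything.
vec = CountVectorizer()
counts = vec.fit_transform(docs)
tf = pd.DataFrame.sparse.from_spmatrix(counts, columns=vec.get_feature_names_out())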

raphaelsty commented 3 years ago

Of course :)

raphaelsty commented 3 years ago

What about this format for prediction? 🙂

# DataFrame
>>> model.predict_proba_many(unseen_data)
     health   butcher
0  0.779191  0.220809
1  0.376923  0.623077

# Series
>>> model.predict_many(unseen_data)
0     health
1    butcher
dtype: object

MaxHalford commented 3 years ago

Yes, perfect! Just make sure that the outputs have the same index as the input.
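
For example (a sketch with a hypothetical string index):

>>> unseen_data.index
Index(['doc_a', 'doc_b'], dtype='object')

>>> model.predict_many(unseen_data)
doc_a     health
doc_b    butcher
dtype: object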

smastelini commented 3 years ago

Should this one be closed by #424?

MaxHalford commented 3 years ago

Indeed!