Closed MaxHalford closed 3 years ago
I'll take care of it.
I'm wondering about the input format for Naïve Bayes in mini-batch mode. I'll share my intuition here while waiting for confirmation.
I think that a Term Frequency DataFrame is the input best suited to the Naïve Bayes model in mini-batch mode. It seems to me to be the most coherent data structure given the input type we chose for the incremental version: collections.Counter.
>>> docs = [
... ('Chinese Beijing Chinese', 'yes'),
... ('Chinese Chinese Shanghai', 'yes'),
... ('Chinese Macao', 'yes'),
... ('Tokyo Japan Chinese', 'no')
... ]
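To make the comparison concrete, here is a small sketch (using whitespace splitting purely for illustration) of how those documents look in the incremental format mentioned above, i.e. one collections.Counter of term frequencies per document:

```python
from collections import Counter

docs = [
    ('Chinese Beijing Chinese', 'yes'),
    ('Chinese Chinese Shanghai', 'yes'),
    ('Chinese Macao', 'yes'),
    ('Tokyo Japan Chinese', 'no'),
]

# In the incremental API, each document is represented as a
# collections.Counter of term frequencies, fed one at a time.
counters = [Counter(text.split()) for text, _ in docs]
```

A Term Frequency DataFrame is then just these Counters stacked row-wise, which is why the two formats feel coherent with one another.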
If we select the Term Frequency DataFrame, then it is the role of the feature_extraction.BagOfWords module to convert a pandas.Series of text into a Term Frequency DataFrame. 🙂
class BagOfWords:

    def transform_many(self, X: pd.Series):
        pass
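A minimal way to flesh out that stub (a hedged sketch: whitespace splitting stands in for the real pre-processing, and the method name follows the signature above):

```python
from collections import Counter

import pandas as pd


class BagOfWords:
    """Sketch: turn a pandas.Series of strings into a Term Frequency
    DataFrame, with one row per document and one column per term."""

    def transform_many(self, X: pd.Series) -> pd.DataFrame:
        # Build one Counter per document, then let pandas align the
        # vocabulary across documents; absent terms become 0.
        freqs = [Counter(doc.split()) for doc in X]
        return pd.DataFrame(freqs, index=X.index).fillna(0).astype(int)


X = pd.Series(['Chinese Beijing Chinese', 'Tokyo Japan Chinese'])
tf = BagOfWords().transform_many(X)
```

Keeping `index=X.index` means the output rows stay aligned with the input documents, which matters later for prediction.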
Finally, if we want to vectorize NaiveBayes and take full advantage of mini-batches, we should also vectorize feature_extraction.VectorizerMixin, which is responsible for pre-processing the text.
class VectorizerMixin:

    def process_text_many(self, X: pd.Series):
        pass
I have adapted BernoulliNB following this format and it looks good to me. To be confirmed :)
Raphaël
I think it's fine to start with having process_text_many just loop through the documents and call process_text. People can use scikit-learn's CountVectorizer and TfidfVectorizer in addition to our BagOfWords and TFIDF.
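That "just loop" starting point could look like this (a sketch; the lowercase/strip pre-processing is a placeholder, not the library's actual implementation):

```python
import pandas as pd


class VectorizerMixin:
    """Sketch: the mini-batch method simply delegates to the
    single-document one."""

    def process_text(self, text: str) -> str:
        # Placeholder pre-processing: lowercase and strip whitespace.
        return text.lower().strip()

    def process_text_many(self, X: pd.Series) -> pd.Series:
        # Loop over documents; pandas.Series.apply keeps the index.
        return X.apply(self.process_text)


X = pd.Series(['  Chinese Beijing  ', 'Tokyo Japan'], index=[10, 11])
out = VectorizerMixin().process_text_many(X)
```

It isn't vectorized, but it gives the mini-batch API its shape; the body can be optimized later without changing callers.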
@raphaelsty could you look into using a sparse dataframe? Naive Bayes just requires "counting" so you should be able to work with a sparse dataframe :)
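To illustrate why a sparse representation fits here (a sketch using pandas' sparse dtypes; the example values are made up): term-frequency matrices are mostly zeros, and the "counting" a Naive Bayes needs reduces to column sums, which sparse columns support directly.

```python
import pandas as pd

# A tiny dense term-frequency matrix, then converted to sparse
# columns so the zeros are not stored explicitly.
dense = pd.DataFrame(
    [[2, 1, 0], [0, 0, 1]],
    columns=['Chinese', 'Beijing', 'Tokyo'],
)
tf = dense.astype(pd.SparseDtype(int, fill_value=0))

# "Counting" then boils down to simple column sums.
counts = tf.sum(axis=0)
```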
Of course :)
What about this format for prediction? 🙂
# DataFrame
>>> model.predict_proba_many(unseen_data)
health butcher
0 0.779191 0.220809
1 0.376923 0.623077
# Series
>>> model.predict_many(unseen_data)
0 health
1 butcher
dtype: object
Yes perfect! Just make sure that the outputs have the same index as the input.
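One way to satisfy that index requirement (a sketch; idxmax is a stand-in for however predict_many ends up being implemented, and the probabilities are the made-up ones from above):

```python
import pandas as pd

# Hypothetical output of predict_proba_many: the index is the one
# carried over from the input batch.
proba = pd.DataFrame(
    {'health': [0.779191, 0.376923], 'butcher': [0.220809, 0.623077]},
    index=[0, 1],
)

# predict_many can be the column with the highest probability per row,
# which automatically preserves the input's index.
preds = proba.idxmax(axis=1)
```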
Should this one be closed by #424?
Indeed!
I believe that it shouldn't be too difficult to upgrade the models from the naive_bayes module by allowing them to process mini-batches (i.e. pandas DataFrames), just like what we do in the linear_model module and in preprocessing.StandardScaler. Indeed, Naive Bayes models essentially boil down to "counting". The one model that might need to be refactored a bit more is GaussianNB.
Of course, scikit-learn's naive_bayes module already supports mini-batches. However, we can bring some value to the table, just like we did for the models in linear_model.
Ideally, I would like to not be the person to implement this, to reduce the bus factor. I think it would be great if someone else could have a go at processing mini-batches.
I don't have the time to detail things right now, but I would be glad to answer questions and review pull requests.
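To make the "counting" claim concrete, here is a hedged sketch (not the library's actual code) of the core update a mini-batch multinomial Naive Bayes would perform, given a Term Frequency DataFrame X and a Series of labels y:

```python
import pandas as pd

# Tiny made-up batch: a term-frequency matrix and its labels.
X = pd.DataFrame(
    [[2, 1, 0], [0, 0, 1]],
    columns=['Chinese', 'Beijing', 'Tokyo'],
)
y = pd.Series(['yes', 'no'])

# The whole fitting step reduces to two counts, both of which can be
# accumulated across successive mini-batches:
class_counts = y.value_counts()   # number of documents per class
term_counts = X.groupby(y).sum()  # total term frequency per class
```

Because both quantities are additive, each new mini-batch just adds its own counts to the running totals, which is what makes this model a natural fit for learn_many.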