`TFIDF.transform_many()` fails on `DataFrame` input

bdewilde commented 1 month ago

Versions

river version: 0.21.2 Python version: 3.11.7 Operating system: macOS 14.4

Describe the bug

The TFIDF feature extractor claims to support both online and mini-batch transformations, but the latter case only works when the transformer doesn't specify the on parameter. In other words, batch mode works for pd.Series input, but not pd.Dataframe.

Steps/code to reproduce

import pandas as pd
import river.feature_extraction

model = river.feature_extraction.TFIDF()
X = pd.Series(["foo bar bat baz", "foo bar spam eggs"])
for rec in X:
    print(model.transform_one(rec))
# WORKS
# {'foo': 0.5, 'bar': 0.5, 'bat': 0.5, 'baz': 0.5}
# {'foo': 0.5, 'bar': 0.5, 'spam': 0.5, 'eggs': 0.5}
print(model.clone().transform_many(X))
# WORKS
#    foo  bar  bat  baz  spam  eggs
# 0    1    1    1    1     0     0
# 1    1    1    0    0     1     1

model = river.feature_extraction.TFIDF(on="text")
X = pd.DataFrame([{"text": "foo bar bat baz"}, {"text": "foo bar spam eggs"}])
for rec in X.to_dict(orient="records"):
    print(model.transform_one(rec))
# WORKS
# {'foo': 0.5, 'bar': 0.5, 'bat': 0.5, 'baz': 0.5}
# {'foo': 0.5, 'bar': 0.5, 'spam': 0.5, 'eggs': 0.5}
print(model.clone().transform_many(X))
# DOES NOT WORK

That last call produces the following traceback:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[95], line 1
----> 1 print(model.clone().transform_many(X))

File [~/.pyenv/versions/3.11.7/envs/ds/lib/python3.11/site-packages/river/feature_extraction/vectorize.py:349](http://localhost:8888/lab/tree/Desktop/notebooks/~/.pyenv/versions/3.11.7/envs/ds/lib/python3.11/site-packages/river/feature_extraction/vectorize.py#line=348), in BagOfWords.transform_many(self, X)
    347 for d in X:
    348     t: int
--> 349     for t, f in collections.Counter(self.process_text(d)).items():
    350         indices.append(index.setdefault(t, len(index)))
    351         data.append(f)

File [~/.pyenv/versions/3.11.7/envs/ds/lib/python3.11/site-packages/river/feature_extraction/vectorize.py:220](http://localhost:8888/lab/tree/Desktop/notebooks/~/.pyenv/versions/3.11.7/envs/ds/lib/python3.11/site-packages/river/feature_extraction/vectorize.py#line=219), in VectorizerMixin.process_text(self, x)
    218 def process_text(self, x):
    219     for step in self.processing_steps:
--> 220         x = step(x)
    221     return x

TypeError: string indices must be integers, not 'str'

bdewilde commented 1 month ago

Well, I looked at the source code. Unfortunately, the problems go deeper than inconsistent type annotations:

While TFIDF implements its own .learn_one() and .transform_one() methods, which look to be doing the right thing, it inherits .learn_many() and .transform_many() from its parent BagOfWords class. The BagOfWords.learn_many() method is a no-op, because it doesn't need to learn anything; but TFIDF certainly does! And the BagOfWords.transform_many() method just creates a bag of words with integer counts from each record, so there's no term-frequency / inverse-doc-frequency weighting happening, as would be expected for the TFIDF transformer.

I'm going to side-step all of this for now by just iterating over the records one at a time, since the mini-batch operations are both invalid for TFIDF. If I find some spare time, I'll see about submitting a PR to fix.

bdewilde commented 1 month ago

I took a shot at implementing learn/transform many for the TFIDF transformer. Transformed data / predictions are quite similar when using these .*_many() methods over a batch compared to .*_one() equivalents in a stream, so I think they're doing the correct calculations. I'm less sure about if these implementations are "good" and/or in line with standards followed elsewhere in this lib. At the very least, here's what it could look like::

class MyTFIDF(river.feature_extraction.TFIDF):
    def learn_many(self, X: pd.Series) -> None:
        # increment global document counter
        self.n += X.shape[0]
        # update document counts
        doc_counts = (
            X.map(lambda x: set(self.process_text(x)))
            .explode()
            .value_counts()
            .to_dict()
        )
        self.dfs.update(doc_counts)

    def transform_many(self, X: pd.Series) -> pd.DataFrame:
        """Transform pandas series of string into tf-idf pandas sparse dataframe."""
        indptr, indices, data = [0], [], []
        index: dict[int, int] = {}
        for doc in X:
            term_weights: dict[int, float] = self.transform_one(doc)
            for term, weight in term_weights.items():
                indices.append(index.setdefault(term, len(index)))
                data.append(weight)
            indptr.append(len(data))

        return pd.DataFrame.sparse.from_spmatrix(
            scipy.sparse.csr_matrix((data, indices, indptr)),
            index=X.index,
            columns=index.keys(),
        )

online-ml / river