sklearn column transformer with Tfidfvectorizer requires column to be defined with its positional reference as integer

sharathts14 commented 1 year ago

If a column is referenced by its name (a string) rather than by its positional index, the conversion fails with the RuntimeError below:

RuntimeError: Unable to find column name 'subject' among names ['input']. Make sure the input names specified with parameter initial_types fits the column names specified in the pipeline to convert. This may happen because a ColumnTransformer follows a transformer without any mapped converter in a pipeline.

The error can be reproduced with the example from https://onnx.ai/sklearn-onnx/auto_examples/plot_tfidfvectorizer.html by converting the training dataset to a pandas DataFrame and referencing the columns in the ColumnTransformer by name.

Code to reproduce:


import ssl

import numpy as np
import pandas as pd
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import StringTensorType

# Work around SSL certificate verification failures when downloading the dataset.
ssl._create_default_https_context = ssl._create_unverified_context

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import fetch_20newsgroups
try:
    from sklearn.datasets._twenty_newsgroups import (
        strip_newsgroup_footer, strip_newsgroup_quoting)
except ImportError:
    # scikit-learn < 0.24
    from sklearn.datasets.twenty_newsgroups import (
        strip_newsgroup_footer, strip_newsgroup_quoting)
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression

# limit the list of categories to make running this example faster.
categories = ['alt.atheism', 'talk.religion.misc']
train = fetch_20newsgroups(random_state=1,
                           subset='train',
                           categories=categories,
                           )
test = fetch_20newsgroups(random_state=1,
                          subset='test',
                          categories=categories,
                          )

class SubjectBodyExtractor(BaseEstimator, TransformerMixin):
  """Extract the subject & body from a usenet post in a single pass.
  Takes a sequence of strings and produces a dict of sequences. Keys are
  `subject` and `body`.
  """

  def fit(self, x, y=None):
    return self

  def transform(self, posts):
    # construct object dtype array with two columns
    # first column = 'subject' and second column = 'body'
    features = np.empty(shape=(len(posts), 2), dtype=object)
    for i, text in enumerate(posts):
      headers, _, bod = text.partition('\n\n')
      bod = strip_newsgroup_footer(bod)
      bod = strip_newsgroup_quoting(bod)
      features[i, 1] = bod

      prefix = 'Subject:'
      sub = ''
      for line in headers.split('\n'):
        if line.startswith(prefix):
          sub = line[len(prefix):]
          break
      features[i, 0] = sub

    return features

train_data = SubjectBodyExtractor().fit_transform(train.data)
test_data = SubjectBodyExtractor().fit_transform(test.data)

# convert training data to dataframe so that column name can be used instead of column index
train_df = pd.DataFrame(train_data, columns = ['subject', 'body'])
print(train_df.head(1))

pipeline = Pipeline([
    ('union', ColumnTransformer(
        [
            ('subject', TfidfVectorizer(min_df=50, max_features=500), 'subject'),  # 0, is replaced with column name 'subject'

            ('body_bow', Pipeline([
                ('tfidf', TfidfVectorizer()),
                ('best', TruncatedSVD(n_components=50)),
            ]), 'body'),  # 1, is replaced with column name 'body'

            # Removed from the original example as
            # it requires a custom converter.
            # ('body_stats', Pipeline([
            #   ('stats', TextStats()),  # returns a list of dicts
            #   ('vect', DictVectorizer()),  # list of dicts -> feature matrix
            # ]), 1),
        ],

        transformer_weights={
            'subject': 0.8,
            'body_bow': 0.5,
            # 'body_stats': 1.0,
        }
    )),

    # Use a LogisticRegression classifier on the combined features.
    # Instead of LinearSVC (not fully ready in onnxruntime).
    ('logreg', LogisticRegression()),
])

pipeline.fit(train_df, train.target)
print(pipeline.steps)
#print(classification_report(pipeline.predict(test_data), test.target))

seps = {
    TfidfVectorizer: {
        "separators": [
            ' ', '.', '\\?', ',', ';', ':', '!',
            '\\(', '\\)', '\n', '"', "'",
            "-", "\\[", "\\]", "@"
        ]
    }
}
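model_onnx = convert_sklearn(
    pipeline, "tfidf",
    # A single input named "input": this does not match the column names
    # 'subject' and 'body' used in the ColumnTransformer above.
    initial_types=[("input", StringTensorType([None, 2]))],
    # options=seps,
    target_opset=12)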

sharathts14 commented 1 year ago

I am not sure whether this is a bug, or whether columns currently have to be specified by positional integer because column names (strings) are not supported.

sharathts14 commented 1 year ago

I realize the problem in the code above is the definition of "initial_types", which has to change once the data is converted to a DataFrame.

With initial_types=[("subject", StringTensorType([None, 1])), ("body", StringTensorType([None, 1]))],

the code works fine, so the issue as filed here is invalid.
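For reference, a minimal sketch of building that list from the DataFrame's column names instead of typing it by hand. `initial_types_for` is a hypothetical helper, not part of skl2onnx, and it assumes every column is a single string column of shape [None, 1], as in the example above.

```python
def initial_types_for(columns, make_type):
    """Build one (name, tensor type) pair per DataFrame column.

    columns:   iterable of column names, e.g. train_df.columns
    make_type: callable that takes a shape and returns a tensor type,
               e.g. skl2onnx's StringTensorType (assumed: all columns are strings)
    """
    # One ONNX input per column, each with an unknown batch size and width 1.
    return [(str(name), make_type([None, 1])) for name in columns]
```

With the reproduction above, `initial_types_for(train_df.columns, StringTensorType)` should yield the same two entries, `subject` and `body`, that were written out manually.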

However, my internal example, which I cannot share here, fails with the error below:

RuntimeError: Unable to find column name 'command_normalized' among names ['variable']. Make sure the input names specified with parameter initial_types fits the column names specified in the pipeline to convert. This may happen because a ColumnTransformer follows a transformer without any mapped converter in a pipeline.

Somehow the input variables I defined are being converted to [Variable('variable', 'variable6', type=FloatTensorType(shape=[]))] in _parse_sklearn_column_transformer of _parse.py.

Still debugging.

sharathts14 commented 1 year ago

Ah, I see that the 'note' section in https://onnx.ai/sklearn-onnx/api_summary.html describes exactly this limitation.
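If, as the title of this issue suggests, the converter only resolves integer column references in that situation, one possible workaround is to translate column names into positions before building the ColumnTransformer. The sketch below is an assumption rather than a confirmed fix; `to_positional` is a hypothetical helper, not part of scikit-learn or skl2onnx.

```python
def to_positional(columns, selector):
    """Map a column name, or a list of names, to positional indices.

    columns:  ordered column names, e.g. list(train_df.columns)
    selector: a single name (str) or a list of names
    """
    positions = {name: i for i, name in enumerate(columns)}
    if isinstance(selector, str):
        # A scalar selector stays scalar, so a text column is still
        # passed to TfidfVectorizer as a 1-D sequence.
        return positions[selector]
    return [positions[name] for name in selector]


columns = ["subject", "body"]
print(to_positional(columns, "subject"))            # 0
print(to_positional(columns, ["body", "subject"]))  # [1, 0]
```

In the pipeline above, `'subject'` and `'body'` would then be replaced by `to_positional(list(train_df.columns), 'subject')` and `to_positional(list(train_df.columns), 'body')`, which brings the ColumnTransformer back to the integer form used in the original example.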

jjasont commented 11 months ago

@sharathts14 Hi there, I recently stumbled upon this limitation as well. How did you resolve it, and what is your workaround?

onnx / sklearn-onnx

sklearn column transformer with Tfidfvectorizer requires column to be defined with its positional reference as integer #995