onnx / sklearn-onnx

Convert scikit-learn models and pipelines to ONNX
Apache License 2.0

sklearn ColumnTransformer with TfidfVectorizer requires columns to be referenced by positional integer index #995

Open sharathts14 opened 1 year ago

sharathts14 commented 1 year ago

If a column is referenced by its name (a string), conversion fails with the RuntimeError below:

RuntimeError: Unable to find column name 'subject' among names ['input']. Make sure the input names specified with parameter initial_types fits the column names specified in the pipeline to convert. This may happen because a ColumnTransformer follows a transformer without any mapped converter in a pipeline.

The error can be reproduced with the example from https://onnx.ai/sklearn-onnx/auto_examples/plot_tfidfvectorizer.html by converting the training dataset to a pandas DataFrame and referencing the columns in the ColumnTransformer by name.

Code to reproduce:


import ssl
ssl._create_default_https_context = ssl._create_unverified_context
import numpy as np
import pandas as pd
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import StringTensorType

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import fetch_20newsgroups
try:
    from sklearn.datasets._twenty_newsgroups import (
        strip_newsgroup_footer, strip_newsgroup_quoting)
except ImportError:
    # scikit-learn < 0.24
    from sklearn.datasets.twenty_newsgroups import (
        strip_newsgroup_footer, strip_newsgroup_quoting)
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression

# limit the list of categories to make running this example faster.
categories = ['alt.atheism', 'talk.religion.misc']
train = fetch_20newsgroups(random_state=1,
                           subset='train',
                           categories=categories,
                           )
test = fetch_20newsgroups(random_state=1,
                          subset='test',
                          categories=categories,
                          )

class SubjectBodyExtractor(BaseEstimator, TransformerMixin):
  """Extract the subject & body from a usenet post in a single pass.
  Takes a sequence of strings and produces an object array with two
  columns: `subject` (column 0) and `body` (column 1).
  """

  def fit(self, x, y=None):
    return self

  def transform(self, posts):
    # construct object dtype array with two columns
    # first column = 'subject' and second column = 'body'
    features = np.empty(shape=(len(posts), 2), dtype=object)
    for i, text in enumerate(posts):
      headers, _, bod = text.partition('\n\n')
      bod = strip_newsgroup_footer(bod)
      bod = strip_newsgroup_quoting(bod)
      features[i, 1] = bod

      prefix = 'Subject:'
      sub = ''
      for line in headers.split('\n'):
        if line.startswith(prefix):
          sub = line[len(prefix):]
          break
      features[i, 0] = sub

    return features

train_data = SubjectBodyExtractor().fit_transform(train.data)
test_data = SubjectBodyExtractor().fit_transform(test.data)

# convert training data to dataframe so that column name can be used instead of column index
train_df = pd.DataFrame(train_data, columns = ['subject', 'body'])
print(train_df.head(1))

pipeline = Pipeline([
    ('union', ColumnTransformer(
        [
            ('subject', TfidfVectorizer(min_df=50, max_features=500), 'subject'),  # 0, is replaced with column name 'subject'

            ('body_bow', Pipeline([
                ('tfidf', TfidfVectorizer()),
                ('best', TruncatedSVD(n_components=50)),
            ]), 'body'),  # 1, is replaced with column name 'body'

            # Removed from the original example as
            # it requires a custom converter.
            # ('body_stats', Pipeline([
            #   ('stats', TextStats()),  # returns a list of dicts
            #   ('vect', DictVectorizer()),  # list of dicts -> feature matrix
            # ]), 1),
        ],

        transformer_weights={
            'subject': 0.8,
            'body_bow': 0.5,
            # 'body_stats': 1.0,
        }
    )),

    # Use a LogisticRegression classifier on the combined features.
    # Instead of LinearSVC (not fully ready in onnxruntime).
    ('logreg', LogisticRegression()),
])

pipeline.fit(train_df, train.target)
print(pipeline.steps)
#print(classification_report(pipeline.predict(test_data), test.target))

seps = {
    TfidfVectorizer: {
        "separators": [
            ' ', '.', '\\?', ',', ';', ':', '!',
            '\\(', '\\)', '\n', '"', "'",
            "-", "\\[", "\\]", "@"
        ]
    }
}
model_onnx = convert_sklearn(
    pipeline, "tfidf",
    initial_types=[("input", StringTensorType([None, 2]))],
    # options=seps,
    target_opset=12)
sharathts14 commented 1 year ago

I am not sure whether this is a bug, or whether referencing columns by name (string) is currently unsupported, so that a positional integer index is required.

sharathts14 commented 1 year ago

I see the problem in the code above: initial_types has to change once the input is converted to a DataFrame, with one entry per column.

With initial_types=[("subject", StringTensorType([None, 1])), ("body", StringTensorType([None, 1]))],

the code works fine, so the issue as filed here is invalid.
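
For anyone landing here, a minimal sketch of that per-column setup. The helpers string_initial_types and column_feeds are hypothetical names of my own, not part of skl2onnx, and the sketch assumes every column holds strings. Each DataFrame column becomes its own named input of shape [None, 1], and the runtime feed is a dict with one (N, 1) array per column:

```python
import numpy as np


def string_initial_types(column_names):
    """Build one string input of shape [None, 1] per column name.

    Hypothetical helper; skl2onnx is imported lazily so the rest of
    the sketch can be used without it.
    """
    from skl2onnx.common.data_types import StringTensorType
    return [(name, StringTensorType([None, 1])) for name in column_names]


def column_feeds(columns):
    """Turn {name: sequence of str} into {name: object array of shape (N, 1)}.

    Hypothetical helper; the keys must match the names given in
    initial_types.
    """
    return {
        name: np.asarray(values, dtype=object).reshape(-1, 1)
        for name, values in columns.items()
    }
```

With the names from this thread, the conversion call would become convert_sklearn(pipeline, "tfidf", initial_types=string_initial_types(["subject", "body"]), target_opset=12), and the inference feed would be column_feeds({"subject": train_df["subject"], "body": train_df["body"]}). I have not run this end to end here, so treat it as an outline.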

However, my internal example, which I cannot share here, fails with the error below:

RuntimeError: Unable to find column name 'command_normalized' among names ['variable']. Make sure the input names specified with parameter initial_types fits the column names specified in the pipeline to convert. This may happen because a ColumnTransformer follows a transformer without any mapped converter in a pipeline.

Somehow the inputs I defined are converted to [Variable('variable', 'variable6', type=FloatTensorType(shape=[]))] in _parse_sklearn_column_transformer of _parse.py.

Still debugging.

sharathts14 commented 1 year ago

Ah, I see that the 'note' section in https://onnx.ai/sklearn-onnx/api_summary.html mentions exactly this.
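
One possible workaround, assuming I read that note correctly (column names in a ColumnTransformer are not supported by the converter, only integers), is to translate the names into positions before building the ColumnTransformer. A minimal sketch; names_to_positions is a hypothetical helper, not part of scikit-learn or sklearn-onnx:

```python
def names_to_positions(transformers, column_names):
    """Replace string column selectors with their integer positions.

    `transformers` is a list of (name, estimator, columns) tuples as
    passed to ColumnTransformer. A scalar selector stays a scalar and a
    list stays a list, so each transformer still receives 1-D or 2-D
    input as before.
    """
    index = {name: i for i, name in enumerate(column_names)}
    converted = []
    for name, estimator, cols in transformers:
        if isinstance(cols, str):
            cols = index[cols]
        elif isinstance(cols, (list, tuple)):
            cols = [index[c] if isinstance(c, str) else c for c in cols]
        converted.append((name, estimator, cols))
    return converted
```

The result can be passed to ColumnTransformer in place of the name-based list, for example names_to_positions(steps, list(train_df.columns)). With integer selectors, a single input such as initial_types=[("input", StringTensorType([None, 2]))], as in the linked example, should then apply. That last part is my assumption and is not verified against the internal case above.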

jjasont commented 11 months ago

@sharathts14 hi there, I recently ran into this limitation as well. How did you resolve it, and what is your workaround?