jpmml / sklearn2pmml

Python library for converting Scikit-Learn pipelines to PMML
GNU Affero General Public License v3.0

Not able to create PMMLs from pipelines having custom text processing functions #150

Closed dilipsundar closed 5 years ago

dilipsundar commented 5 years ago

Hi @vruusmann

I'm trying to clean and vectorize my text before feeding it to my model and this is what I've written to do that.

class TextPreprocessor(BaseEstimator, TransformerMixin):

    def clean_text(self, text):
        text = re.sub(r'[^\x00-\x7F]+',' ', text)
        text = contractions.fix(text)
        text = text.lower()
        text = text.strip()
        text = re.sub(' +', ' ',text)
        text = re.sub('[^A-Za-z]+', ' ', text)
        text = ' '.join([w for w in text.split() if len(w)>2])
        return text

    def word_tag(self, nltk_tag):
        if nltk_tag.startswith('J'):
            return wordnet.ADJ
        elif nltk_tag.startswith('V'):
            return wordnet.VERB
        elif nltk_tag.startswith('N'):
            return wordnet.NOUN
        elif nltk_tag.startswith('R'):
            return wordnet.ADV
        else:
            return None

    def lemmatize_sentence(self, sentence):
        sentence = self.clean_text(str(sentence))
        nltk_tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
        wn_tagged = map(lambda x: (x[0], self.word_tag(x[1])), nltk_tagged)
        res_words = []
        for word, tag in wn_tagged:
            if tag is None:
                res_words.append(word)
            else:
                res_words.append(WordNetLemmatizer().lemmatize(word, tag))
        final = ' '.join(res_words)
        return final

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        if type(X)==str:
            X = [X]
        return [self.lemmatize_sentence(text) for text in X]

logit_pipeline = PMMLPipeline([('preproc', TextPreprocessor()), 
                               ('vect', CountVectorizer(ngram_range=(1,2))), 
                               ('tfidf', TfidfTransformer(use_idf=True)), 
                               ('clf', LogisticRegression(C=11.3))])

logit_pipeline.active_fields = np.asarray(["Input Text"])
logit_pipeline.target_fields = np.asarray(["Output"])
logit_pipeline.fit(X, y)
sklearn2pmml(logit_pipeline, 'logit.pmml', user_classpath = ["/path/sklearn2pmml-plugin/target/sklearn2pmml-plugin-1.0-SNAPSHOT.jar"])

When I run the last line, I'm getting this as an exception.

Standard output is empty
Standard error:
May 02, 2019 1:07:00 AM org.jpmml.sklearn.Main run
INFO: Parsing PKL..
May 02, 2019 1:07:02 AM org.jpmml.sklearn.Main run
INFO: Parsed PKL in 2150 ms.
May 02, 2019 1:07:02 AM org.jpmml.sklearn.Main run
INFO: Converting..
May 02, 2019 1:07:03 AM org.jpmml.sklearn.Main run
SEVERE: Failed to convert
java.lang.IllegalArgumentException: Tuple contains an unsupported value (Python class __main__.TextPreprocessor)
        at org.jpmml.sklearn.CastFunction.apply(CastFunction.java:43)
        at com.google.common.collect.Lists$TransformingRandomAccessList.get(Lists.java:599)
        at sklearn.TransformerUtil.getHead(TransformerUtil.java:35)
        at sklearn2pmml.pipeline.PMMLPipeline.encodePMML(PMMLPipeline.java:187)
        at org.jpmml.sklearn.Main.run(Main.java:145)
        at org.jpmml.sklearn.Main.main(Main.java:94)
Caused by: java.lang.ClassCastException: Cannot cast net.razorvine.pickle.objects.ClassDict to sklearn.Transformer
        at java.lang.Class.cast(Class.java:3369)
        at org.jpmml.sklearn.CastFunction.apply(CastFunction.java:41)
        ... 5 more

Exception in thread "main" java.lang.IllegalArgumentException: Tuple contains an unsupported value (Python class __main__.TextPreprocessor)
        at org.jpmml.sklearn.CastFunction.apply(CastFunction.java:43)
        at com.google.common.collect.Lists$TransformingRandomAccessList.get(Lists.java:599)
        at sklearn.TransformerUtil.getHead(TransformerUtil.java:35)
        at sklearn2pmml.pipeline.PMMLPipeline.encodePMML(PMMLPipeline.java:187)
        at org.jpmml.sklearn.Main.run(Main.java:145)
        at org.jpmml.sklearn.Main.main(Main.java:94)
Caused by: java.lang.ClassCastException: Cannot cast net.razorvine.pickle.objects.ClassDict to sklearn.Transformer
        at java.lang.Class.cast(Class.java:3369)
        at org.jpmml.sklearn.CastFunction.apply(CastFunction.java:41)
        ... 5 more

Traceback (most recent call last):
  File "test-pmml.py", line 126, in <module>
    sklearn2pmml(logit_pipeline, 'logit.pmml', user_classpath = ["/path/sklearn2pmml-plugin/target/sklearn2pmml-plugin-1.0-SNAPSHOT.jar"])
  File "/home/devuser/.local/lib/python2.7/site-packages/sklearn2pmml/__init__.py", line 252, in sklearn2pmml
    raise RuntimeError("The JPMML-SkLearn conversion application has failed. The Java executable should have printed more information about the failure into its standard output and/or standard error streams")
RuntimeError: The JPMML-SkLearn conversion application has failed. The Java executable should have printed more information about the failure into its standard output and/or standard error streams

Can you please help me understand what I'm doing wrong? Or does sklearn2pmml-plugin not support some of the things I'm trying to do here?

vruusmann commented 5 years ago

Functional duplicate of https://github.com/jpmml/sklearn2pmml/issues/91

The conversion of user-defined text preprocessors is not supported, and it's unlikely that it will be ever supported for the following reasons:

  1. PMML is not a general-purpose programming language. It is impossible to take a Python application class (e.g. __main__.TextPreprocessor) and translate it to PMML so that its functionality is exactly preserved.
  2. Your text preprocessor depends on 3rd-party libraries (some NLTK library). Given the impossibility of the above step, it's even less feasible to translate 3rd-party compiled libraries to PMML.

What you should try/do:

  1. Refactor your data source so that the ML workflow deals with clean data. Data cleaning/preprocessing is a complex process, and should be handled separately from the ML loop.
  2. PMML can do simple string processing such as regexes and conversion to upper/lowercase. Try to systematize and extract this functionality out of the general data cleaning/preprocessing logic (and keep it closer to the ML loop).

dilipsundar commented 5 years ago

Thanks for the suggestion, @vruusmann. So if I get rid of the functions word_tag and lemmatize_sentence, I'll be able to include clean_text as part of the pipeline?

vruusmann commented 5 years ago

> So if I get rid of the functions word_tag and lemmatize_sentence, I'll be able to include clean_text as part of the pipeline?

Yes, it should be possible to translate the business logic of the clean_text method into PMML. The only problematic bit is the statement text = contractions.fix(text), which is probably calling some 3rd-party library again.

Here's a Python to PMML mapping:
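A rough sketch of such a mapping, assuming the standard PMML built-in functions lowercase, trimBlanks and replace as the targets. The sample string is made up for illustration, and contractions.fix is left out because it has no PMML counterpart:

```python
import re

def clean_text(text):
    # str.lower()  -> PMML built-in function "lowercase"
    text = text.lower()
    # str.strip()  -> PMML built-in function "trimBlanks"
    text = text.strip()
    # re.sub(...)  -> PMML built-in function "replace" (regex pattern, replacement)
    text = re.sub(" +", " ", text)
    text = re.sub("[^A-Za-z]+", " ", text)
    return text

# Hypothetical sample input, not taken from the original data set
cleaned = clean_text("  Hello,   World 42!  ")
```

Note that the last substitution can leave a trailing space behind, because the strip step runs before the regex replacements.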

dilipsundar commented 5 years ago

I've modified the Preprocessor class as per your suggestion and it looks like this.

class TextPreprocessor(BaseEstimator, TransformerMixin):

    def clean_text(self, text):
        text = text.lower()
        text = text.strip()
        text = re.sub(' +', ' ',text)
        text = re.sub('[^A-Za-z]+', ' ', text)
        return text

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        if type(X)==str:
            X = [X]
        return [self.clean_text(text) for text in X]

logit_pipeline = PMMLPipeline([('preproc', TextPreprocessor()), 
                               ('vect', CountVectorizer(ngram_range=(1,2))), 
                               ('tfidf', TfidfTransformer(use_idf=True)), 
                               ('clf', LogisticRegression(C=11.3))])

logit_pipeline.active_fields = np.asarray(["Input Text"])
logit_pipeline.target_fields = np.asarray(["Output"])
logit_pipeline.fit(X, y)
sklearn2pmml(logit_pipeline, 'logit.pmml', user_classpath = ["/path/sklearn2pmml-plugin/target/sklearn2pmml-plugin-1.0-SNAPSHOT.jar"])

The error I'm getting:

Standard output is empty
Standard error:
May 08, 2019 5:31:50 AM org.jpmml.sklearn.Main run
INFO: Parsing PKL..
May 08, 2019 5:31:54 AM org.jpmml.sklearn.Main run
INFO: Parsed PKL in 3791 ms.
May 08, 2019 5:31:54 AM org.jpmml.sklearn.Main run
INFO: Converting..
May 08, 2019 5:31:56 AM org.jpmml.sklearn.Main run
SEVERE: Failed to convert
java.lang.IllegalArgumentException: Tuple contains an unsupported value (Python class __main__.TextPreprocessor)
        at org.jpmml.sklearn.CastFunction.apply(CastFunction.java:43)
        at com.google.common.collect.Lists$TransformingRandomAccessList.get(Lists.java:599)
        at sklearn.TransformerUtil.getHead(TransformerUtil.java:35)
        at sklearn2pmml.pipeline.PMMLPipeline.encodePMML(PMMLPipeline.java:187)
        at org.jpmml.sklearn.Main.run(Main.java:145)
        at org.jpmml.sklearn.Main.main(Main.java:94)
Caused by: java.lang.ClassCastException: Cannot cast net.razorvine.pickle.objects.ClassDict to sklearn.Transformer
        at java.lang.Class.cast(Class.java:3369)
        at org.jpmml.sklearn.CastFunction.apply(CastFunction.java:41)
        ... 5 more

Exception in thread "main" java.lang.IllegalArgumentException: Tuple contains an unsupported value (Python class __main__.TextPreprocessor)
        at org.jpmml.sklearn.CastFunction.apply(CastFunction.java:43)
        at com.google.common.collect.Lists$TransformingRandomAccessList.get(Lists.java:599)
        at sklearn.TransformerUtil.getHead(TransformerUtil.java:35)
        at sklearn2pmml.pipeline.PMMLPipeline.encodePMML(PMMLPipeline.java:187)
        at org.jpmml.sklearn.Main.run(Main.java:145)
        at org.jpmml.sklearn.Main.main(Main.java:94)
Caused by: java.lang.ClassCastException: Cannot cast net.razorvine.pickle.objects.ClassDict to sklearn.Transformer
        at java.lang.Class.cast(Class.java:3369)
        at org.jpmml.sklearn.CastFunction.apply(CastFunction.java:41)
        ... 5 more

Traceback (most recent call last):
  File "test-pmml.py", line 126, in <module>
    sklearn2pmml(logit_pipeline, 'logit.pmml', user_classpath = ["/path/sklearn2pmml-plugin/target/sklearn2pmml-plugin-1.0-SNAPSHOT.jar"])
  File "/home/devuser/.local/lib/python2.7/site-packages/sklearn2pmml/__init__.py", line 252, in sklearn2pmml
    raise RuntimeError("The JPMML-SkLearn conversion application has failed. The Java executable should have printed more information about the failure into its standard output and/or standard error streams")
RuntimeError: The JPMML-SkLearn conversion application has failed. The Java executable should have printed more information about the failure into its standard output and/or standard error streams

Is there something else I've not removed as part of the clean up?

vruusmann commented 5 years ago

The idea is to expand the business logic of your __main__.TextPreprocessor.clean_text method into a list of the SkLearn2PMML package's built-in transformer classes.

Something like this:

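from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn_pandas import DataFrameMapper
from sklearn2pmml.pipeline import PMMLPipeline
from sklearn2pmml.preprocessing import ReplaceTransformer, StringNormalizer

# Re-express clean_text as a chain of PMML-convertible transformers
mapper = DataFrameMapper([
  (["Input Text"], [StringNormalizer("lowercase", trim_blanks = True), ReplaceTransformer(' +', ' '), ReplaceTransformer('[^A-Za-z]+', ' ')])
])
pipeline = PMMLPipeline([
  ("mapper", mapper),
  ('vect', CountVectorizer(ngram_range=(1,2))),
  ('tfidf', TfidfTransformer(use_idf=True)),
  ('clf', LogisticRegression(C=11.3))
])
pipeline.fit(X, y)

dilipsundar commented 5 years ago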

DataFrameMapper doesn't work on objects that aren't dataframes right?

I'm expecting a string input, which I'm cleaning and converting into an array (the output of the TextPreprocessor's transform function) before passing it to the CountVectorizer and others.

vruusmann commented 5 years ago

> DataFrameMapper doesn't work on objects that aren't dataframes right?

It is designed to work with pandas.DataFrame, because it selects columns based on name (not possible with Numpy arrays).

> I'm expecting a string input, which I'm cleaning and converting into an array

You have a one-dimensional input space, right? Just "convert" it temporarily from Numpy array to pandas.DataFrame. You can name this singleton column whatever you like, it doesn't have to be called "Input Text".
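A minimal sketch of that temporary conversion, assuming the input arrives as a 1-D Numpy array of strings. The sample texts are made up, and the column name is arbitrary:

```python
import numpy as np
import pandas as pd

# Hypothetical 1-D input: one text document per element
texts = np.array(["First document", "Second document"])

# Wrap the array into a single-column DataFrame; the column name
# only has to match the name that DataFrameMapper selects
df = pd.DataFrame({"Input Text": texts})

# A single str can be wrapped the same way, as a one-row DataFrame
single = pd.DataFrame({"Input Text": ["Just one document"]})
```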

Alternatively, you can replace DataFrameMapper with a plain sklearn.pipeline.Pipeline:

pipeline = PMMLPipeline([
  ("text_proc", Pipeline([
    ("lower_trim", StringNormalizer(..)),
    ("rep1", ReplaceTransformer(..)),
    ("rep2", ReplaceTransformer(..))
  ])),
  ('vect', CountVectorizer(ngram_range=(1,2))),
  ('tfidf', TfidfTransformer(use_idf=True)),
  ('clf', LogisticRegression(C=11.3))
])

dilipsundar commented 5 years ago

> You have a one-dimensional input space, right? Just "convert" it temporarily from Numpy array to pandas.DataFrame.

I'm expecting a text input, i.e. a str object. Is it possible to do this conversion within the pipeline itself?

nwxxb commented 9 months ago

I am sorry if I am meddling in this old discussion, but when I try your example from https://github.com/jpmml/sklearn2pmml/issues/150#issuecomment-490780505 it seems that ReplaceTransformer returns a numpy.ndarray, which doesn't have the attribute lower (which is called in CountVectorizer). Am I missing some steps?

vruusmann commented 9 months ago

@nwxxb

Did you try running the referenced example? What was the Python error? Or were you simply hypothesizing?

> it seems that ReplaceTransformer returns numpy.ndarray which doesn't have attribute lower

All SkLearn transformers return array-like data containers. What matters is the element type of a data container.

In the current case, the ReplaceTransformer should return an "array of strings" (the data container type is numpy.ndarray, its element type is str).

The next pipeline step will be calling the lower() function on array elements (which are strings), not the array itself.
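A small illustration of the container type versus element type distinction, using plain Numpy and made-up sample strings (a sketch for the 1-D case only):

```python
import numpy as np

# 1-D array of strings: the container is numpy.ndarray,
# the elements behave like str
docs = np.array(["Hello World", "Foo Bar"])

# Iterating over a 1-D array yields the string elements,
# so calling lower() on each of them works
lowered = [doc.lower() for doc in docs]
element_is_str = isinstance(docs[0], str)
```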

nwxxb commented 9 months ago

Thank you for your response, and yes, I did. Here is the code I am running:

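import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn2pmml.pipeline import PMMLPipeline
from sklearn2pmml.preprocessing import ReplaceTransformer

# df_try.shape == (8, 2), with two columns:
# `text_content (str)` and `label_id (float)`
# (`d` and `labels` are defined elsewhere)
text_try = pd.Series(data=d, name="text_content")
label_try = pd.Series(data=labels, name="label_id")
df_try = pd.concat([text_try, label_try], axis=1)

pipeline = PMMLPipeline([
  ("preprocessing", Pipeline([
    # Raw string, so that \d is not treated as a string escape
    ("remove_number", ReplaceTransformer(r'\d+', '')),
  ])),
  ('vect', CountVectorizer(ngram_range=(1,2))),
])

pipeline.fit_transform(df_try['text_content'])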

This yields the following error:

---------------------------------------------------------------------------

AttributeError                            Traceback (most recent call last)

<ipython-input-20-54aadab62a28> in <cell line: 13>()
     11 ])
     12 
---> 13 pipeline.fit_transform(df_try['text_content'])

4 frames

/usr/local/lib/python3.10/dist-packages/sklearn/pipeline.py in fit_transform(self, X, y, **fit_params)
    443             fit_params_last_step = fit_params_steps[self.steps[-1][0]]
    444             if hasattr(last_step, "fit_transform"):
--> 445                 return last_step.fit_transform(Xt, y, **fit_params_last_step)
    446             else:
    447                 return last_step.fit(Xt, y, **fit_params_last_step).transform(Xt)

/usr/local/lib/python3.10/dist-packages/sklearn/feature_extraction/text.py in fit_transform(self, raw_documents, y)
   1386                     break
   1387 
-> 1388         vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary_)
   1389 
   1390         if self.binary:

/usr/local/lib/python3.10/dist-packages/sklearn/feature_extraction/text.py in _count_vocab(self, raw_documents, fixed_vocab)
   1273         for doc in raw_documents:
   1274             feature_counter = {}
-> 1275             for feature in analyze(doc):
   1276                 try:
   1277                     feature_idx = vocabulary[feature]

/usr/local/lib/python3.10/dist-packages/sklearn/feature_extraction/text.py in _analyze(doc, analyzer, tokenizer, ngrams, preprocessor, decoder, stop_words)
    109     else:
    110         if preprocessor is not None:
--> 111             doc = preprocessor(doc)
    112         if tokenizer is not None:
    113             doc = tokenizer(doc)

/usr/local/lib/python3.10/dist-packages/sklearn/feature_extraction/text.py in _preprocess(doc, accent_function, lower)
     67     """
     68     if lower:
---> 69         doc = doc.lower()
     70     if accent_function is not None:
     71         doc = accent_function(doc)

AttributeError: 'numpy.ndarray' object has no attribute 'lower'

It seems that ReplaceTransformer actually wraps each input string in a numpy.ndarray, and CountVectorizer then calls the lower method on that array.

vruusmann commented 9 months ago

@nwxxb Thanks for bringing this issue to my attention. Not exactly a bug, but needs fixing nonetheless.

I started playing with your example code, and discovered it's an "array shape mismatch". Specifically, the ReplaceTransformer is returning a 2-D array (n_rows, 1). However, the CountVectorizer only supports 1-D arrays (aka column vectors), which have a shape (n_rows, ) (note the missing second dimension!).

The issue can be fixed by reshaping the array between ReplaceTransformer and CountVectorizer steps from 2-D to 1-D (aka column vector) using the sklearn2pmml.util.Reshaper meta-transformer:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn2pmml.preprocessing import ReplaceTransformer
from sklearn2pmml.util import Reshaper

pipeline = Pipeline([
    ("replace", ReplaceTransformer(r"\d+", "")),
    # THIS! Reshape from 2-D array of strings to 1-D array of strings
    ("reshaper", Reshaper((-1, ))),
    ("count_vectorize", CountVectorizer(ngram_range = (1, 2)))
])

Looks like the string functions (lower, upper, strip etc.) are only reachable when the array is 1-D: iterating over a 2-D array yields row arrays, not individual strings.
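The shape mismatch can be reproduced with plain Numpy, without any SkLearn2PMML classes. This is a minimal sketch with made-up sample strings, where the 2-D array stands in for the ReplaceTransformer output:

```python
import numpy as np

# Stand-in for the transformer output: 2-D array of shape (n_rows, 1)
col_2d = np.array([["Item 1"], ["Item 2"]])

# Iterating over a 2-D array yields 1-D row arrays, which have no lower()
first_2d = next(iter(col_2d))
has_lower_2d = hasattr(first_2d, "lower")

# After reshaping to (n_rows, ), iteration yields the strings themselves
flat_1d = col_2d.reshape((-1, ))
first_1d = next(iter(flat_1d))
has_lower_1d = hasattr(first_1d, "lower")
```

Here has_lower_2d comes out False and has_lower_1d comes out True, which matches the AttributeError reported above.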

nwxxb commented 9 months ago

Thank you for your response, it solved my problem :smile:

vruusmann commented 6 months ago

@nwxxb If you're still doing [ReplaceTransformer(), CountVectorizer()] workflows, then starting from SkLearn2PMML version 0.103.2 it's possible to omit the Reshaper() helper step.

See https://github.com/jpmml/sklearn2pmml/commit/e6bd044b7c3ddaf9cdac3281971cfe70c1710407