jpmml / sklearn2pmml

Python library for converting Scikit-Learn pipelines to PMML
GNU Affero General Public License v3.0

Failure to create a pmml file when using TfidfVectorizer with analyzer = 'char_wb' #202

Open selbenna opened 4 years ago

selbenna commented 4 years ago

Hi Villu,

I'm trying to create a PMML file from the sklearn model below. I use TfidfVectorizer at the character level and a random forest classifier. The model predicts the datatype of a column based on the name of that column. It works just fine with the configuration below, but when I create a PMML pipeline I get an error.

Here's my code:

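import pandas as pd
from keras.utils import to_categorical  # assumed origin of to_categorical
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.feature_extraction.text import Splitter
from sklearn2pmml.pipeline import PMMLPipeline

raw_data = pd.read_csv("data_columns.csv")
X = raw_data["name"].tolist()

# Labels are integer-encoded and then one-hot encoded before fitting
labels = raw_data["type"].tolist()
le = LabelEncoder()
labels = le.fit_transform(labels)
labels = to_categorical(labels)

vectorizer = TfidfVectorizer(analyzer = "char_wb", ngram_range = (1, 3), preprocessor = None,
                             lowercase = False, tokenizer = Splitter(), token_pattern = None,
                             norm = None)

rfc = RandomForestClassifier(n_estimators = 500)

pipeline = PMMLPipeline([
  ("transformer", vectorizer),
  ("classifier", rfc)
])

pipeline.fit(X, labels)
sklearn2pmml(pipeline, "datatype_prediction.pmml", with_repr = True)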

I get this error:

Standard output is empty
Standard error:
janv. 07, 2020 6:41:26 PM org.jpmml.sklearn.Main run
INFOS: Parsing PKL..
janv. 07, 2020 6:43:00 PM org.jpmml.sklearn.Main run
INFOS: Parsed PKL in 93847 ms.
janv. 07, 2020 6:43:00 PM org.jpmml.sklearn.Main run
INFOS: Converting..
janv. 07, 2020 6:43:00 PM sklearn2pmml.pipeline.PMMLPipeline initTargetFields
WARNING: Attribute 'sklearn2pmml.pipeline.PMMLPipeline.target_fields' is not set. Assuming y as the name of the target field
janv. 07, 2020 6:43:00 PM org.jpmml.sklearn.Main run
**SEVERE: Failed to convert
java.lang.IllegalArgumentException: The value of 'sklearn.ensemble.forest.RandomForestClassifier.classes_' attribute (Java class java.util.ArrayList) is not a supported array type**
        at org.jpmml.sklearn.PyClassDict.getArray(PyClassDict.java:163)
        at sklearn.Classifier.getClasses(Classifier.java:43)
        at sklearn.ClassifierUtil.getClasses(ClassifierUtil.java:32)
        at sklearn2pmml.pipeline.PMMLPipeline.encodePMML(PMMLPipeline.java:133)
        at org.jpmml.sklearn.Main.run(Main.java:145)
        at org.jpmml.sklearn.Main.main(Main.java:94)

Exception in thread "main" java.lang.IllegalArgumentException: The value of 'sklearn.ensemble.forest.RandomForestClassifier.classes_' attribute (Java class java.util.ArrayList) is not a supported array type
        at org.jpmml.sklearn.PyClassDict.getArray(PyClassDict.java:163)
        at sklearn.Classifier.getClasses(Classifier.java:43)
        at sklearn.ClassifierUtil.getClasses(ClassifierUtil.java:32)
        at sklearn2pmml.pipeline.PMMLPipeline.encodePMML(PMMLPipeline.java:133)
        at org.jpmml.sklearn.Main.run(Main.java:145)
        at org.jpmml.sklearn.Main.main(Main.java:94)

Traceback (most recent call last):

  File "<ipython-input-293-05f0766c0610>", line 1, in <module>
    sklearn2pmml(pipeline, "datatype_prediction.pmml", with_repr = True)

  File "/Users/.local/lib/python3.7/site-packages/sklearn2pmml/__init__.py", line 265, in sklearn2pmml
    raise RuntimeError("The JPMML-SkLearn conversion application has failed. The Java executable should have printed more information about the failure into its standard output and/or standard error streams")

**RuntimeError**: The JPMML-SkLearn conversion application has failed. The Java executable should have printed more information about the failure into its standard output and/or standard error streams

Could you please tell me what's wrong with the vectorizer, or how I could use a character-level analyzer in TfidfVectorizer? Thank you,

Sarah

vruusmann commented 4 years ago

Let's solve one issue at a time. Right now, the SkLearn2PMML/JPMML-SkLearn stack is complaining about an unexpected RandomForestClassifier.classes_ attribute value.

Why are you fitting a LabelEncoder object separately (and then passing its transformation results to the (PMML)Pipeline.fit(X, y) method)? Why don't you simply do the following?

raw_data = pd.read_csv("data_columns.csv")

X = raw_data["name"]
y = raw_data["type"]

pipeline = PMMLPipeline([
  ("transformer", vectorizer),
  ("classifier",rfc)
])
pipeline.fit(X, y)
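
For context on the first error: one-hot encoding turns the 1-D label vector into a 2-D indicator matrix, and scikit-learn treats a 2-D y as a multi-output target, storing classes_ as a list of arrays (one per output column) instead of a single array. The plain-Python sketch below only illustrates that change of shape. The one_hot helper is a hypothetical stand-in for to_categorical, and the sample labels are made up.

```python
def one_hot(encoded_labels, num_classes):
    """Hypothetical stand-in for to_categorical: one row of 0/1 values per sample."""
    return [[1 if k == label else 0 for k in range(num_classes)]
            for label in encoded_labels]

types = ["int", "string", "int", "date"]      # made-up sample labels
classes = sorted(set(types))                   # LabelEncoder also sorts its classes
encoded = [classes.index(t) for t in types]    # 1-D integer codes, one per sample
indicator = one_hot(encoded, len(classes))     # 2-D matrix, one column per class

print(encoded)    # a single target column
print(indicator)  # several target columns, read by the classifier as multi-output
```

Passing the original string labels directly, as in the snippet above, keeps the target 1-D.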
selbenna commented 4 years ago

Thank you for your response!

I tried what you suggested and I get this error now:

Standard output is empty
Standard error:
janv. 08, 2020 9:47:22 AM org.jpmml.sklearn.Main run
INFOS: Parsing PKL..
janv. 08, 2020 9:48:16 AM org.jpmml.sklearn.Main run
INFOS: Parsed PKL in 53752 ms.
janv. 08, 2020 9:48:16 AM org.jpmml.sklearn.Main run
INFOS: Converting..
janv. 08, 2020 9:48:16 AM org.jpmml.sklearn.Main run
SEVERE: Failed to convert
**java.lang.IllegalArgumentException: char_wb**
        at sklearn.feature_extraction.text.CountVectorizer.encodeDefineFunction(CountVectorizer.java:153)
        at sklearn.feature_extraction.text.TfidfVectorizer.encodeDefineFunction(TfidfVectorizer.java:84)
        at sklearn.feature_extraction.text.CountVectorizer.encodeFeatures(CountVectorizer.java:115)
        at sklearn.feature_extraction.text.TfidfVectorizer.encodeFeatures(TfidfVectorizer.java:77)
        at sklearn.Transformer.updateAndEncodeFeatures(Transformer.java:118)
        at sklearn.Composite.encodeFeatures(Composite.java:129)
        at sklearn2pmml.pipeline.PMMLPipeline.encodePMML(PMMLPipeline.java:208)
        at org.jpmml.sklearn.Main.run(Main.java:145)
        at org.jpmml.sklearn.Main.main(Main.java:94)

Exception in thread "main" java.lang.IllegalArgumentException: char_wb
        at sklearn.feature_extraction.text.CountVectorizer.encodeDefineFunction(CountVectorizer.java:153)
        at sklearn.feature_extraction.text.TfidfVectorizer.encodeDefineFunction(TfidfVectorizer.java:84)
        at sklearn.feature_extraction.text.CountVectorizer.encodeFeatures(CountVectorizer.java:115)
        at sklearn.feature_extraction.text.TfidfVectorizer.encodeFeatures(TfidfVectorizer.java:77)
        at sklearn.Transformer.updateAndEncodeFeatures(Transformer.java:118)
        at sklearn.Composite.encodeFeatures(Composite.java:129)
        at sklearn2pmml.pipeline.PMMLPipeline.encodePMML(PMMLPipeline.java:208)
        at org.jpmml.sklearn.Main.run(Main.java:145)
        at org.jpmml.sklearn.Main.main(Main.java:94)

What do you think? Thanks!

vruusmann commented 4 years ago

java.lang.IllegalArgumentException: char_wb

That's the exception we've been looking for - it means that the "char_wb" text analyzer type is currently not supported.

Here's what can be done about it:

  1. Would it be possible to replace "char_wb" text analyzer with "char" text analyzer? I presume your texts are mostly single-word tokens, so it shouldn't make any difference.
  2. If it's possible, then transform your text input from words to whitespace-separated tokens using the regex string transformer (insert it as the first step into your pipeline). Then, replace the "char" text analyzer with "word" text analyzer, and everything should work.

selbenna commented 4 years ago

1 - I already tried with the "char" text analyzer and I get the same error.

2 - I thought about this idea but wasn't sure it was going to work. Thanks for suggesting it, I'm going to try it and will let you know :)

Thanks

vruusmann commented 4 years ago

1 - I already tried with the "char" text analyzer and I get the same error.

The only supported text analyzer mode is "word".

Try to transform your text input column so that it can be regarded as a collection of words. The simplest solution would be to use a regex that surrounds every "useful" character with whitespace characters.
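
To illustrate the idea in plain Python (a minimal sketch, not part of SkLearn2PMML): once every character is separated by a space, word n-grams over the spaced-out string line up with character n-grams over the original string. The helper names space_out, char_ngrams and word_ngrams are hypothetical, and the n-gram helpers only approximate what the vectorizer does (no "char_wb" padding, no TF-IDF weighting).

```python
import re

def space_out(text):
    """Insert a single space between every pair of adjacent characters."""
    return re.sub(r"(?<!^)(?!$)", " ", text)

def char_ngrams(text, lo, hi):
    """Character n-grams for n in [lo, hi], in the spirit of analyzer="char"."""
    return [text[i:i + n]
            for n in range(lo, hi + 1)
            for i in range(len(text) - n + 1)]

def word_ngrams(text, lo, hi):
    """Whitespace-token n-grams for n in [lo, hi], in the spirit of analyzer="word"."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n])
            for n in range(lo, hi + 1)
            for i in range(len(tokens) - n + 1)]

name = "user_id"  # made-up column name
spaced = space_out(name)
as_words = word_ngrams(spaced, 1, 3)
as_chars = char_ngrams(name, 1, 3)

print(spaced)
# The two feature lists differ only by the spaces inside each n-gram.
print([gram.replace(" ", "") for gram in as_words] == as_chars)
```

Note that TfidfVectorizer's default token_pattern only keeps tokens of two or more characters, so a whitespace-splitting tokenizer is needed for the single-character tokens to survive. The equivalence also assumes the original names contain no whitespace of their own.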

It should be possible to upgrade the SkLearn2PMML/JPMML-SkLearn stack to support "char" and "char_wb" text analyzers as well. However, this is not a priority for me.

Leaving this issue open, in case my priorities change.

selbenna commented 4 years ago

I used a lambda function to split the column names into characters like this:

raw_data["name"] = raw_data["name"].apply(lambda word: " ".join(word))
x = raw_data["name"]

Then I used the pipeline above with the 'word' analyzer and I don't have the error anymore.

But I don't know how to include this preprocessing in the pipeline ... Can I add this lambda function in the PMMLPipeline using the ExpressionTransformer feature?

Could you please help?

Thanks

vruusmann commented 4 years ago

Can I add this lambda function in the PMMLPipeline using the ExpressionTransformer feature?

Probably not, because your lambda uses Python language constructs/functions that are not supported by ExpressionTransformer yet.

It should be possible using special-purpose string transformers. I'd try to formalize a regular expression (regex) pattern, and apply it to the original (aka raw) text feature using the sklearn2pmml.preprocessing.ReplaceTransformer.

What you need is some regex that inserts whitespace characters into a string.
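
As a rough illustration in plain Python, here are two candidate patterns compared against the " ".join(word) reference. The sample name is made up, and this only shows the behavior of Python's re module. Whether the regex engine on the PMML side treats these lookarounds identically is an assumption that would need checking.

```python
import re

name = "user-id"  # made-up column name containing a non-word character

# Zero-width match at every interior position of the string.
everywhere = re.sub(r"(?<!^)(?!$)", " ", name)

# Zero-width match only at interior positions that are NOT word boundaries.
non_boundary = re.sub(r"(?<!^)\B(?!$)", " ", name)

print(everywhere)                    # u s e r - i d
print(non_boundary)                  # u s e r-i d
print(everywhere == " ".join(name))  # True
```

A pattern built on \B alone skips the positions next to non-word characters such as "-", so it matches the lambda only for names made up entirely of word characters (letters, digits, underscore).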

selbenna commented 4 years ago

I found a regex pattern to insert whitespace characters:

regex_pattern = "(?<!^)(\B|b)(?!$)"
transformer =  ReplaceTransformer(regex_pattern, " ")

vectorizer = TfidfVectorizer(analyzer = "word", ngram_range=(1, 2), preprocessor = None,lowercase = False,
                             tokenizer = Splitter(), norm = None)

pipeline = PMMLPipeline([
  ("transformer", transformer),
  ("preprocessing", vectorizer),
  ("classifier", rfc)
])

pipeline.fit(x, y)

But I get this error:

TypeError: cannot use a string pattern on a bytes-like object

How should I fit the data into this pipeline?

Thanks a lot for your help.

vruusmann commented 4 years ago

But I get this error:

That's a 100% Python language stack error (not a (J)PMML one). It means that you're mixing/confusing str and bytes data types somewhere.

Perhaps you need to convert a bytes object to a str object by specifying what the intended character encoding is. Also, upgrading from Python 2.7 to Python 3.X might solve the issue.
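
A minimal reproduction of this class of error in plain Python, assuming the offending value really is a bytes object (the sample value and the UTF-8 encoding are made up for illustration):

```python
import re

raw = b"column_name"  # bytes, e.g. as read from a binary source

try:
    re.sub(r"_", " ", raw)       # str pattern applied to a bytes value
except TypeError as error:
    message = str(error)
    print(message)

text = raw.decode("utf-8")       # assumption: the data is UTF-8 encoded
print(re.sub(r"_", " ", text))   # str pattern applied to a str value works
```

The same message can appear for other objects that expose the buffer protocol, so the value reaching the regex is not necessarily a literal bytes object.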

selbenna commented 4 years ago

I found the source of the error. When I feed the output of the transformer, which is an ndarray object, to the TfidfVectorizer, I get:

TypeError: cannot use a string pattern on a bytes-like object

So I convert the ndarray to a Series:

transformed_data = transformer.fit_transform(x)

# Take the single column of the 2-D array and wrap it in a Series
array_to_series = map(lambda row: row[0], transformed_data)
ser = pd.Series(array_to_series)

I apply the TfidfVectorizer to ser

X = vectorizer.fit_transform(ser)

and I don't get the error anymore.

So my question is: how can I do this conversion from the ndarray output of the transformer to a Series object, so that it can be fed to the vectorizer inside the pipeline?

pipeline = PMMLPipeline([
  ("transformer", transformer),
  ("preprocessing", vectorizer),
  ("classifier", rfc)
])

Should I insert another transformation between the transformer and the vectorizer?

Thank you!

vruusmann commented 4 years ago

Are you using the latest SkLearn2PMML package version? The ReplaceTransformer transformer should be returning a single-column 2-D Numpy array currently (the _col2d(X) utility method): https://github.com/jpmml/sklearn2pmml/blob/0.52.1/sklearn2pmml/preprocessing/__init__.py#L295

I wonder why the TfidfVectorizer step doesn't like it. Perhaps the ReplaceTransformer transformer should use some other return type/configuration.

how can I do this conversion from the ndarray output of the transformer to a Series object, so that it can be fed to the vectorizer inside the pipeline?

Have you tried wrapping the whole feature engineering into sklearn_pandas.DataFrameMapper or sklearn.compose.ColumnTransformer? These meta-transformers are pretty good at reshaping data between steps.

Something like this:

pipeline = PMMLPipeline([
  ("mapper", DataFrameMapper([
    (["name"], [ReplaceTransformer(..), TfidfVectorizer(..)])
  ])),
  ("classifier", RandomForestClassifier())
])

selbenna commented 4 years ago

Yes, I'm using the latest SkLearn2PMML. The ReplaceTransformer returns a 2-D Numpy array and that's the problem: the TfidfVectorizer doesn't like it.

I tried:

pipeline = PMMLPipeline([
  ("mapper", DataFrameMapper([
    (["name"], [ReplaceTransformer("(?<!^)(\B)(?!$)", " "), vectorizer])
  ])),
  ("classifier", rfc)
])

pipeline.fit(raw_data, y)

But I get this error:

TypeError: ['name']: cannot use a string pattern on a bytes-like object

I also tried with ColumnTransformer and got the same error.

selbenna commented 4 years ago

Is there a way I could convert the output of _col2d(x) using this function inside the pipeline?

array_to_series = map(lambda x: x[0], transform(x))
ser = pd.Series(array_to_series)

This ser variable would then be fed to the vectorizer:

vectorizer.fit_transform(ser)

Thank you!