Closed: eddies5 closed this issue 3 years ago
I'm trying to use a `CountVectorizer` in a `PMMLPipeline` to split a column's values on `##`.
You're running into limitations that have been placed there intentionally, in order to ensure that the current Python representation and the future PMML representation behave exactly the same way.
Specifically, the only supported "splitting configuration" is the one that is hard-coded as the `sklearn2pmml.feature_extraction.text.Splitter` class:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn2pmml.feature_extraction.text import Splitter

vectorizer = CountVectorizer(tokenizer = Splitter())
```
If you want to achieve custom splitting behaviour (such as using `##` as the delimiter), then you'd need to do one extra pre-processing step on that text column first. For example, you could run a regex transform that replaces `##` with `" "` (the space character), so that the `sklearn2pmml.feature_extraction.text.Splitter` class can do its job.
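For illustration, here is a minimal sketch of that pre-processing step in plain Python (the `replace_delimiter` helper is hypothetical; in a real pipeline the replacement would have to be expressed as a PMML-convertible transformation, which is what the feature request below is about):

```python
import re

def replace_delimiter(text, delimiter = "##"):
    # Hypothetical pre-processing step: replace the custom "##" delimiter
    # with a space, so that a whitespace-based splitter can tokenize it.
    return re.sub(re.escape(delimiter), " ", text)

replace_delimiter("heading/subheading##heading2/subheading2")
# 'heading/subheading heading2/subheading2'
```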
Here's a related feature request about RegEx transformers: https://github.com/jpmml/jpmml-sklearn/issues/81
I am also running into this issue. The `tokenizer` is optional as long as you supply the `token_pattern` in sklearn:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/text.py#L340
@vruusmann what is restrictive on your end about adopting the same logic?
@amoldavsky The blocking matter is a conceptual incompatibility between Scikit-Learn and PMML. In RegEx terms:

- `\w+` (defines a "word")
- `\W+` (defines a "non-word")

The workaround appears to be extending PMML's `TextIndex` element with a new attribute that can capture a Scikit-Learn compatible "wordRE": http://dmg.org/pmml/v4-3/Transformations.html#xsdElement_TextIndex
Default/now:

```xml
<TextIndex wordSeparatorCharacterRE="\s+">
</TextIndex>
```

Vendor extension/future:

```xml
<TextIndex wordRE="\S+">
</TextIndex>
```
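To make the conceptual difference concrete (my own illustration, not part of the proposal): splitting on a word *separator* RE and matching the *words* themselves yield the same tokens on straightforward input, but they are expressed as complementary patterns:

```python
import re

text = "one two\tthree"

# PMML-style: split on word separators, discarding empty strings
split_tokens = [t for t in re.split(r"\s+", text) if t]

# Scikit-Learn-style: match the words themselves
match_tokens = re.findall(r"\S+", text)

print(split_tokens == match_tokens)  # True for this input
```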
Great initiative, I've just formalized the request for the `TextIndex@wordRE` attribute as http://mantis.dmg.org/view.php?id=271
@vruusmann thank you for the in-depth explanation! I will take your advice and try to use a transformer to do some pre-processing on the text. I see that `FunctionTransformer` is supported, so I will give it a try.
I'm not particularly hopeful about DMG.org taking action on the proposed `TextIndex@wordRE` attribute. But it's really low-hanging fruit, and I can work on it without their approval. By convention, I'll prefix the attribute name with `x-` (to indicate its vendor-extension status).

It should be done & published in the next iteration (targeting mid-Jan 2021). It's in the top position in my TODO file.
The `TextIndex@x-wordRE` vendor extension attribute is available starting from today:
It's now possible to choose between two text tokenization modes.
First, the legacy/PMML tokenization mode, as implemented by the `sklearn2pmml.feature_extraction.text.Splitter` callable type. The text is split into tokens using the specified word separator RE, tokens are trimmed of leading and trailing punctuation characters, and empty tokens are discarded.
Example:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn2pmml.feature_extraction.text import Splitter

cv = CountVectorizer(token_pattern = None, tokenizer = Splitter("\\s+"))
```
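To see what that behaviour means in practice, here is a rough pure-Python approximation of the described splitting logic (an illustration only, not the actual sklearn2pmml implementation):

```python
import re
import string

def split_like_splitter(text, word_separator_re = r"\s+"):
    # Split on the word separator RE, trim leading/trailing punctuation
    # characters, and discard empty tokens, mirroring the documented behaviour.
    tokens = re.split(word_separator_re, text)
    tokens = [t.strip(string.punctuation) for t in tokens]
    return [t for t in tokens if t]

split_like_splitter("Hello, world! (really)")
# ['Hello', 'world', 'really']
```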
Second, the new/Scikit-Learn tokenization mode, as implemented by the `sklearn2pmml.feature_extraction.text.Matcher` callable type. The text is matched using the specified word RE, and empty tokens are discarded.
Example:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn2pmml.feature_extraction.text import Matcher

cv = CountVectorizer(token_pattern = None, tokenizer = Matcher("\\w+"))
```
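The matching mode is essentially `re.findall` with the word RE; a pure-Python approximation (again, an illustration rather than the actual implementation):

```python
import re

def match_like_matcher(text, word_re = r"\w+"):
    # Match the word RE directly against the text; for this kind of pattern,
    # re.findall returns only non-empty matches, so no filtering is needed.
    return re.findall(word_re, text)

match_like_matcher("Hello, world!")
# ['Hello', 'world']
```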
It is worth pointing out that the `TextIndex@x-wordRE` attribute enables support for the `CountVectorizer.token_pattern` attribute as well.
For example, the following two `CountVectorizer` instances are functionally identical:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn2pmml.feature_extraction.text import Matcher

cv1 = CountVectorizer()
# Backslashes must be escaped: in a plain string literal, "\b" is a backspace
# character, not the \b word-boundary assertion
cv2 = CountVectorizer(token_pattern = None, tokenizer = Matcher("(?u)\\b\\w\\w+\\b"))
```
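The equivalence can be sanity-checked at the regular-expression level: `(?u)\b\w\w+\b` is scikit-learn's documented default `token_pattern`, which keeps only tokens of two or more word characters:

```python
import re

# CountVectorizer's documented default token_pattern
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

re.findall(DEFAULT_TOKEN_PATTERN, "a bb ccc d")
# ['bb', 'ccc'] -- single-character tokens are dropped
```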
I'm trying to use a `CountVectorizer` in a `PMMLPipeline` to split a column's values on `##`, but when I call `sklearn2pmml(...)` on my pipeline, I get an error. My model builds fine. I've tried two different approaches, and each gets me a different error.

**Code**
**Approach 1: `token_pattern`**

```python
vectorizer = CountVectorizer(token_pattern=r'(?u)\b\w\w+/\w\w+\b')
```

Values look like this: `'heading/subheading##heading2/subheading2'`; there can be arbitrary heading/subheading values separated by `##`. This produces the first error. (`None` is the default for `tokenizer` here, according to scikit-learn.)

**Approach 2: `tokenizer`**
`counttokenizer.py`:

Now back in my main model-building Python file:

This produces the second error.
I created `counttokenizer.py` to ensure the tokenizer function gets pickled.

Environment setup:

Do you have any idea why I'm receiving these errors, or a workaround? Thank you in advance!