Closed: eddies5 closed this issue 3 years ago
I'm trying to use a `CountVectorizer` in a `PMMLPipeline` to split a column's values on `##`.
You're running into limitations that have been placed there intentionally, in order to ensure that the current Python representation and the future PMML representation behave exactly the same way.
Specifically, the only supported "splitting configuration" is the one that is hard-coded as the `sklearn2pmml.feature_extraction.text.Splitter` class:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn2pmml.feature_extraction.text import Splitter

vectorizer = CountVectorizer(tokenizer = Splitter())
```
If you want to achieve custom splitting behaviour (such as using `##` as the delimiter), then you'd need to do one extra pre-processing step on that text column first. For example, you could run a regex transform that replaces `##` with `" "` (the space character), so that the `sklearn2pmml.feature_extraction.text.Splitter` class can do its job.
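For illustration, here is a minimal sketch of that pre-processing step in plain Python (the `replace_delimiter` helper is hypothetical; in a real pipeline the replacement would have to be expressed as a PMML-convertible transformation, which is what the feature request below is about):

```python
import re

def replace_delimiter(text, delimiter = "##"):
    # Hypothetical pre-processing step: replace the custom "##" delimiter
    # with a space, so that a whitespace-based splitter can tokenize it.
    return re.sub(re.escape(delimiter), " ", text)

replace_delimiter("heading/subheading##heading2/subheading2")
# 'heading/subheading heading2/subheading2'
```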
Here's a related feature request about RegEx transformers: https://github.com/jpmml/jpmml-sklearn/issues/81
I am also running into this issue. The `tokenizer` is optional as long as you supply the `token_pattern` in sklearn:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/text.py#L340
@vruusmann what is restrictive on your end about adopting the same logic?
@amoldavsky The blocking matter is a conceptual incompatibility between Scikit-Learn and PMML. In RegEx terms:

- `\w+` (defines a "word")
- `\W+` (defines a "non-word")

The workaround appears to be extending PMML's `TextIndex` element with a new attribute that can capture a Scikit-Learn compatible "wordRE": http://dmg.org/pmml/v4-3/Transformations.html#xsdElement_TextIndex
Default/now:

```xml
<TextIndex wordSeparatorCharacterRE="\s+">
</TextIndex>
```

Vendor extension/future:

```xml
<TextIndex wordRE="\S+">
</TextIndex>
```
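To make the conceptual difference concrete (my own illustration, not part of the proposal): splitting on a word *separator* RE and matching the *words* themselves yield the same tokens on straightforward input, but they are expressed as complementary patterns:

```python
import re

text = "one two\tthree"

# PMML-style: split on word separators, discarding empty strings
split_tokens = [t for t in re.split(r"\s+", text) if t]

# Scikit-Learn-style: match the words themselves
match_tokens = re.findall(r"\S+", text)

print(split_tokens == match_tokens)  # True for this input
```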
Great initiative, I've just formalized the request for the `TextIndex@wordRE` attribute as http://mantis.dmg.org/view.php?id=271
@vruusmann thank you for the in-depth explanation! I will take your advice and try to use a transformer to do some pre-processing on the text. I see that `FunctionTransformer` is supported, so I will give it a try.
I'm not particularly hopeful about DMG.org taking action on the proposed `TextIndex@wordRE` attribute. But it's really low-hanging fruit, and I can work on it without their approval. By convention, I'll prefix the attribute name with `x-` (to indicate its vendor-extension status).

It should be done & published in the next iteration (targeting mid-Jan 2021). It's in the top position in my TODO file.
The `TextIndex@x-wordRE` vendor extension attribute is available starting from today:
It's now possible to choose between two text tokenization modes.
First, the legacy/PMML tokenization mode, as implemented by the `sklearn2pmml.feature_extraction.text.Splitter` callable type. The text is split into tokens using the specified word separator RE, tokens are trimmed of leading and trailing punctuation characters, and empty tokens are discarded.
Example:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn2pmml.feature_extraction.text import Splitter

cv = CountVectorizer(token_pattern = None, tokenizer = Splitter("\\s+"))
```
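To see what that behaviour means in practice, here is a rough pure-Python approximation of the described splitting logic (an illustration only, not the actual sklearn2pmml implementation):

```python
import re
import string

def split_like_splitter(text, word_separator_re = r"\s+"):
    # Split on the word separator RE, trim leading/trailing punctuation
    # characters, and discard empty tokens, mirroring the documented behaviour.
    tokens = re.split(word_separator_re, text)
    tokens = [t.strip(string.punctuation) for t in tokens]
    return [t for t in tokens if t]

split_like_splitter("Hello, world! (really)")
# ['Hello', 'world', 'really']
```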
Second, the new/Scikit-Learn tokenization mode, as implemented by the `sklearn2pmml.feature_extraction.text.Matcher` callable type. The text is matched using the specified word RE, and empty tokens are discarded.
Example:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn2pmml.feature_extraction.text import Matcher

cv = CountVectorizer(token_pattern = None, tokenizer = Matcher("\\w+"))
```
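The matching mode is essentially `re.findall` with the word RE; a pure-Python approximation (again, an illustration rather than the actual implementation):

```python
import re

def match_like_matcher(text, word_re = r"\w+"):
    # Match the word RE directly against the text; for this kind of pattern,
    # re.findall returns only non-empty matches, so no filtering is needed.
    return re.findall(word_re, text)

match_like_matcher("Hello, world!")
# ['Hello', 'world']
```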
It is worth pointing out that the `TextIndex@x-wordRE` attribute enables support for the `CountVectorizer.token_pattern` attribute as well.
For example, the following two `CountVectorizer` instances are functionally identical:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn2pmml.feature_extraction.text import Matcher

cv1 = CountVectorizer()
# Backslashes must be escaped: in a plain string literal, "\b" is a backspace
# character, not the \b word-boundary assertion
cv2 = CountVectorizer(token_pattern = None, tokenizer = Matcher("(?u)\\b\\w\\w+\\b"))
```
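The equivalence can be sanity-checked at the regular-expression level: `(?u)\b\w\w+\b` is scikit-learn's documented default `token_pattern`, which keeps only tokens of two or more word characters:

```python
import re

# CountVectorizer's documented default token_pattern
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

re.findall(DEFAULT_TOKEN_PATTERN, "a bb ccc d")
# ['bb', 'ccc'] -- single-character tokens are dropped
```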
I'm trying to use a `CountVectorizer` in a `PMMLPipeline` to split a column's values on `##`, but when I call `sklearn2pmml(...)` on my pipeline, I get an error. My model builds fine. I've tried two different approaches, and each gets me a different error.

**Code**
**Approach 1: `token_pattern`**

```python
vectorizer = CountVectorizer(token_pattern=r'(?u)\b\w\w+/\w\w+\b')
```

Values look like this: `'heading/subheading##heading2/subheading2'`; there can be arbitrary heading/subheading values separated by `##`. This produces the first error. (`None` is the default for `tokenizer` here, according to scikit-learn.)

**Approach 2: `tokenizer`**
`counttokenizer.py`:

Now back in my main model-building Python file:

This produces the second error.
I created `counttokenizer.py` to ensure the tokenizer function gets pickled.

Environment setup:

Do you have any idea why I'm receiving these errors, or a workaround? Thank you in advance!