Open selbenna opened 4 years ago
Let's solve one issue at a time. Right now, the SkLearn2PMML/JPMML-SkLearn stack is complaining about an unexpected RandomForestClassifier.classes_ attribute value.
Why are you fitting a LabelEncoder object separately (and then passing its transformation results to the (PMML)Pipeline.fit(X, y) method)? Why don't you simply do the following?
raw_data = pd.read_csv("data_columns.csv")
X = raw_data["name"]
y = raw_data["type"]
pipeline = PMMLPipeline([
("transformer", vectorizer),
("classifier",rfc)
])
pipeline.fit(X, y)
Thank you for your response!
I tried what you suggested and I get this error now:
Standard output is empty
Standard error:
janv. 08, 2020 9:47:22 AM org.jpmml.sklearn.Main run
INFOS: Parsing PKL..
janv. 08, 2020 9:48:16 AM org.jpmml.sklearn.Main run
INFOS: Parsed PKL in 53752 ms.
janv. 08, 2020 9:48:16 AM org.jpmml.sklearn.Main run
INFOS: Converting..
janv. 08, 2020 9:48:16 AM org.jpmml.sklearn.Main run
SEVERE: Failed to convert
**java.lang.IllegalArgumentException: char_wb**
at sklearn.feature_extraction.text.CountVectorizer.encodeDefineFunction(CountVectorizer.java:153)
at sklearn.feature_extraction.text.TfidfVectorizer.encodeDefineFunction(TfidfVectorizer.java:84)
at sklearn.feature_extraction.text.CountVectorizer.encodeFeatures(CountVectorizer.java:115)
at sklearn.feature_extraction.text.TfidfVectorizer.encodeFeatures(TfidfVectorizer.java:77)
at sklearn.Transformer.updateAndEncodeFeatures(Transformer.java:118)
at sklearn.Composite.encodeFeatures(Composite.java:129)
at sklearn2pmml.pipeline.PMMLPipeline.encodePMML(PMMLPipeline.java:208)
at org.jpmml.sklearn.Main.run(Main.java:145)
at org.jpmml.sklearn.Main.main(Main.java:94)
Exception in thread "main" java.lang.IllegalArgumentException: char_wb
at sklearn.feature_extraction.text.CountVectorizer.encodeDefineFunction(CountVectorizer.java:153)
at sklearn.feature_extraction.text.TfidfVectorizer.encodeDefineFunction(TfidfVectorizer.java:84)
at sklearn.feature_extraction.text.CountVectorizer.encodeFeatures(CountVectorizer.java:115)
at sklearn.feature_extraction.text.TfidfVectorizer.encodeFeatures(TfidfVectorizer.java:77)
at sklearn.Transformer.updateAndEncodeFeatures(Transformer.java:118)
at sklearn.Composite.encodeFeatures(Composite.java:129)
at sklearn2pmml.pipeline.PMMLPipeline.encodePMML(PMMLPipeline.java:208)
at org.jpmml.sklearn.Main.run(Main.java:145)
at org.jpmml.sklearn.Main.main(Main.java:94)
What do you think? Thanks!
java.lang.IllegalArgumentException: char_wb
That's the exception we've been looking for - it means that the "char_wb" text analyzer type is currently not supported.
Here's what can be done about it:
1 - I already tried with the "char" text analyzer and I get the same error.
2 - I thought about this idea but wasn't sure it was going to work. Thanks for suggesting it, I'm going to try it and will let you know :)
Thanks
1 - I already tried with the "char" text analyzer and I get the same error.
The only supported text analyzer mode is "word".
Try to transform your text input column so that it can be regarded as a collection of words. The simplest solution would be to use a regex that surrounds every "useful" character with a whitespace character.
It should be possible to upgrade the SkLearn2PMML/JPMML-SkLearn stack to support "char" and "char_wb" text analyzers as well. However, this is not a priority for me.
Leaving this issue open, in case my priorities change.
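As a rough illustration of the whitespace idea in plain Python (outside of any PMML pipeline), the sketch below inserts a space between every pair of adjacent characters, so that a "word" analyzer would see each character as a separate token. The pattern and the helper name `space_out` are assumptions made for this example, not something taken from SkLearn2PMML.

```python
import re

# Zero-width pattern: matches every position that has a character on both sides.
BETWEEN_CHARS = re.compile(r"(?<=.)(?=.)")

def space_out(text):
    """Insert a single space between every pair of adjacent characters."""
    return BETWEEN_CHARS.sub(" ", text)

spaced = space_out("user_id")  # hypothetical column name
print(spaced)
```

For single-line strings this gives the same result as `" ".join(text)`, but expressed as a regex replacement.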
I used a lambda function to split the column names into characters like this:
raw_data["name"] = raw_data["name"].apply(lambda word: " ".join(word))
x = raw_data["name"]
Then I used the pipeline above with the 'word' analyzer and I don't have the error anymore.
But I don't know how to include this preprocessing in the pipeline. Can I add this lambda function to the PMMLPipeline using the ExpressionTransformer feature?
Could you please help?
Thanks
Can I add this lambda function in the PMMLpipeline using ExpressionTransformer feature?
Probably not, because your lambda uses Python language constructs/functions that are not supported by ExpressionTransformer yet.
It should be possible using special-purpose string transformers. I'd try to formalize a regular expression (regex) pattern, and apply it to the original (aka raw) text feature using the sklearn2pmml.preprocessing.ReplaceTransformer.
Some regex that inserts whitespace characters into a string.
I found a regex pattern to insert whitespace characters:
regex_pattern = "(?<!^)(\B|b)(?!$)"
transformer = ReplaceTransformer(regex_pattern, " ")
vectorizer = TfidfVectorizer(analyzer = "word", ngram_range = (1, 2), preprocessor = None, lowercase = False,
tokenizer = Splitter(), norm = None)
pipeline = PMMLPipeline([
("transformer", transformer),
("preprocessing", vectorizer),
("classifier", rfc)
])
pipeline.fit(x, y)
But I get this error:
TypeError: cannot use a string pattern on a bytes-like object
How should I fit the data into this pipeline?
Thanks a lot for your help.
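A quick way to see what such a pattern does before wiring it into the pipeline is to try it with Python's `re` module. This is only a sketch: it uses the variant of the pattern with `\B` alone, the sample names are made up, and the regex engine used on the PMML side may not behave identically to Python's.

```python
import re

# The \B-only variant of the pattern: a zero-width match at every
# non-boundary position that is neither the start nor the end of the string.
pattern = re.compile(r"(?<!^)(\B)(?!$)")

def insert_spaces(name):
    """Replace every match of the zero-width pattern with a single space."""
    return pattern.sub(" ", name)

for sample in ["name", "first-name"]:  # made-up column names
    print(repr(insert_spaces(sample)))
```

Note that `\B` never matches between a word character and a non-word character, so in the second sample the hyphen stays attached to its neighbours.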
but i get this error:
That's a 100% Python language stack error (not a (J)PMML one). It means that you're mixing/confusing str and bytes data types somewhere.
Perhaps you need to convert a bytes object to a str object by specifying what the intended character encoding is. Also, upgrading from Python 2.7 to Python 3.X might solve the issue.
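For reference, the same message can be reproduced with the standard library alone. The sketch below is not the pipeline from this thread; it only shows a str pattern being applied to a bytes object, and how decoding with an explicit encoding (UTF-8 is assumed here) avoids the error.

```python
import re

word_pattern = re.compile(r"\w+")  # a str (text) pattern

raw = b"column name"  # bytes, e.g. data read without decoding

try:
    word_pattern.findall(raw)
except TypeError as error:
    # The str pattern refuses to scan a bytes-like object.
    message = str(error)
    print(message)

# Decoding with an explicit encoding yields a str that the pattern accepts.
tokens = word_pattern.findall(raw.decode("utf-8"))
print(tokens)
```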
I found the source of the error. When I feed the output of the transformer, which is an ndarray object, to the TfidfVectorizer, I get:
TypeError: cannot use a string pattern on a bytes-like object
So I convert the ndarray to a Series:
transformed_data = transformer.fit_transform(x)
array_to_series = map(lambda x: x[0], transformed_data)
ser = pd.Series(array_to_series)
I apply the TfidfVectorizer to ser
X = vectorizer.fit_transform(ser)
and I don't get the error anymore.
So my question is: how can I convert the ndarray output of the transformer to a Series object inside the pipeline, so that it can be fed to the vectorizer?
pipeline = PMMLPipeline([ ("transformer", transformer), ("preprocessing", vectorizer), ("classifier",rfc) ])
Should I insert another transformation between the transformer and the vectorizer?
Thank you!
Are you using the latest SkLearn2PMML package version? The ReplaceTransformer transformer should currently be returning a single-column 2-D NumPy array (the _col2d(X) utility method):
https://github.com/jpmml/sklearn2pmml/blob/0.52.1/sklearn2pmml/preprocessing/__init__.py#L295
I wonder why the TfidfVectorizer step doesn't like it. Perhaps the ReplaceTransformer transformer should use some other return type/configuration.
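One possible explanation, sketched with NumPy alone and a made-up array: iterating over a single-column 2-D array yields one-element row arrays rather than plain strings, whereas a flattened 1-D view yields the strings themselves. Whether this is what actually trips up the TfidfVectorizer step here is an assumption, not something confirmed in this thread.

```python
import numpy as np

# Made-up stand-in for a single-column 2-D array of strings.
col2d = np.array(["n a m e", "t y p e"], dtype=object).reshape(-1, 1)

# Iterating over the 2-D array yields rows, each a 1-D ndarray of length 1.
row_types = [type(row).__name__ for row in col2d]

# Flattening gives a 1-D array whose items are the strings themselves.
flat = col2d.ravel()
item_types = [type(item).__name__ for item in flat]

print(col2d.shape, row_types)
print(flat.shape, item_types)
```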
how can I do this conversion from ndarray output of the transformer to a series object to feed it the vectorizer in the pipeline
Have you tried wrapping the whole feature engineering into sklearn_pandas.DataFrameMapper or sklearn.compose.ColumnTransformer? These meta-transformers are pretty good at reshaping data between steps.
Something like this:
pipeline = PMMLPipeline([
("mapper", DataFrameMapper([
(["name"], [ReplaceTransformer(..), TfidfVectorizer(..)])
])),
("classifier", RandomForestClassifier())
])
Yes, I'm using the latest SkLearn2PMML. The ReplaceTransformer returns a 2-D NumPy array, and that's the problem: the TfidfVectorizer doesn't like it.
I tried:
pipeline = PMMLPipeline([
("mapper", DataFrameMapper([
(["name"], [ReplaceTransformer("(?<!^)(\B)(?!$)", " "), vectorizer])
])),
("classifier", rfc)
])
pipeline.fit(raw_data, y)
But I get this error:
TypeError: ['name']: cannot use a string pattern on a bytes-like object
I tried also with ColumnTransformer and I get the same error.
Is there a way I could convert the output of _col2d(x) using this function inside the pipeline?
array_to_series = map(lambda x: x[0], transform(x))
ser = pd.Series(array_to_series)
This ser variable would then be fed to the vectorizer:
vectorizer.fit_transform(ser)
Thank you!
Hi Villu,
I'm trying to create a PMML file from the sklearn model below. I use TfidfVectorizer at the character level with a random forest classifier. The model predicts the data type of a column based on the column's name. It works fine with the configuration below, but when I create a PMML pipeline I get an error.
Here's my code:
I get this error:
Could you please tell me what's wrong in the vectorizer? Or how could I use character level analyzer in TfidfVectorizer? Thank you,
Sarah