Closed kornilcdima closed 3 years ago
On Java it throws an error when a new value appears: org.jpmml.evaluator.InvalidResultException: Field "Netspeed" cannot accept user input value "new_value"
This invalid value check is performed in relation to DataField elements (primary input). You are generating DerivedField elements (secondary input).
It is assumed that only the primary input data can be invalid (eg. the CSV cell contains a mis-spelled value). It does not make sense to generate invalid values intentionally during data pre-processing.
Once again, what kind of OrdinalEncoder class are you using here? Is it sklearn.preprocessing.OrdinalEncoder or categorical_encoders.OrdinalEncoder?
Once again, what kind of OrdinalEncoder class are you using here? Is it sklearn.preprocessing.OrdinalEncoder or categorical_encoders.OrdinalEncoder?
sklearn.preprocessing.OrdinalEncoder
On Java it throws an error when a new value appears: org.jpmml.evaluator.InvalidResultException: Field "Netspeed" cannot accept user input value "new_value"
This invalid value check is performed in relation to DataField elements (primary input). You are generating DerivedField elements (secondary input).
It is assumed that only the primary input data can be invalid (eg. the CSV cell contains a mis-spelled value). It does not make sense to generate invalid values intentionally during data pre-processing.
Sorry, I don't follow. In my case the PMML pipeline receives a column that contains new values (values the model never saw during training), and the pipeline breaks. As I understand it, it shouldn't: the expected behavior is that PMML replaces unknown values with -1. When I do the same thing in Python, everything works, but that is not the case with PMML.
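For reference, the Python-side behavior described above can be reproduced with a minimal sketch. It assumes scikit-learn 0.24 or later and an encoder configured with handle_unknown="use_encoded_value" and unknown_value=-1; the thread does not show the actual encoder settings, so both are assumptions. The two training categories are taken from the DataField listing in this thread.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Assumed configuration; the original encoder settings are not shown in the thread.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

# Fit on the two categories that appear in the exported DataField.
encoder.fit(np.array([["CABLE_DSL"], ["CELLULAR"]], dtype=object))

# "new_value" was never seen during fitting, so it is encoded as -1
# instead of raising an error.
encoded = encoder.transform(np.array([["CELLULAR"], ["new_value"]], dtype=object))
print(encoded.ravel().tolist())  # [1.0, -1.0]
```

This only shows what scikit-learn does on its own; it says nothing about how the converted PMML document treats the same input.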
Other fields of the pmml-file:
<MiningField name="Netspeed" importance="52.0" missingValueReplacement="missing_value" missingValueTreatment="asValue"/>
<DataField name="Netspeed" optype="categorical" dataType="string">
<Value value="-1" property="missing"/>
<Value value="CABLE_DSL"/>
<Value value="CELLULAR"/>
</DataField>
I guess I should get something like this?
<MiningField name="Netspeed" importance="59.0" missingValueReplacement="Unknown" missingValueTreatment="asIs" invalidValueTreatment="asMissing"/>
I am able to get this only with the following processing.
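import lightgbm as lgb
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn2pmml.decoration import CategoricalDomain
from sklearn2pmml.pipeline import PMMLPipeline
from sklearn2pmml.preprocessing import PMMLLabelEncoder

# cat_cols, num_cols, X, y and m are defined elsewhere (not shown in the thread).
# Categorical columns: treat invalid values as missing, then label-encode.
cat_transformer = PMMLPipeline(steps=[
    ('cd', CategoricalDomain(invalid_value_treatment="as_missing", missing_value_replacement="Unknown")),
    ('oe', PMMLLabelEncoder(missing_values=-1))
])
# Numeric columns: impute missing values with the median.
num_transformer = PMMLPipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
])
transformers = [(f'{i}_tr', cat_transformer, [i]) for i in cat_cols] \
    + [(f'{i}_tr', num_transformer, [i]) for i in num_cols]
model = lgb.LGBMRegressor(**m.model_hyperprams)
preprocessor = ColumnTransformer(transformers=transformers, remainder='passthrough')
pipeline = PMMLPipeline([
    ('preprocessor', preprocessor),
    ('model', model)
])
pipeline.fit(X, y)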
Sorry, I don't follow.
TLDR: You should perform the validity check just once, when the data enters the pipeline. There is no point in accepting invalid data, performing a transformation on it (e.g. ordinal encoding) and then checking whether the transformation result is valid or not.
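The ordering described above can be illustrated with a small pure-Python sketch. It is not JPMML code: VALID_VALUES, REPLACEMENT, sanitize and encode are hypothetical names, and the "as_missing" treatment with an "Unknown" replacement simply mirrors the CategoricalDomain settings quoted in this thread. The point is only that the check happens once, on the raw input, so the encoder never sees a value outside its domain.

```python
# Hypothetical stand-in for the valid values of the "Netspeed" field,
# plus the replacement category used for missing input.
VALID_VALUES = ["CABLE_DSL", "CELLULAR"]
REPLACEMENT = "Unknown"
CODES = {value: code for code, value in enumerate(VALID_VALUES + [REPLACEMENT])}


def sanitize(value, invalid_value_treatment="return_invalid"):
    """Validate a raw input value once, at the pipeline boundary."""
    if value is not None and value not in VALID_VALUES:
        if invalid_value_treatment == "as_missing":
            value = None  # downgrade the invalid value to a missing one
        else:
            raise ValueError(f"invalid input value: {value!r}")
    if value is None:
        value = REPLACEMENT  # missing value replacement
    return value


def encode(value):
    """Ordinal-encode a value that has already been sanitized."""
    return CODES[value]


print(encode(sanitize("CELLULAR")))                 # 1
print(encode(sanitize("new_value", "as_missing")))  # 2
```

With the default treatment, sanitize("new_value") raises at the boundary, which is the analogue of the InvalidResultException reported in this thread; with "as_missing", the same input is mapped to the replacement category before any encoding takes place.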
And the PMML-pipeline breaks. However, it shouldn't as I understand.
The JPMML-SkLearn library is not taking the OrdinalEncoder.handle_unknown attribute into consideration right now.
At minimum, it should throw an IllegalArgumentException just to inform you that this piece of Scikit-Learn functionality cannot be used.
The expected behavior is that PMML replaces unknown values with -1.
Feel free to submit a PR.
I guess I should get something like this?
You clearly do not understand the difference between missing values and invalid values. Please consult with the basics here: http://dmg.org/pmml/v4-3/MiningSchema.html
This issue looks like a duplicate of https://github.com/jpmml/sklearn2pmml/issues/289
@kornilcdima If you want me to pay attention to something, then you can't just open/close/open issues randomly. This is your last warning.
Thank you for your time and fast answers. The issue is not a duplicate. Maybe it looks like one because it concerns the same part of the code as my earlier question. But this question is about a different problem, which I unfortunately failed to solve. In the previous issue I was asking about handling missing values in the input, which was not directly connected with the encoder. Of course you are the moderator here, but I wouldn't give these two topics the same title.
Hello @vruusmann,
I ran into a problem. When I convert an sklearn pipeline to a PMML pipeline, the option for handling unknown values in categorical columns disappears. Do you know how to deal with this? I'd appreciate any help.
sklearn-pipeline:
When I look at pmml-file I don't see that it handles unknown values.
Everything works as expected on Python but it is not the same on Java (jpmml-pipeline). On Java it throws an error when a new value appears:
org.jpmml.evaluator.InvalidResultException: Field "Netspeed" cannot accept user input value "new_value"