jpmml / jpmml-sklearn

Java library and command-line application for converting Scikit-Learn pipelines to PMML
GNU Affero General Public License v3.0

Detect and reject duplicate features #155

Closed HelloLadsAndGents closed 3 years ago

HelloLadsAndGents commented 3 years ago

This is the Java file: PMMLPrediction_java.zip

This is the unusable PMML file (it can be loaded in Python but not in Java): model_lgb_end_new.zip

This is the usable PMML file (it works in both Java and Python), included to show that the program itself works: model_xgb_end.zip

This is the error log:

Caused by: org.jpmml.evaluator.InvalidElementException: Element DataField is not valid
    at org.jpmml.evaluator.IndexableUtil.buildMap(IndexableUtil.java:65)
    at org.jpmml.evaluator.IndexableUtil.buildMap(IndexableUtil.java:54)
    at org.jpmml.evaluator.ModelManager$2.load(ModelManager.java:620)
    at org.jpmml.evaluator.ModelManager$2.load(ModelManager.java:616)
    at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3444)
    at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2193)
    at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2152)
    at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2042)
    at com.google.common.cache.LocalCache.get(LocalCache.java:3850)
    at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3874)
    at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4799)
    at org.jpmml.evaluator.CacheUtil.getValue(CacheUtil.java:51)
    at org.jpmml.evaluator.ModelManager.<init>(ModelManager.java:117)
    at org.jpmml.evaluator.ModelEvaluator.<init>(ModelEvaluator.java:88)
    at org.jpmml.evaluator.mining.MiningModelEvaluator.<init>(MiningModelEvaluator.java:114)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.jpmml.evaluator.ModelManagerFactory.newModelManager(ModelManagerFactory.java:100)
    at org.jpmml.evaluator.ModelEvaluatorFactory.newModelEvaluator(ModelEvaluatorFactory.java:38)
    at org.jpmml.evaluator.ModelEvaluatorBuilder.build(ModelEvaluatorBuilder.java:115)
    at com.upsmart.common.utils.PMMLPrediction.(PMMLPrediction.java:31)
    ... 10 more

Any ideas about this situation? Thanks.

vruusmann commented 3 years ago

> Caused by: org.jpmml.evaluator.InvalidElementException: Element DataField is not valid

This exception is thrown when the PMML file contains two or more DataField elements that have the same name. This is not permitted, just like any regular programming language will not let you declare two or more variables with the same name (in the current scope).
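The uniqueness rule described above can be checked before handing a file to an evaluator. The following is a minimal sketch, not part of any JPMML library: the helper name `find_duplicate_data_fields` and the sample PMML fragment are hypothetical, and it only looks at `DataField` names, using the Python standard library.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def find_duplicate_data_fields(pmml_text):
    """Return the DataField names that occur more than once, sorted."""
    root = ET.fromstring(pmml_text)
    names = []
    for element in root.iter():
        # PMML elements are namespaced (e.g. http://www.dmg.org/PMML-4_4), so
        # compare local names only to stay independent of the schema version.
        local_name = element.tag.rsplit("}", 1)[-1]
        name = element.get("name")
        if local_name == "DataField" and name is not None:
            names.append(name)
    counts = Counter(names)
    return sorted(name for name, count in counts.items() if count > 1)


# Hypothetical, minimal PMML fragment for illustration only.
SAMPLE_PMML = """<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
  <DataDictionary>
    <DataField name="Sepal.Length" optype="continuous" dataType="double"/>
    <DataField name="Petal.Length" optype="continuous" dataType="double"/>
    <DataField name="Sepal.Length" optype="continuous" dataType="double"/>
  </DataDictionary>
</PMML>"""

print(find_duplicate_data_fields(SAMPLE_PMML))
```

An empty list means no duplicate `DataField` names were found. For a file on disk, the text can be read first and passed to the same helper.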

> This is the unusable PMML file (it can be loaded in Python but not in Java): model_lgb_end_new.zip

This PMML file contains duplicate DataField elements.

It must have been generated manually, because I refuse to believe that the JPMML-SkLearn/JPMML-Python/JPMML-Converter library stack will allow such a thing to happen.

For example:
https://github.com/jpmml/jpmml-converter/blob/1.4.4/src/main/java/org/jpmml/converter/PMMLEncoder.java#L101-L105
https://github.com/jpmml/jpmml-converter/blob/1.4.4/src/main/java/org/jpmml/converter/PMMLEncoder.java#L284-L296

All correct PMML engines should refuse to process this PMML file.

> This is the usable PMML file (it works in both Java and Python), included to show that the program itself works: model_xgb_end.zip

This PMML file does not contain duplicate DataField elements.

vruusmann commented 3 years ago

> It must have been generated manually, because I refuse to believe that the JPMML-SkLearn/JPMML-Python/JPMML-Converter library stack will allow such a thing to happen.

And yet DataFrameMapper somehow finds a way to maneuver around these checks:

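from sklearn.tree import DecisionTreeClassifier
from sklearn_pandas import DataFrameMapper
from sklearn2pmml.pipeline import PMMLPipeline

# iris_X is assumed to be a pandas DataFrame with the columns named below,
# and iris_y the matching target column.
pipeline = PMMLPipeline([
    ("mapper", DataFrameMapper([
        ("Sepal.Length", None),
        ("Petal.Length", None),
        ("Sepal.Length", None)  # same column mapped a second time
    ])),
    ("classifier", DecisionTreeClassifier())
])
pipeline.fit(iris_X, iris_y)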

The converter should at least issue a warning here about a duplicate Sepal.Length feature.
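Until the converter does this itself, a similar check can be run on the Python side before fitting. This is only a sketch under stated assumptions: `warn_on_duplicate_columns` is a hypothetical helper, not a function of sklearn-pandas or sklearn2pmml, and it expects the same list of `(columns, transformer)` tuples that is passed to `DataFrameMapper`.

```python
import warnings
from collections import Counter


def warn_on_duplicate_columns(features):
    """Warn about every column that is selected more than once.

    `features` is a list of (columns, transformer) tuples, where `columns`
    is either a single column name or a list of column names.
    """
    columns = []
    for feature in features:
        selector = feature[0]
        if isinstance(selector, (list, tuple)):
            columns.extend(selector)
        else:
            columns.append(selector)
    counts = Counter(columns)
    duplicates = sorted(name for name, count in counts.items() if count > 1)
    for name in duplicates:
        warnings.warn(f"Duplicate feature: {name}")
    return duplicates


features = [
    ("Sepal.Length", None),
    ("Petal.Length", None),
    ("Sepal.Length", None),
]
print(warn_on_duplicate_columns(features))
```

Calling the helper on the feature list before constructing the mapper surfaces the duplicate `Sepal.Length` entry early, rather than at PMML evaluation time.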

vruusmann commented 3 years ago

I think that DecisionTreeClassifier and XGBClassifier do not care about duplicate field declarations, because they contain some kind of feature redundancy detection mechanism (only use the first Sepal.Length column, ignore the second Sepal.Length column).

Apparently, LGBMClassifier does not contain such protection, and happily uses all available columns.

HelloLadsAndGents commented 3 years ago

Thanks, yes, you are right: we did change the PMML file manually because some features had changed, and modifying the PMML file was more convenient for us. A tip: maybe the error log could be more specific for this situation? And thank you for your professional advice.

I still can't figure out why the same LGBM PMML file can be loaded in Python but not in Java.

vruusmann commented 3 years ago

> We did change the PMML file manually because some features had changed

When exploring the contents of the /PMML/MiningBuildTask element, it becomes clear that your Python data science workflow is problematic, because it explicitly includes the same column multiple times:

DataFrameMapper([
  ('CP7227', None),
  ..
  ('CP7227', None)
])

You should delete the duplicate column mapping in your LightGBM pipeline, and everything should be resolved.
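One way to clean up such a mapping is to drop repeated entries before building the mapper. The sketch below is an illustration only: `drop_duplicate_mappings` is a hypothetical helper, and `CP0001` is a made-up column name standing in for the elided entries. It keeps the first occurrence of each column selector and discards later ones.

```python
def drop_duplicate_mappings(features):
    """Keep only the first (columns, transformer) tuple for each column selector."""
    seen = set()
    unique = []
    for feature in features:
        selector = feature[0]
        # Lists are not hashable, so convert list selectors to tuples.
        key = tuple(selector) if isinstance(selector, list) else selector
        if key in seen:
            continue
        seen.add(key)
        unique.append(feature)
    return unique


features = [
    ("CP7227", None),
    ("CP0001", None),  # hypothetical column, for illustration only
    ("CP7227", None),
]
print(drop_duplicate_mappings(features))
```

The result can be passed to `DataFrameMapper` in place of the original list. Note that this only removes exact repeats of a selector; if the same column is mapped twice with different transformers, deciding which one to keep is a modelling choice that the sketch does not make.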

> I still can't figure out why the same LGBM PMML file can be loaded in Python but not in Java

The screenshot shows that you're using a PyPMML package there. The explanation is that it doesn't check the correctness of PMML files, and will happily eat all sorts of garbage you feed it.

If you need to score PMML files in Python, please consider switching to the JPMML-Evaluator-Python package. Both JPMML-Evaluator-Python and PyPMML are thin Python language wrappers around Java/Scala libraries.

HelloLadsAndGents commented 3 years ago

Thank you so much !

HelloLadsAndGents commented 3 years ago

Something more interesting happened. I have two PMML files, one using LGBM and the other using XGBoost.

Data for LGBM: lgb_test1112.zip; related PMML file: model_lgb_end.zip

The file lgb_test1112 is a normal test dataset for model_lgb_end.pmml, and the file lgb_test1112_1 contains only one unrelated tag, which is not in model_lgb_end.pmml.

The strange part is that I still get a result with lgb_test1112_1. I can even use the XGBoost data as input for model_lgb_end.pmml, which is how I found the exception.

Data for XGBoost: xgb_test1112.zip; related PMML file: model_xgb_end.zip

Code:

import pandas
from pypmml import Model

# Test datasets for the LGBM and XGBoost models
df1 = pandas.read_csv("lgb_test1112_1")
df2 = pandas.read_csv("xgb_test1112")

model1 = Model.fromFile("model_lgb_end_11.pmml")
model2 = Model.fromFile("model_xgb_end(1).pmml")

# Score each model on its own dataset
model1.predict(df1)
model2.predict(df2)

# Score each model on the other model's dataset
model1.predict(df2)
model2.predict(df1)

Does this still mean that XGBClassifier and LGBMClassifier have no protection for input data?

vruusmann commented 3 years ago

> from pypmml import Model

@HelloLadsAndGents Please do not post PyPMML issues to the JPMML project (anywhere under https://github.com/jpmml).

This is the final warning. If you keep posting PyPMML issues here, you will be blocked for spamming.

HelloLadsAndGents commented 3 years ago

got it

HelloLadsAndGents commented 3 years ago

related to https://github.com/autodeployai/pypmml/issues/19

vruusmann commented 3 years ago

> And yet DataFrameMapper somehow finds a way to maneuver around these checks

Despite the above comment (https://github.com/jpmml/jpmml-sklearn/issues/155#issuecomment-723248300), I'm unable to reproduce this behaviour using JPMML-SkLearn version 1.6.7.

How is it possible to bypass this DataField element uniqueness check? https://github.com/jpmml/jpmml-sklearn/blob/1.6.7/src/main/java/sklearn_pandas/DataFrameMapper.java#L63-L66