Closed HelloLadsAndGents closed 3 years ago
Caused by: org.jpmml.evaluator.InvalidElementException: Element DataField is not valid
This exception is thrown when the PMML file contains two or more DataField elements that have the same name. This is not permitted, just as a regular programming language will not let you declare two or more variables with the same name in the same scope.
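One way to confirm this before handing a file to the evaluator is to scan it for repeated DataField names. The following is a minimal sketch that uses only the Python standard library. The helper name `find_duplicate_data_fields` and the inline sample document are assumptions made for illustration; they are not part of any JPMML library.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def find_duplicate_data_fields(pmml_text):
    """Return the sorted DataField names that occur more than once in the document."""
    root = ET.fromstring(pmml_text)
    names = [
        elem.get("name")
        for elem in root.iter()
        # Compare local names only, so that any PMML schema version (namespace) matches
        if elem.tag.rsplit("}", 1)[-1] == "DataField"
    ]
    return sorted(name for name, count in Counter(names).items() if count > 1)


# Illustrative fragment only, not a complete PMML document
sample = """<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
  <DataDictionary>
    <DataField name="Sepal.Length" optype="continuous" dataType="double"/>
    <DataField name="Petal.Length" optype="continuous" dataType="double"/>
    <DataField name="Sepal.Length" optype="continuous" dataType="double"/>
  </DataDictionary>
</PMML>"""

print(find_duplicate_data_fields(sample))  # ['Sepal.Length']
```

An empty list means that no DataField name is repeated.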
This is the unusable PMML file (it can be loaded by Python but not by Java): model_lgb_end_new.zip
This PMML file contains duplicate DataField elements.
It must have been generated manually, because I refuse to believe that the JPMML-SkLearn/JPMML-Python/JPMML-Converter library stack would allow such a thing to happen.
For example: https://github.com/jpmml/jpmml-converter/blob/1.4.4/src/main/java/org/jpmml/converter/PMMLEncoder.java#L101-L105 https://github.com/jpmml/jpmml-converter/blob/1.4.4/src/main/java/org/jpmml/converter/PMMLEncoder.java#L284-L296
All correct PMML engines should refuse to process this PMML file.
This is the usable PMML file (it works in both Java and Python), to prove that the program works: model_xgb_end.zip
This PMML file does not contain duplicate DataField elements.
It must have been generated manually, because I refuse to believe that the JPMML-SkLearn/JPMML-Python/JPMML-Converter library stack would allow such a thing to happen.
And yet DataFrameMapper somehow finds a way to maneuver around these checks:
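from sklearn.tree import DecisionTreeClassifier
from sklearn_pandas import DataFrameMapper
from sklearn2pmml.pipeline import PMMLPipeline

# iris_X and iris_y are the Iris feature DataFrame and target column, loaded beforehand
pipeline = PMMLPipeline([
    ("mapper", DataFrameMapper([
        ("Sepal.Length", None),
        ("Petal.Length", None),
        ("Sepal.Length", None)  # the same column is mapped a second time
    ])),
    ("classifier", DecisionTreeClassifier())
])
pipeline.fit(iris_X, iris_y)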
The converter should at least issue a warning here about the duplicate Sepal.Length feature.
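Until the converter does that, a similar check can be run on the Python side before the pipeline is fitted. The following is a minimal sketch, assuming that every column selector in the feature list is a plain string (list selectors are not handled). The helper name `warn_on_duplicate_columns` is hypothetical and is not part of sklearn_pandas or sklearn2pmml.

```python
import warnings
from collections import Counter


def warn_on_duplicate_columns(features):
    """Warn about column names that appear more than once in a DataFrameMapper-style feature list.

    `features` is a list of (column, transformer, ...) tuples.
    Returns the list of duplicated column names.
    """
    counts = Counter(column for column, *_ in features)
    duplicates = [column for column, count in counts.items() if count > 1]
    for column in duplicates:
        warnings.warn(f"Column {column!r} is mapped {counts[column]} times")
    return duplicates


features = [
    ("Sepal.Length", None),
    ("Petal.Length", None),
    ("Sepal.Length", None),
]

print(warn_on_duplicate_columns(features))  # ['Sepal.Length']
```

The same list can then be passed to DataFrameMapper unchanged, so the check does not alter the pipeline definition.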
I think that DecisionTreeClassifier and XGBClassifier do not care about duplicate field declarations, because they contain some kind of feature redundancy detection mechanism (they use only the first Sepal.Length column and ignore the second Sepal.Length column).
Apparently, LGBMClassifier does not contain such protection, and happily uses all available columns.
Thanks, yes, you are right. We do change the PMML file manually because some features have changed, and modifying the PMML file is more convenient for us. A tip: maybe the error log could be more specific for this situation? And thank you for your professional advice.
I still can't figure out why the same LGBM PMML file can be loaded by Python but not by Java.
We do change the PMML file manually because some features have changed
When exploring the contents of the /PMML/MiningBuildTask element, it becomes clear that your Python data science workflow is problematic, because it explicitly includes the same column multiple times:
DataFrameMapper([
    ('CP7227', None),
    # ... other column mappings ...
    ('CP7227', None)
])
You should delete the duplicate column mapping in your LightGBM pipeline, and everything should be resolved.
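If the feature list is built programmatically, the repeated mapping can also be dropped in code. The following is a minimal sketch that keeps the first mapping for each column name. The helper name `drop_duplicate_mappings` and the column name `CP0001` are hypothetical and serve only as illustration.

```python
def drop_duplicate_mappings(features):
    """Keep only the first mapping for each column name, preserving the original order."""
    seen = set()
    unique = []
    for feature in features:
        column = feature[0]
        if column not in seen:
            seen.add(column)
            unique.append(feature)
    return unique


# 'CP0001' is a made-up column name standing in for the elided mappings
features = [("CP7227", None), ("CP0001", None), ("CP7227", None)]

print(drop_duplicate_mappings(features))  # [('CP7227', None), ('CP0001', None)]
```

Keeping the first occurrence is an arbitrary choice here; if the duplicate mappings use different transformers, they need to be reviewed by hand.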
I still can't figure out why the same LGBM PMML file can be loaded by Python but not by Java.
The screenshot shows that you're using the PyPMML package there. The explanation is that it doesn't check the correctness of PMML files, and will happily eat all sorts of garbage you feed it.
If you need to score PMML files in Python, please consider switching to the JPMML-Evaluator-Python package. Both JPMML-Evaluator-Python and PyPMML are thin Python language wrappers around Java/Scala libraries.
Thank you so much !
Something more interesting happened. I have two PMML files, one using LightGBM and the other using XGBoost.
Data for LGB: lgb_test1112.zip. Related PMML file: model_lgb_end.zip
The file lgb_test1112 is a normal test dataset for model_lgb_end.pmml, and the file lgb_test1112_1 contains only one unrelated tag, which is not in model_lgb_end.pmml.
The strange part is that I can get a result with lgb_test1112_1. I can even use the data for XGB as input for model_lgb_end.pmml, and that's why I found the exception.
Data for XGB: xgb_test1112.zip. Related PMML file: model_xgb_end.zip
Code:
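import pandas
from pypmml import Model

# Load the two test datasets
df1 = pandas.read_csv("lgb_test1112_1")
df2 = pandas.read_csv("xgb_test1112")

# Load the LightGBM and XGBoost models
model1 = Model.fromFile("model_lgb_end_11.pmml")
model2 = Model.fromFile("model_xgb_end(1).pmml")

# Score each model with its own dataset
model1.predict(df1)
model2.predict(df2)

# Score each model with the other model's dataset
model1.predict(df2)
model2.predict(df1)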
Does this mean that XGBClassifier and LGBMClassifier have no protection against mismatched input data?
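Whichever scoring library is used, the columns of a dataset can be compared with the fields that the model declares before scoring. The following is a minimal sketch that uses only the Python standard library and assumes that the first MiningSchema element in document order belongs to the top-level model. The helper names `active_field_names` and `compare_columns` and the trimmed sample fragment are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def active_field_names(pmml_text):
    """Return the names of the active MiningField elements in the first MiningSchema."""
    root = ET.fromstring(pmml_text)
    for elem in root.iter():
        if local_name(elem.tag) == "MiningSchema":
            return [
                field.get("name")
                for field in elem
                # A MiningField without a usageType attribute is active by default
                if local_name(field.tag) == "MiningField"
                and field.get("usageType", "active") == "active"
            ]
    return []


def compare_columns(pmml_text, columns):
    """Return (missing, unexpected) column names relative to the model's active fields."""
    expected = active_field_names(pmml_text)
    missing = [name for name in expected if name not in columns]
    unexpected = [name for name in columns if name not in expected]
    return missing, unexpected


# Trimmed fragment for illustration only, not a complete model
sample = """<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
  <TreeModel functionName="classification">
    <MiningSchema>
      <MiningField name="Species" usageType="target"/>
      <MiningField name="Sepal.Length"/>
      <MiningField name="Petal.Length"/>
    </MiningSchema>
  </TreeModel>
</PMML>"""

print(compare_columns(sample, ["Sepal.Length", "Sepal.Width"]))
# (['Petal.Length'], ['Sepal.Width'])
```

With pandas, `list(df.columns)` can be passed as the `columns` argument.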
from pypmml import Model
@HelloLadsAndGents Please do not post PyPMML issues to the JPMML project (anywhere under https://github.com/jpmml).
This is the final warning. If you keep posting PyPMML issues here, you will be blocked for spamming.
got it
And yet DataFrameMapper somehow finds a way to maneuver around these checks
Despite the above comment (https://github.com/jpmml/jpmml-sklearn/issues/155#issuecomment-723248300), I'm unable to reproduce this behaviour using JPMML-SkLearn version 1.6.7.
How is it possible to bypass this DataField element uniqueness check?
https://github.com/jpmml/jpmml-sklearn/blob/1.6.7/src/main/java/sklearn_pandas/DataFrameMapper.java#L63-L66
This is the Java file: PMMLPrediction_java.zip
This is the unusable PMML file (it can be loaded by Python but not by Java): model_lgb_end_new.zip
This is the usable PMML file (it works in both Java and Python), to prove that the program works: model_xgb_end.zip
This is the error log:
Caused by: org.jpmml.evaluator.InvalidElementException: Element DataField is not valid
at org.jpmml.evaluator.IndexableUtil.buildMap(IndexableUtil.java:65)
at org.jpmml.evaluator.IndexableUtil.buildMap(IndexableUtil.java:54)
at org.jpmml.evaluator.ModelManager$2.load(ModelManager.java:620)
at org.jpmml.evaluator.ModelManager$2.load(ModelManager.java:616)
at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3444)
at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2193)
at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2152)
at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2042)
at com.google.common.cache.LocalCache.get(LocalCache.java:3850)
at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3874)
at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4799)
at org.jpmml.evaluator.CacheUtil.getValue(CacheUtil.java:51)
at org.jpmml.evaluator.ModelManager.<init>(ModelManager.java:117)
at org.jpmml.evaluator.ModelEvaluator.<init>(ModelEvaluator.java:88)
at org.jpmml.evaluator.mining.MiningModelEvaluator.<init>(MiningModelEvaluator.java:114)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.jpmml.evaluator.ModelManagerFactory.newModelManager(ModelManagerFactory.java:100)
at org.jpmml.evaluator.ModelEvaluatorFactory.newModelEvaluator(ModelEvaluatorFactory.java:38)
at org.jpmml.evaluator.ModelEvaluatorBuilder.build(ModelEvaluatorBuilder.java:115)
at com.upsmart.common.utils.PMMLPrediction.<init>(PMMLPrediction.java:31)
... 10 more
Any ideas for this situation? Thanks.