Closed nejatb closed 7 years ago
The fact that the PMML document contains fewer columns than the original Python data matrix is a feature, not a bug. The PMML conversion library keeps track of which features are actually used by the model, and eliminates all unused ones. This leads to smaller PMML files that perform better.
Isn't that what you want?
("selector", SelectKBest(chi2,k=500))
It reads: "Keep the 500 best features, and discard all others". If you compare (J)PMML and Scikit-Learn predictions, they will be exactly the same (absolute/relative precision of 1e-13 or better), so there is no issue here.
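As a minimal sketch of the selector's effect (using a synthetic dataset and smaller numbers than the pipeline above), you can inspect which columns survive `SelectKBest` with `get_support()` — a PMML converter would likewise drop the unselected columns from the model's input schema:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2

# Synthetic data: 20 input features, keep only the 5 best by chi2 score
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           random_state=42)
X = np.abs(X)  # chi2 requires non-negative feature values
selector = SelectKBest(chi2, k=5).fit(X, y)

# Boolean mask over the original columns; only the True ones survive
mask = selector.get_support()
print(mask.sum())                   # 5 features retained
print(selector.transform(X).shape)  # (200, 5)
```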
It's also worth pointing out that several model types may trigger further feature elimination. Prime examples are decision trees and their ensembles (e.g. random forests, gradient-boosted trees): if a feature is not involved in any tree split, it is also eliminated from the feature set (why compute a feature that the prediction logic never uses?).
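This tree-based elimination is easy to observe on a synthetic example: fit a tree on data where only one feature is informative, then check which feature indices actually appear in splits (this illustrative check uses the `tree_.feature` attribute, where -2 marks leaf nodes):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: the target depends only on feature 0; features 1-2 are noise
rng = np.random.RandomState(0)
X = rng.rand(500, 3)
y = (X[:, 0] > 0.5).astype(int)

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Collect the feature indices used at internal nodes; unused features
# could be dropped from the converted model's input schema
used = {f for f in tree.tree_.feature if f >= 0}
print(used)  # only feature 0 is used
```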
The JPMML-Converter library, which underlies the JPMML-SkLearn library, inspects the (Lib)SVM data structure to identify no-op columns (pay attention to the featureMask
local variable):
https://github.com/jpmml/jpmml-converter/blob/master/src/main/java/org/jpmml/converter/support_vector_machine/LibSVMUtil.java#L119-L194
Therefore, even if you originally intended to retain 500 features, the final PMML file may "only" contain 350 to 400 features: the "missing" features were eliminated because the (Lib)SVM data structure indicates that they have no discriminative power.
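A simplified Python analogue of that featureMask idea (not the actual JPMML-Converter logic, which is linked above): for a linear kernel, a column that is zero in every support vector contributes nothing to the decision function, so it can be flagged as a no-op:

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic data with a deliberately dead column
rng = np.random.RandomState(0)
X = rng.rand(100, 4)
X[:, 2] = 0.0  # feature 2 carries no information
y = (X[:, 0] + X[:, 1] > 1).astype(int)

svc = SVC(kernel="linear").fit(X, y)

# True where at least one support vector has a non-zero value;
# all-zero columns are no-ops for the linear kernel
feature_mask = np.any(svc.support_vectors_ != 0, axis=0)
print(feature_mask)  # feature 2 is flagged as a no-op column
```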
Hi. Sorry, I didn't find any group to file an issue in. I have written a PMMLPipeline as follows, but after the model is created, I only see field1 as the input field, even though all inputs include both fields. Do you have any ideas as to what might be causing this issue? ...