kushNumberTheory closed this issue 5 years ago
Is there any specific feature selection algorithm that runs before generating PMML?
There is an algo that implements "clear DataDictionary and TransformationDictionary elements from all field elements that are not needed by the model".
This is a "passive" PMML content compaction/optimization logic. This is not "active" ML feature selection logic.
My input data has around 30 columns, and my PMML shows around 20 columns.
It means that approximately 10 of your input columns are provably redundant/useless, in the sense that they are not needed by the final model. Why carry this baggage around?
TLDR: This is a feature, not a bug.
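To see which input columns were compacted away, one option is to compare the training column names against the DataField entries left in the generated PMML. The following is a minimal sketch using only the Python standard library; the inline PMML document and the column names (label, x1 ... x4) are made up for illustration and do not come from this issue.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed PMML document used only for illustration.
PMML_TEXT = """<PMML xmlns="http://www.dmg.org/PMML-4_3" version="4.3">
  <DataDictionary>
    <DataField name="label" optype="continuous" dataType="double"/>
    <DataField name="x1" optype="continuous" dataType="double"/>
    <DataField name="x3" optype="continuous" dataType="double"/>
  </DataDictionary>
</PMML>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def data_field_names(pmml_text):
    """Return the names of all DataField elements in document order."""
    root = ET.fromstring(pmml_text)
    return [el.get("name") for el in root.iter() if local_name(el.tag) == "DataField"]


def dropped_columns(input_columns, pmml_text):
    """Return the input columns that do not appear in the PMML DataDictionary."""
    kept = set(data_field_names(pmml_text))
    return [c for c in input_columns if c not in kept]


# Assumed input schema (hypothetical column names).
input_columns = ["label", "x1", "x2", "x3", "x4"]
print(dropped_columns(input_columns, PMML_TEXT))
```

Matching on the local tag name rather than a fixed namespace keeps the check independent of the PMML schema version that the converter emits.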
@vruusmann I looked at the code and got some understanding of it. Would you please verify it? Just correct me if I have misunderstood anything.
For Regression: columns that have 0 coefficients will not be shown in
@kushNumberTheory You've got it right.
A minor detail though - we're talking about model types, not mining functions. So, your first statement applies to the RegressionModel element, and your second statement to the TreeModel element (both classification- and regression-type decision tree models).
Seems correct/desired behaviour, no? Why would you want to have 0 coefficients in your regression model?
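The regression case discussed above can be illustrated with a small sketch. This is not the JPMML-SparkML implementation; it is only an assumed, simplified model of the idea that a term with a zero coefficient contributes nothing to the prediction, so its field no longer needs to be referenced. The function names and sample values are hypothetical.

```python
def prune_zero_coefficients(feature_names, coefficients, tol=0.0):
    """Keep only the (name, coefficient) pairs whose coefficient is non-zero.

    A coefficient counts as zero when its absolute value is <= tol.
    """
    if len(feature_names) != len(coefficients):
        raise ValueError("feature_names and coefficients must have equal length")
    return [(n, c) for n, c in zip(feature_names, coefficients) if abs(c) > tol]


def referenced_fields(kept_terms):
    """Return the field names that the pruned model still refers to."""
    return [name for name, _ in kept_terms]


# Hypothetical fitted linear model with two zero coefficients.
names = ["x1", "x2", "x3", "x4"]
coefs = [0.5, 0.0, -1.2, 0.0]

kept = prune_zero_coefficients(names, coefs)
print(referenced_fields(kept))
```

Under this simplified view, only the fields that survive pruning would need a declaration in the data dictionary, which matches the reduced column count reported in this issue.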
@vruusmann Actually, I was just curious about the HOW and the WHY. You are absolutely right to remove features that have no significance to the model, and it also reduces the size of the PMML.
@vruusmann Is it possible to add a switch option for this? At first it will be puzzling to people, as there are no instructions about this.
I am running into an issue where my input data has around 30 columns, but my PMML shows around 20 columns. Is there any specific feature selection algorithm that runs before generating PMML?
I know that for categorical columns we should apply OHE, and then they will be shown as categorical columns. But my data has only continuous columns, and the PMML still shows only some of them. Any reason for that?
I have tried generating PMML with different kinds of data, but it doesn't show all columns.
My pipeline has VectorAssembler and TrainValidationModel.
jpmml-sparkml - version: 1.2.X
spark version: 2.1.0