jpmml / jpmml-sparkml

Java library and command-line application for converting Apache Spark ML pipelines to PMML
GNU Affero General Public License v3.0

Does jpmml-sparkml use any feature selection algorithm? #84

kushNumberTheory closed 5 years ago

kushNumberTheory commented 5 years ago

I am running into an issue where my input data has around 30 columns, but the generated PMML shows only around 20. Is there any specific feature selection algorithm that runs before generating the PMML? I know that categorical columns need OHE (one-hot encoding) applied first, after which they are shown as categorical columns. But my data has only continuous columns, and the PMML still shows just some of them. Is there a reason for that? I have tried generating PMML with different kinds of data, but it never shows all columns. My pipeline has a VectorAssembler and a TrainValidationSplitModel.

jpmml-sparkml version: 1.2.X
Spark version: 2.1.0

vruusmann commented 5 years ago

> Is there any specific feature selection algo that runs before generating PMML?

There is an algorithm that removes from the DataDictionary and TransformationDictionary elements all field declarations that are not needed by the model.

This is a "passive" PMML content compaction/optimization logic. This is not "active" ML feature selection logic.
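The compaction described above can be pictured as a reachability check: start from the fields the fitted model references, follow derived fields back to their sources, and drop everything that is never reached. The following is a minimal sketch of that idea only, not the library's actual code; the function name `prune_unused_fields` and the plain list/dict stand-ins for DataDictionary and TransformationDictionary are assumptions made for illustration.

```python
def prune_unused_fields(data_fields, derived_fields, model_inputs):
    """Keep only the fields that the model needs, directly or indirectly.

    data_fields:    raw input field names (stand-in for DataDictionary).
    derived_fields: dict of derived field name -> list of source field names
                    (stand-in for TransformationDictionary).
    model_inputs:   field names that the fitted model actually references.
    """
    needed = set()
    stack = list(model_inputs)
    while stack:
        name = stack.pop()
        if name in needed:
            continue
        needed.add(name)
        # A derived field keeps its own source fields alive as well.
        stack.extend(derived_fields.get(name, []))

    kept_data = [f for f in data_fields if f in needed]
    kept_derived = {k: v for k, v in derived_fields.items() if k in needed}
    return kept_data, kept_derived


# Hypothetical example: four raw columns, two derived columns,
# and a model that only references "x1_scaled" and "x2".
data_fields = ["x1", "x2", "x3", "x4"]
derived_fields = {"x1_scaled": ["x1"], "x3_squared": ["x3"]}
model_inputs = ["x1_scaled", "x2"]

kept_data, kept_derived = prune_unused_fields(data_fields, derived_fields, model_inputs)
print(kept_data)     # ['x1', 'x2']
print(kept_derived)  # {'x1_scaled': ['x1']}
```

Nothing here ranks or scores features. A field disappears only because no part of the final model refers to it, which is the "passive" versus "active" distinction.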

> my input data has around 30 columns and in my PMML it shows around 20 columns.

It means that approximately 10 of your input columns are provably redundant/useless, in the sense that they are not needed by the final model. Why carry this baggage around?

TLDR: This is a feature, not a bug.
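To see exactly which input columns were left out, one option is to list the DataField names in the generated PMML and compare them with the training columns. The sketch below uses only the Python standard library; the inline `PMML_SAMPLE` document, the column names, and the helper names are hypothetical and written by hand for illustration, not output from jpmml-sparkml.

```python
import xml.etree.ElementTree as ET

# Hand-written, hypothetical PMML fragment used only for this example.
PMML_SAMPLE = """<PMML xmlns="http://www.dmg.org/PMML-4_3" version="4.3">
  <DataDictionary>
    <DataField name="x1" optype="continuous" dataType="double"/>
    <DataField name="x3" optype="continuous" dataType="double"/>
    <DataField name="y" optype="continuous" dataType="double"/>
  </DataDictionary>
</PMML>"""


def data_field_names(pmml_text):
    """Return the names of all DataField elements, in document order."""
    root = ET.fromstring(pmml_text)
    # Compare local tag names so the check does not depend on the
    # namespace URI of a particular PMML schema version.
    return [
        el.get("name")
        for el in root.iter()
        if el.tag.rsplit("}", 1)[-1] == "DataField"
    ]


def dropped_columns(input_columns, pmml_text):
    """Return the input columns that have no DataField in the PMML."""
    present = set(data_field_names(pmml_text))
    return [c for c in input_columns if c not in present]


print(dropped_columns(["x1", "x2", "x3", "x4"], PMML_SAMPLE))  # ['x2', 'x4']
```

For a real file, the same helpers could be applied to the text read from the PMML file on disk.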

kushNumberTheory commented 5 years ago

@vruusmann I looked at the code and got some understanding of it. Would you please verify it, and correct me if I have misunderstood anything?

For Regression: columns that have 0 coefficients will not be shown in the PMML.

For Classification: only those columns that are used at non-leaf nodes will be shown in the PMML, i.e. only those columns on which a split has occurred.

vruusmann commented 5 years ago

@kushNumberTheory You've got it right.

A minor detail though - we're talking about model types, not mining functions. So, your first statement applies to the RegressionModel element, and your second statement to the TreeModel element (both classification- and regression-type decision tree models).

Seems correct/desired behaviour, no? Why would you want to have 0 coefficients in your regression model?
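The two rules confirmed above can be illustrated with a small, self-contained sketch: a regression model keeps only features with a non-zero coefficient, and a tree model keeps only features that appear in a split. This is an illustration of the rules as stated in this thread, not the converter's implementation; the function names and the nested-dict tree layout are assumptions.

```python
def regression_active_features(feature_names, coefficients):
    """Features a regression model still needs: non-zero coefficient only."""
    return [name for name, coef in zip(feature_names, coefficients) if coef != 0.0]


def tree_split_features(node):
    """Features used by at least one split (non-leaf) node of a tree.

    Hypothetical layout: a leaf is {"prediction": value}; an internal node
    is {"feature": name, "left": node, "right": node}.
    """
    if "feature" not in node:
        return set()  # a leaf references no feature
    return (
        {node["feature"]}
        | tree_split_features(node["left"])
        | tree_split_features(node["right"])
    )


features = ["x1", "x2", "x3", "x4"]

# x2 and x4 have zero coefficients, so they would not be needed.
print(regression_active_features(features, [0.7, 0.0, -1.2, 0.0]))  # ['x1', 'x3']

# Only x2 and x4 are ever split on in this small tree.
tree = {
    "feature": "x2",
    "left": {"prediction": 0.0},
    "right": {
        "feature": "x4",
        "left": {"prediction": 1.0},
        "right": {"prediction": 0.0},
    },
}
print(sorted(tree_split_features(tree)))  # ['x2', 'x4']
```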

kushNumberTheory commented 5 years ago

@vruusmann Actually, I was just curious about the HOW and the WHY. You are absolutely right to remove features that have no significance to the model, and it also reduces the size of the PMML.

liumy601 commented 3 years ago

@vruusmann Is it possible to add a switch option for this? At first this behaviour is puzzling to people, as there are no instructions about it.