jpmml / jpmml-sklearn

Java library and command-line application for converting Scikit-Learn pipelines to PMML
GNU Affero General Public License v3.0

Support for KBinsDiscretizer #116

Status: Closed (jesusvasquezdeveloper closed this issue 3 years ago)

jesusvasquezdeveloper commented 5 years ago

Hi, I'm using sklearn2pmml to persist a simple model.

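import sklearn_pandas
from sklearn.cluster import KMeans
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import KBinsDiscretizer, LabelBinarizer
from sklearn2pmml.pipeline import PMMLPipeline

numeric_features = ['column1', 'column2', 'column3']
categorical_features = ['column4']

# One KBinsDiscretizer (2 bins) per numeric column
num_mapper = sklearn_pandas.DataFrameMapper(
    [([numeric_column], KBinsDiscretizer(2)) for numeric_column in numeric_features], df_out=True)

# One LabelBinarizer per categorical column
categorical_mapper = sklearn_pandas.DataFrameMapper(
    [([categorical_column], LabelBinarizer()) for categorical_column in categorical_features], df_out=True)

preprocessing = FeatureUnion(transformer_list=[('num_mapper', num_mapper), ('cat_mapper', categorical_mapper)])

pmmlpipeline = PMMLPipeline(steps=[
    ('preprocessing', preprocessing),
    ('cluster', KMeans(n_clusters=5))
])

# df is a pandas DataFrame that contains the four columns listed above
pmmlpipeline.fit(df)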

As usual, I split my data into categorical and numeric features and apply the respective preprocessing step to each. Note that I need to discretize all of my numeric variables. I can preprocess my data and fit the algorithm, but when I try to persist the model, the code raises this exception:

java.lang.IllegalArgumentException: The value object (Python class sklearn.preprocessing._discretization.KBinsDiscretizer) is not a supported Transformer
    at org.jpmml.sklearn.CastFunction.apply(CastFunction.java:43)
    at sklearn_pandas.DataFrameMapper.getTransformerList(DataFrameMapper.java:169)
    at sklearn_pandas.DataFrameMapper.initializeFeatures(DataFrameMapper.java:71)
    at sklearn.Initializer.encodeFeatures(Initializer.java:41)
    at sklearn.Transformer.updateAndEncodeFeatures(Transformer.java:85)
    at sklearn.pipeline.FeatureUnion.encodeFeatures(FeatureUnion.java:45)
    at sklearn.Transformer.updateAndEncodeFeatures(Transformer.java:85)
    at sklearn.pipeline.Pipeline.encodeFeatures(Pipeline.java:83)
    at sklearn2pmml.pipeline.PMMLPipeline.encodePMML(PMMLPipeline.java:203)
    at org.jpmml.sklearn.Main.run(Main.java:145)

Is there any other straightforward way to discretize DataFrame columns in a PMMLPipeline?

vruusmann commented 5 years ago

Is there any other straightforward way to discretize DataFrame columns

You could calculate bin thresholds manually, and then construct a sklearn2pmml.preprocessing.CutTransformer (a wrapper around the pandas.cut function): https://github.com/jpmml/sklearn2pmml/blob/master/sklearn2pmml/preprocessing/__init__.py#L52-L66
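As a rough illustration of the "calculate bin thresholds manually" step, here is a minimal standard-library sketch. It assumes equal-frequency (quantile) bins, which is my reading of what KBinsDiscretizer(2) does by default. The helper names quantile_edges and assign_bin and the sample values are hypothetical, not part of sklearn2pmml.

```python
from bisect import bisect_left
from statistics import quantiles


def quantile_edges(values, n_bins):
    """Return n_bins + 1 edges that split values into equal-frequency bins."""
    ordered = sorted(values)
    # method="inclusive" keeps the interior cut points within [min, max]
    interior = quantiles(ordered, n=n_bins, method="inclusive")
    return [ordered[0], *interior, ordered[-1]]


def assign_bin(value, edges):
    """Index of the right-closed bin (edges[i], edges[i + 1]] holding value.

    The first bin also includes edges[0]. Out-of-range values are clamped
    to the nearest end bin, which is a simplification of this sketch.
    """
    index = bisect_left(edges, value) - 1
    return min(max(index, 0), len(edges) - 2)


# Hypothetical sample values standing in for one numeric column
column1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

edges = quantile_edges(column1, n_bins=2)
codes = [assign_bin(value, edges) for value in column1]
```

The resulting edges list is the kind of value that could be passed as the bins argument of CutTransformer. The linked source is the place to confirm the exact constructor arguments and how it treats the interval boundaries.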

Why are you using two separate DataFrameMapper instances (and joining them with FeatureUnion afterwards)? A single DataFrameMapper instance can hold mappings for both continuous and categorical columns.