jpmml / jpmml-sklearn

Java library and command-line application for converting Scikit-Learn pipelines to PMML
GNU Affero General Public License v3.0
ContinuousDomain with multiple missing_values #79

Closed dibus2 closed 6 years ago

dibus2 commented 6 years ago

Hi,

I am trying to achieve something along the following lines:

PMMLPipeline([
  ('Domains', DataFrameMapper([
    (['Var1'], ContinuousDomain(missing_values=[np.nan, -9999, -9998, -9997], missing_value_replacement=-99))
  ]))
])

in which there are several values that could be considered as missing_values. What would be the best way to approach this? Similarly, I would like to create a feature that tracks whether there are missing values in a column:

def impute_missing_flag_col(X):
    return X.isnull().astype(int)

and then use a FunctionTransformer for instance:

DataFrameMapper([
  (['Var2'], [FunctionTransformer(impute_missing_flag_col)], {'input_df': True})
])

This works in Python, but I cannot export it to PMML.

Thanks for your suggestions.

Cheers,

F.

vruusmann commented 6 years ago

.. in which there are several values that could be considered as missing_values. What would be the best way to approach this?

It never occurred to me that Domain.missing_values should be an array-like attribute instead of a scalar attribute.

I can't think of a good workaround at the moment. It needs to be done "properly", by actually implementing it on both the Python and Java sides of the codebase.
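
Until that exists, one possible stopgap that stays outside the exported pipeline is to collapse the sentinel codes into NaN with pandas before fitting, so that only a single missing value marker remains. The following is a minimal sketch, assuming the sentinel codes from the question; the helper name collapse_sentinels and the sample data are hypothetical. Because this step runs in Python only, it would not be part of the PMML document, and the same recoding would have to be repeated by whatever feeds data to the deployed model.

```python
import numpy as np
import pandas as pd

# Sentinel codes taken from the question; adjust to the actual data.
SENTINELS = [-9999, -9998, -9997]

def collapse_sentinels(df, columns, sentinels=SENTINELS):
    """Hypothetical helper: return a copy of df in which the
    sentinel codes in `columns` are replaced by NaN."""
    out = df.copy()
    out[columns] = out[columns].replace(sentinels, np.nan)
    return out

# Made-up sample data for illustration only.
df = pd.DataFrame({"Var1": [1.5, -9999, np.nan, -9997, 3.0]})
clean = collapse_sentinels(df, ["Var1"])
```

After this step, a ContinuousDomain with only missing_value_replacement=-99, as in the question, would presumably have just NaN left to deal with.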

I would like to create a feature that tracks whether there are missing values in a column

There is a non-standard transformer class sklearn2pmml.preprocessing.ExpressionTransformer, which lets you check "missingness" using pandas.isnull(X) and pandas.notnull(X) functions: https://github.com/jpmml/sklearn2pmml/blob/master/sklearn2pmml/preprocessing/__init__.py#L44-L54

Something like this should do:

from sklearn2pmml.preprocessing import ExpressionTransformer

DataFrameMapper([
  (['Var2'], ExpressionTransformer("pandas.isnull(X['Var2'])"))
])
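
To see what flag this expression is expected to produce, the same check can be run in plain pandas, independent of sklearn2pmml. This is a small sketch with made-up data. Note that pandas.isnull returns booleans, so a cast is needed to obtain the 0/1 integers that impute_missing_flag_col in the question produces.

```python
import numpy as np
import pandas as pd

# Made-up sample column with two missing entries.
X = pd.DataFrame({"Var2": [0.3, np.nan, 1.2, np.nan]})

# Boolean missing-value indicator, as in the expression above.
flag = pd.isnull(X["Var2"])

# Cast to integers to match the 0/1 flag from the question.
flag_int = flag.astype(int)
print(flag_int.tolist())  # prints [0, 1, 0, 1]
```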