jpmml / jpmml-sklearn

Java library and command-line application for converting Scikit-Learn pipelines to PMML
GNU Affero General Public License v3.0

Cannot rename data fields using the `Alias` decorator #154

Closed testlambda693 closed 3 years ago

testlambda693 commented 3 years ago

Hi Villu,

I am trying to convert a feature to PMML by calling sklearn2pmml(pipeline, out_file).

I get the following error:

Nov 03, 2020 7:35:02 AM org.jpmml.sklearn.Main run
SEVERE: Failed to convert PKL to PMML
java.lang.IllegalArgumentException: User input field accountCity cannot be renamed
    at org.jpmml.sklearn.SkLearnEncoder.renameFeature(SkLearnEncoder.java:75)
    at sklearn2pmml.decoration.Alias.encodeFeatures(Alias.java:60)
    at sklearn.Transformer.encode(Transformer.java:70)
    at sklearn.compose.ColumnTransformer.encodeFeatures(ColumnTransformer.java:63)
    at sklearn.Transformer.encode(Transformer.java:70)
    at sklearn_pandas.DataFrameMapper.initializeFeatures(DataFrameMapper.java:73)
    at sklearn.Initializer.encodeFeatures(Initializer.java:48)
    at sklearn.Transformer.encode(Transformer.java:70)
    at sklearn.Composite.encodeFeatures(Composite.java:119)
    at sklearn2pmml.pipeline.PMMLPipeline.encodePMML(PMMLPipeline.java:210)
    at org.jpmml.sklearn.Main.run(Main.java:233)
    at org.jpmml.sklearn.Main.main(Main.java:151)

This is my code:

# check3
# def create_col_missmatch(mapper: DataFrameMapper, city1:str, city2:str , new_field:str):
#     feature = (
#     [city1,city2],
#     [               ColumnTransformer([
#                         ('city1',CategoricalDomain(dtype=str,with_data=False), [0]),
#                         ('city2',CategoricalDomain(dtype=str,with_data=False), [1]),
#                          ]),

#                     ColumnTransformer([
#                         ('city1',imputer2(), [0]),
#                         ('city2',imputer2('Miss2'), [1]),
#                          ]),

#                     ColumnTransformer([
#                         ('city1', StringNormalizer(function = "lowercase"), [0]),
#                         ('city2', StringNormalizer(function = "lowercase"), [1]),
#                     ])
#                     ,ExpressionTransformer(f"1 if (X[0] != X[1])  else 0")
#                     ,Alias(CastTransformer(int), name=new_field)
#     ]
#                 ,{'alias': new_field})
#     mapper.features = mapper.features + [feature]

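import numpy as np
from sklearn.impute import SimpleImputer
from sklearn2pmml.decoration import Alias

counter = 0  # assumed initial value; the original post does not show it
def imputer2(value: str = "Miss"):
    # Wrap a constant-value imputer in an Alias with a unique name
    global counter
    imputation = Alias(SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=value), name=f'imputer_{counter}')
    counter += 1
    return imputation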

But when I run the following, I get no errors:

# check2
def create_col_missmatch(mapper: DataFrameMapper, city1: str, city2: str, new_field: str):
    feature = (
        [city1, city2],
        [
            ColumnTransformer([
                ('city1', CastTransformer(str), [0]),
                ('city2', CastTransformer(str), [1]),
            ]),
            ColumnTransformer([
                ('city1', imputer2(), [0]),
                ('city2', imputer2('Miss2'), [1]),
            ]),
            ColumnTransformer([
                ('city1', StringNormalizer(function="lowercase"), [0]),
                ('city2', StringNormalizer(function="lowercase"), [1]),
            ]),
            ExpressionTransformer("1 if (X[0] != X[1]) else 0"),
            Alias(CastTransformer(int), name=new_field),
        ],
        {'alias': new_field},
    )
    mapper.features = mapper.features + [feature]

Thanks

vruusmann commented 3 years ago

FFS, the top line of the Java stack trace indicates that this error is thrown by the JPMML-SkLearn library (org.jpmml.sklearn.*). Why report it against the JPMML-Python library (org.jpmml.python.*) then?

vruusmann commented 3 years ago

Corrected the title of the issue - this error is thrown to indicate that the direct input column(s) to the pipeline cannot be renamed.

In PMML speak, you can rename DerivedField elements, but you cannot rename DataField elements. It's a small technical restriction in the current state of JPMML conversion libraries, where not all field references have been made properly rename/relocation-proof. Perhaps it will be lifted in future versions.

Right now, closing as "won't fix". If you're unhappy with direct input column names, then why don't you adjust your pandas.DataFrame column names accordingly? For example, if you have a column called "A" in pandas.DataFrame, then the first operation in your pipeline should not be to rename it to something else.
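A minimal sketch of that suggestion, assuming the raw data arrives with placeholder column names "A" and "B". The name "billingCity" is made up for illustration; only "accountCity" appears in the error above. The rename happens on the pandas.DataFrame itself, before anything is passed to the pipeline:

```python
import pandas as pd

# Raw data with column names we are not happy with (illustrative values)
df = pd.DataFrame({
    "A": ["Tallinn", "Tartu"],
    "B": ["tallinn", "Narva"],
})

# Rename the direct input columns up front, outside of the pipeline.
# "billingCity" is a hypothetical name, not taken from the original report.
df = df.rename(columns={"A": "accountCity", "B": "billingCity"})

print(list(df.columns))
```

The pipeline is then fitted against the renamed columns, so its first step has no reason to wrap an input column in Alias.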

vruusmann commented 3 years ago

As for the workaround: the [CategoricalDomain(), Alias(SimpleImputer(strategy = "constant", fill_value = $value))] Python code fragment is not very effective. It is better rewritten as [CategoricalDomain(missing_value_replacement = $value)].

vruusmann commented 3 years ago

Also, if you're dealing with string input columns, then SimpleImputer(missing_values=np.nan) is not effective at all. A string column cannot contain NumPy NaN values, ever.
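A small illustration, assuming the column ends up as a fixed-width NumPy string array (for example after a cast to str). An object-dtype column is a different case and is not covered by this sketch. The city name is made up:

```python
import numpy as np

# An object array can still hold a float NaN next to a string
raw = np.array(["Tallinn", np.nan], dtype=object)

# After a cast to str, the NaN is gone: it has become the text "nan"
as_str = raw.astype(str)
print(as_str.dtype.kind)
print(repr(as_str[1]))

# NaN-based detection no longer applies to the string array
try:
    np.isnan(as_str)
    nan_check_supported = True
except TypeError:
    nan_check_supported = False
print(nan_check_supported)
```

Once the values are strings, an imputer that looks for np.nan has nothing to match, which is why the missing value marker needs to be handled at the domain level instead.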