jpmml / jpmml-sklearn

Java library and command-line application for converting Scikit-Learn pipelines to PMML
GNU Affero General Public License v3.0

PMMLPipeline using 2 DataFrameMappers #156

Closed lqrz closed 3 years ago

lqrz commented 3 years ago

I'm using LightGBM's LGBMClassifier as my model, which expects its categorical features in numerical format.

For that reason, I'm trying to implement a PMMLPipeline that has 2 DataFrameMappers as preprocessing steps. The first one applies a LookupTransformer to map the categorical feature values from string to int, and the second one applies the CategoricalDomain/ContinuousDomain decorators to my feature columns.

AFAICT, it is not possible to have two DataFrameMapper steps in the current sklearn2pmml version.

Is there a way to go about this?

Note that I need the two DataFrameMappers to be executed sequentially (i.e. FeatureUnion does not apply here).

Thanks!


transforms = [ (col, LookupTransformer(mapping=mapping, default_value=unseen_encoding_code)), ..., (col, None), ... ]
type_decorations = [ (col, CategoricalDomain()), ..., (col, ContinuousDomain()), ... ]

preprocessing_pipeline_categorical_encoding = DataFrameMapper(
    features=transforms,
    sparse=False,
    df_out=True,
    input_df=True,
    drop_cols=None
)

preprocessing_pipeline_type_decorations = DataFrameMapper(
    features=type_decorations,
    sparse=False,
    df_out=True,
    input_df=True,
    drop_cols=None
)

pipeline = PMMLPipeline([
    ('preprocess_categorical_encoding', preprocessing_pipeline_categorical_encoding),
    ('preprocess_type_decorations', preprocessing_pipeline_type_decorations),
    ('classifier', LGBMClassifier(...))
])
lqrz commented 3 years ago

I tried passing the LookupTransformer and the decorator in a transformer list (which would have allowed me to use a single DataFrameMapper), but I get the following error:

transforms = [("FIELD_1", [ LookupTransformer(...), CategoricalDomain() ]), ...]

preprocessing_pipeline_categorical_encoding = DataFrameMapper(
    features=transforms,
    sparse=False,
    df_out=True,
    input_df=True,
    drop_cols=None
)

pipeline = PMMLPipeline([
    ('preprocess_categorical_encoding', preprocessing_pipeline_categorical_encoding),
    ('classifier', LGBMClassifier(...))
])

Error:

SEVERE: Failed to convert
java.lang.IllegalArgumentException: Field lookup(FIELD_1) is not decorable
    at sklearn2pmml.decoration.Domain.asWildcardFeature(Domain.java:215)
    at sklearn2pmml.decoration.CategoricalDomain.encodeFeatures(CategoricalDomain.java:76)
    at sklearn.Transformer.updateAndEncodeFeatures(Transformer.java:118)
    at sklearn_pandas.DataFrameMapper.initializeFeatures(DataFrameMapper.java:73)
    at sklearn.Initializer.encodeFeatures(Initializer.java:41)
    at sklearn.Transformer.updateAndEncodeFeatures(Transformer.java:118)
    at sklearn.Composite.encodeFeatures(Composite.java:129)
    at sklearn2pmml.pipeline.PMMLPipeline.encodePMML(PMMLPipeline.java:208)
    at org.jpmml.sklearn.Main.run(Main.java:145)
    at org.jpmml.sklearn.Main.main(Main.java:94)
vruusmann commented 3 years ago

AFAICT, it is not possible to have two DataFrameMapper steps in the current sklearn2pmml version.

If I understand your workflow correctly, there's nothing here that requires two or more DataFrameMapper instances - if you need to act on the same column multiple times, simply add more "column actions" to the same instance.

Before - don't do this:

mapper1 = DataFrameMapper([
  ('col', CategoricalDomain())
])
mapper2 = DataFrameMapper([
  ('col', LookupTransformer())
])

After - do this:

mapper = DataFrameMapper([
  ('col', CategoricalDomain()),
  ('col', LookupTransformer())
])

The first one applies a LookupTransformer to map the categorical feature values from string to int, and the second one applies the CategoricalDomain/ContinuousDomain decorators to my feature columns.

Why two mappings? Why don't you apply them sequentially?

mapper = DataFrameMapper([
  ('col', [CategoricalDomain(), LookupTransformer()]),
])

java.lang.IllegalArgumentException: Field lookup(FIELD_1) is not decorable

Decorator classes are designed to clarify the "intended interpretation" of a data column.

They can only be applied to data columns that enter the pipeline (DataField elements). They can not be applied to data columns that correspond to intermediate stages inside the pipeline (DerivedField elements).

You're trying to override the definition of a DerivedField element. Naturally, this cannot be allowed.

lqrz commented 3 years ago

Thanks for your quick response! If I understand correctly, what I need is to mark the output of my lookup transformation as "categorical". That is, if I place the transformations in that order (i.e. ('col', [CategoricalDomain(), LookupTransformer()])), then I get this error:

java.lang.IllegalArgumentException: Expected a false (off) categorical split mask for continuous feature lookup(FIELD_1), got true (on)

Sorry for replying on a closed issue... but I would like to understand this. Thanks again.

vruusmann commented 3 years ago

what I need is to mark the output of my Lookup transformation as "categorical"

The output of a LookupTransformer is already categorical (maps a categorical value to another categorical value).

Are you perhaps trying to map categorical string values to categorical integer values using LookupTransformer? It wasn't designed for that - you should be using good old LabelEncoder.

Feeding a categorical string column to LightGBM:

mapper = DataFrameMapper([
  ('str_col', [CategoricalDomain(), LookupTransformer(), LabelEncoder()])
])

In the above example, LookupTransformer maps strings from one value space to another value space (e.g. from 50 category levels to 4 category levels). The real conversion from string to integer (ordered integer, 0-based) happens with LabelEncoder.
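To make the two stages concrete, here is a plain-pandas sketch of the same idea (the dict and column values are made up): a dict plays the role of LookupTransformer's mapping argument, shrinking the string value space, and LabelEncoder then performs the actual string-to-integer conversion.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical mapping that collapses many raw levels into a few coarse ones;
# this is the role LookupTransformer plays (string -> string)
coarse_mapping = {
    "poodle": "dog", "beagle": "dog",
    "siamese": "cat", "persian": "cat",
}

raw = pd.Series(["poodle", "siamese", "beagle", "persian"])

# Stage 1: shrink the value space (output is still strings)
coarse = raw.map(coarse_mapping)

# Stage 2: LabelEncoder does the real string -> 0-based integer conversion
codes = LabelEncoder().fit_transform(coarse)

print(list(coarse))  # ['dog', 'cat', 'dog', 'cat']
print(list(codes))   # [1, 0, 1, 0] (classes are sorted: cat=0, dog=1)
```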

lqrz commented 3 years ago

I am indeed using the lookup transformer for mapping str -> int. I chose it because it would allow me to own and specify the mapping dictionary (which I need in this use case). I now understand it was a poor choice on my part.

Thanks for the clarification!

vruusmann commented 3 years ago

I chose it because it would allow me to own and specify the mapping dictionary (which I need in this use case).

In principle, it should be possible to use LookupTransformer for "emulating" LabelEncoder. However, during conversion to PMML all integer values are replaced back with the original string values (e.g. all SimpleSetPredicate elements operate on human-friendly class labels, not some numbers). And remember - this back-replacement is a feature, not a bug!

java.lang.IllegalArgumentException: Expected a false (off) categorical split mask for continuous feature lookup(FIELD_1), got true (on)

For some reason, your LightGBM model still thinks that "FIELD_1" is continuous, not categorical.

A possible cause: the LookupTransformer didn't provide a correct mapping. Perhaps there are gaps in the integer sequence?

Also, did you set LightGBM's categorical_feature fit parameter?

TLDR: You can keep experimenting with LookupTransformer, as it should work eventually. But LabelEncoder should be way easier (because you're dealing with a very standard operation).