Closed lqrz closed 3 years ago
I tried passing the LookupTransformer and the decorator in a transformer list (this would have allowed me to use a single DataFrameMapper), but I get the following error:
transforms = [("FIELD_1", [LookupTransformer(...), CategoricalDomain()]), ...]

preprocessing_pipeline_categorical_encoding = DataFrameMapper(
    features=transforms,
    sparse=False,
    df_out=True,
    input_df=True,
    drop_cols=None
)

pipeline = PMMLPipeline([
    ('preprocess_categorical_encoding', preprocessing_pipeline_categorical_encoding),
    ('classifier', LGBMClassifier(...))
])
Error:
SEVERE: Failed to convert
java.lang.IllegalArgumentException: Field lookup(FIELD_1) is not decorable
at sklearn2pmml.decoration.Domain.asWildcardFeature(Domain.java:215)
at sklearn2pmml.decoration.CategoricalDomain.encodeFeatures(CategoricalDomain.java:76)
at sklearn.Transformer.updateAndEncodeFeatures(Transformer.java:118)
at sklearn_pandas.DataFrameMapper.initializeFeatures(DataFrameMapper.java:73)
at sklearn.Initializer.encodeFeatures(Initializer.java:41)
at sklearn.Transformer.updateAndEncodeFeatures(Transformer.java:118)
at sklearn.Composite.encodeFeatures(Composite.java:129)
at sklearn2pmml.pipeline.PMMLPipeline.encodePMML(PMMLPipeline.java:208)
at org.jpmml.sklearn.Main.run(Main.java:145)
at org.jpmml.sklearn.Main.main(Main.java:94)
AFAICT, it is not possible to have two DataFrameMapper steps in the current sklearn2pmml version.
If I understand your workflow correctly, there's nothing that would require two or more DataFrameMapper instances to be present - if you need to act on the same column multiple times, you can simply add more "column actions" to the same instance.
Before - don't do this:
mapper1 = DataFrameMapper([
    ('col', CategoricalDomain())
])
mapper2 = DataFrameMapper([
    ('col', LookupTransformer())
])
After - do this:
mapper = DataFrameMapper([
    ('col', CategoricalDomain()),
    ('col', LookupTransformer())
])
The first one applies a LookupTransformer to map the categorical feature values from string to int, and the second one applies the CategoricalDomain, ContinuousDomain decorators to my feature columns.
Why two mappings? Why don't you apply them sequentially?
mapper = DataFrameMapper([
    ('col', [CategoricalDomain(), LookupTransformer()]),
])
java.lang.IllegalArgumentException: Field lookup(FIELD_1) is not decorable
Decorator classes are designed to clarify the "intended interpretation" of a data column. They can only be applied to data columns that enter the pipeline (DataField elements). They cannot be applied to data columns that correspond to intermediate stages inside the pipeline (DerivedField elements).
You're trying to override the definition of a DerivedField element. Naturally, this cannot be allowed.
Thanks for your quick response!
If I understand correctly, what I need is to mark the output of my Lookup transformation as "categorical". However, if I place the transformations in that order (i.e. ('col', [CategoricalDomain(), LookupTransformer()])), then I get this error:
java.lang.IllegalArgumentException: Expected a false (off) categorical split mask for continuous feature lookup(FIELD_1), got true (on)
Sorry for replying on a closed issue... but I would like to understand this. Thanks again.
what I need is to mark the output of my Lookup transformation as "categorical"
The output of a LookupTransformer is already categorical (it maps a categorical value to another categorical value).
Are you perhaps trying to map categorical string values to categorical integer values using LookupTransformer? It wasn't designed for that - you should be using the good old LabelEncoder.
Feeding a categorical string column to LightGBM:
mapper = DataFrameMapper([
    ('str_col', [CategoricalDomain(), LookupTransformer(), LabelEncoder()])
])
In the above example, LookupTransformer maps strings from one value space to another value space (e.g. from 50 category levels to 4 category levels). The actual conversion from string to integer (ordered, 0-based) happens in LabelEncoder.
I am indeed using the LookupTransformer for mapping str -> int. I chose it because it would allow me to own and specify the mapping dictionary (which I need in this use case). I understand now that it was a poor choice on my side.
Thanks for the clarification!
I chose it because it would allow me to own and specify the mapping dictionary (which I need in this use case).
In principle, it should be possible to use LookupTransformer for "emulating" LabelEncoder. However, during conversion to PMML, all integer values are replaced back with the original string values (e.g. all SimpleSetPredicate elements operate on human-friendly class labels, not some numbers). And remember - this back-replacement is a feature, not a bug!
java.lang.IllegalArgumentException: Expected a false (off) categorical split mask for continuous feature lookup(FIELD_1), got true (on)
For some reason, your LightGBM model still thinks that "FIELD_1" is continuous, not categorical.
Possible cause - the LookupTransformer didn't provide a correct mapping. Perhaps there are gaps in the integer sequence?
Also, did you set LightGBM's categorical_feature fit parameter?
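The "gaps in the sequence" diagnostic can be checked mechanically. The helper below is hypothetical (not part of any library), and the premise it encodes - that a hand-written str -> int mapping should cover a contiguous 0-based range, the way LabelEncoder's output does - is one plausible explanation for the converter error above, not a confirmed root cause:

```python
# Hypothetical helper for the "gaps in the sequence" diagnostic: a hand-written
# str -> int mapping that skips integers does not look like LabelEncoder output
# (contiguous 0-based codes), which is one plausible cause of the error above.

def is_contiguous_zero_based(mapping):
    """Check that a str -> int mapping covers exactly 0 .. n-1 with no gaps."""
    codes = sorted(mapping.values())
    return codes == list(range(len(codes)))

good = {"A": 0, "B": 1, "C": 2}
bad = {"A": 0, "B": 2, "C": 5}   # gaps at 1, 3 and 4

print(is_contiguous_zero_based(good))  # True
print(is_contiguous_zero_based(bad))   # False
```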
TLDR: You can keep experimenting with LookupTransformer, as it should work eventually. But LabelEncoder should be way easier (because you're dealing with a very standard operation).
I'm using LightGBM's LGBMClassifier as my model, which expects its categorical features in numerical format.
For that reason, I'm trying to implement a PMMLPipeline that has two DataFrameMappers as preprocessing steps. The first one applies a LookupTransformer to map the categorical feature values from string to int, and the second one applies the CategoricalDomain, ContinuousDomain decorators to my feature columns.
AFAICT, it is not possible to have two DataFrameMapper steps in the current sklearn2pmml version.
Is there a way to go about this?
Note that I need the two DataFrameMappers to be executed in order (i.e. FeatureUnion does not apply here).
Thanks!