jpmml / sklearn2pmml

Python library for converting Scikit-Learn pipelines to PMML
GNU Affero General Public License v3.0

LookupTransformer TypeError when default value is not exactly the type from the mapping #403

Closed danmar3 closed 6 months ago

danmar3 commented 9 months ago

Hi, not sure if this is intended behavior, but the following initialization fails with a TypeError:

import numpy as np
from sklearn2pmml.preprocessing import LookupTransformer

LookupTransformer({'a': 1.1, 'b': 2.0, 'c': 3.0}, np.float64(0))

This fails with:

sklearn2pmml/preprocessing/__init__.py:345, in LookupTransformer.__init__(self, mapping, default_value)
    343     if v_type is not None:
    344         if type(default_value) != v_type:
--> 345             raise TypeError("Default value is not a {0}".format(v_type.__name__))
    346 self.default_value = default_value

TypeError: Default value is not a float

I think this can be fixed by relaxing the strict equality check on this line to not isinstance(default_value, v_type)
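To illustrate the suggestion: numpy.float64 is a subclass of the builtin float, so an isinstance-based check would accept the value that the strict type() comparison rejects. A minimal sketch:

```python
import numpy as np

default_value = np.float64(0)

# The strict identity check used by LookupTransformer fails here:
print(type(default_value) == float)      # False

# The proposed relaxed check passes, because np.float64 subclasses float:
print(isinstance(default_value, float))  # True
print(issubclass(np.float64, float))     # True
```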

Thank you

vruusmann commented 9 months ago

This type check is intended to block situations where the mapped values and the default value are of completely different data types. For example, the former are floats, while the latter is an int or str.

Mismatched data types can cause hard-to-debug issues, because different types render differently when printed to text documents. Such mismatches are rare (they affect only a small number of values), but when they do occur, they are very tricky to track down.

However, in this issue you're suggesting that the type check should be relaxed, so that in addition to the strict type equality check, type-is-a-subtype-of checks would also be allowed. The case in point being that numpy.float64 should be considered a subclass of builtins.float?

import numpy

print(type(1.0)) # <class 'float'>
print(type(numpy.float64(1.0))) # <class 'numpy.float64'>

print(type(1.0) == type(numpy.float64(1.0))) # False

Makes sense. Will fix in the next release, but I need to study the compatibility between Python's built-in types and Numpy types first.
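As a quick reference on that compatibility question, the subclass relationship between Numpy scalar types and Python built-ins is not uniform, which is why a blanket relaxation needs care. A sketch of the relevant checks:

```python
import numpy as np

# Only some Numpy scalar types subclass their Python counterparts:
print(issubclass(np.float64, float))  # True
print(issubclass(np.str_, str))       # True
print(issubclass(np.int64, int))      # False (in Python 3)
print(issubclass(np.bool_, bool))     # False
```

So an isinstance-based check would transparently accept numpy.float64 defaults, but would still reject numpy.int64 and numpy.bool_ ones.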

vruusmann commented 6 months ago

I've been thinking about this issue, and I've decided to keep the current behaviour unchanged (ie. "won't fix").

It still makes sense to require absolute type identity, both on the key side and on the value side. Once we start mixing types, the dict lookup behaviour may break down (ie. a float dict key may or may not match a numpy.float64 lookup key), or the returned values may become variable in type (ie. sometimes you get float values, other times numpy.float64 values). The effects would cascade down the pipeline and break some unsuspecting step much farther down.
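The variable-return-type concern can be demonstrated with a small, hypothetical mapping that mixes Python and Numpy scalars on the value side:

```python
import numpy as np

# A mapping with mixed value representations (hypothetical example):
mapping = {"a": 1.0, "b": np.float64(2.0)}

# The type of the looked-up value now depends on the key - this is the
# kind of inconsistency that cascades into downstream pipeline steps:
print(type(mapping["a"]))  # <class 'float'>
print(type(mapping["b"]))  # <class 'numpy.float64'>
```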

If there's a chance of mixed type objects - most notably a mix of Python and Numpy scalars - then they should be normalized to either representation. I'd personally unpack Numpy scalars using the following helper:

def _normalize(x):
  if hasattr(x, "item"):
    return x.item()
  return x

raw_mapping = ...
raw_default_value = ...

transformer = LookupTransformer(
  mapping = {k : _normalize(v) for k, v in raw_mapping.items()},
  default_value = _normalize(raw_default_value)
)