jpmml / jpmml-sklearn

Java library and command-line application for converting Scikit-Learn pipelines to PMML
GNU Affero General Public License v3.0

Simple pre-processing transformation - a "Find/Replace" action for numerical features #62

Open asafbsimplex opened 6 years ago

asafbsimplex commented 6 years ago

I use sklearn2pmml for a bit of pre-processing and for using my model in production. I have the following problem - I have a numerical feature that I receive as a list with text items, that holds numbers, missing values(NaN's) and also textual characters, for example:

feature = [ "2", "3", "NaN", "4", "ten", "eight", "22" ]

I receive the feature as a textual list, and my only control over the feature values is through the "DataFrameMapper", which I use to prepare it for prediction. I can't seem to find any function in "sklearn2pmml" or "sklearn_pandas" that will simply replace the textual items in the list according to my mapping dict ( {"ten":10...} and so on) and turn them into numerical types. Would this be possible with sklearn2pmml? Can you suggest a workaround?

My code looks roughly like this:

mapper = DataFrameMapper([
    (continuous_features, [StandardScaler(), Imputer(strategy='median'), ....."Find/Replace the text values.....]),
    (categorical_features, LabelBinarizer()),  # for one-hot encoding
])

clf = PMMLPipeline([
    ("mapper", mapper),
    ("classifier", RandomForestClassifier())
])

clf.fit_transform(data)

vruusmann commented 6 years ago

I have a numerical feature that I receive as a list with text items, that holds numbers, missing values(NaN's) and also textual characters

Let's clarify the "definition" of your feature a bit:

  1. Is it a list-type feature (eg. [NaN, 1, two]) or a scalar-type feature (eg. NaN, 1 or two)? Most Scikit-Learn transformers operate on scalar-type features. Therefore, if you're dealing with a list-type feature, then the first activity should be to "collapse" it into a scalar-type feature using some aggregation function - for example, by summing all list values, or taking their average.
  2. What is the numeric datatype? Is it integer, float, or a mix of them? You cannot have NaN values with the integer data type - Numpy/Scikit-Learn data matrices would require such missing values to be represented as None instead.
  3. What is the expected "vocabulary" of strings? Is it limited to ten strings (ie. zero, one, two, .., nine), or is it unlimited (eg. two thousand and seventeen)? Can we assume that the numbers are spelled correctly?
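The point in item 2 about NaN and integer data can be illustrated with plain Python. This is only a minimal sketch: it does not use NumPy (whose integer dtypes likewise have no NaN representation), and the helper name `to_int_or_none` is hypothetical.

```python
import math

nan = float("nan")

# NaN is a floating-point value; there is no integer equivalent.
assert isinstance(nan, float) and math.isnan(nan)

def to_int_or_none(value):
    """Convert a float to int, mapping NaN to None (hypothetical helper)."""
    if math.isnan(value):
        return None
    return int(value)

try:
    int(nan)
    nan_converts_to_int = True
except ValueError:
    # int(float("nan")) raises ValueError: NaN cannot be converted to an integer
    nan_converts_to_int = False

# An "integer" column with a missing value has to use None (or become float)
column = [to_int_or_none(v) for v in [1.0, nan, 3.0]]
print(nan_converts_to_int, column)
```

In other words, a feature that mixes integers with missing values either needs a float representation or an explicit missing-value marker such as None.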

If you only want to perform string to integer conversion, and your "vocabulary" is fixed, then the easiest solution would be to use the sklearn2pmml.preprocessing.LookupTransformer transformation, which encodes a lookup table:

vocabulary = {"one" : 1, "two" : 2, "three" : 3}
string2int = LookupTransformer(vocabulary, None) # If the string isn't contained in the vocabulary, then return missing/None value

Of course, there needs to be a way to represent the following business logic:

value_as_int = string2int.transform(value)
# The string2int converter found a match for the input value, so return it
if value_as_int is not None:
  return value_as_int
# Otherwise, return the input value unmodified (as it already appears to be an integer)
return value
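
A minimal plain-Python sketch of this business logic, using a `dict` in place of `LookupTransformer`. The function name `sanitize` is hypothetical, and this only illustrates the intended behaviour on the Python side; it says nothing about how the logic would be encoded in PMML.

```python
vocabulary = {"one": 1, "two": 2, "three": 3}

def sanitize(value, vocabulary):
    """Return the mapped integer if the value is in the vocabulary, else the value unchanged."""
    value_as_int = vocabulary.get(value)  # None when there is no match
    if value_as_int is not None:
        return value_as_int
    # No match: assume the value is already numeric and pass it through
    return value

print([sanitize(v, vocabulary) for v in ["one", 2, "three", 4]])
# [1, 2, 3, 4]
```
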

vruusmann commented 6 years ago

A viable solution would be to first create a custom Scikit-Learn transformer class (let's call it "NumericValueSanitizer"), and then create a corresponding SkLearn2PMML/JPMML-SkLearn plugin. Please refer to the sklearn2pmml-plugin project for more information.

If you can do the Python side of this NumericValueSanitizer transformation, then I could help you with the Java side - this looks like good demo material.

asafbsimplex commented 6 years ago

Hi Vruusmann, Thanks for the quick response! About clarifying the feature's definition -

  1. It's a list type feature which is supposed to be transformed to a scalar after the transformation done in the pmml.
  2. The numeric data type could be integer but also float.
  3. The "vocabulary" of strings is finite and does not have to consist of number names; it can also contain a generic string like "data is missing", which I'd like to replace with a specific float/integer value that represents this string.

I'm not that good of a Python programmer, but I can try to write the "NumericValueSanitizer" class. I hope I'll manage to do it by next week.

Thanks.

vruusmann commented 6 years ago

I'm not that good of a Python programmer, but I can try to write the "NumericValueSanitizer" class.

Just write down the main idea of your feature transformation logic (could be a standalone Python function, doesn't have to be a full-blown Scikit-Learn transformer class), and attach it to this issue when done. I'll then move it over to the sklearn2pmml-plugin project, and make it interoperable with the PMML conversion suite.

The fact that your feature is initially a list of values, not a scalar value, could be a bit of a problem both for Scikit-Learn and (J)PMML. But there are workarounds available.

asafbsimplex commented 6 years ago

I attached the transformer class (ValueSanitizer.txt); I hope it's in the right format. In case it did not upload correctly, the code is:

from sklearn.base import TransformerMixin

vocabulary_dict = {
    "one": 1,
    "three": 3,
    "NaN": -999,
    "data is missing": -1000
}

class ValueSanitizer(TransformerMixin):

    def __init__(self, vocabulary_dict):
        if not isinstance(vocabulary_dict, dict):
            raise ValueError("Input value for mapping is not a dict")
        self.vocabulary_dict = vocabulary_dict

    def fit(self, X, y=None):
        # Nothing to learn; the vocabulary is supplied up front
        return self

    def transform(self, X):
        # Replace items found in the vocabulary; leave all other items unchanged
        return [self.vocabulary_dict.get(item, item) for item in X]

If you would be able to add it to the transformations supported by the sklearn2pmml, it would be great. Thanks!
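
Note that a lookup alone leaves numeric strings such as "2" as strings. The following is a minimal sketch, in plain Python, of a variant that also parses the remaining items into floats. The function name `sanitize_column` and the replacement values in `vocabulary` are assumptions for illustration; they are not part of sklearn2pmml or of the attached class.

```python
def sanitize_column(values, vocabulary):
    """Map vocabulary strings to numbers, then parse the remaining numeric strings."""
    cleaned = []
    for item in values:
        if item in vocabulary:
            # Look up first: float("NaN") would otherwise parse to a float NaN
            cleaned.append(float(vocabulary[item]))
        else:
            # Raises ValueError for strings that are neither mapped nor numeric
            cleaned.append(float(item))
    return cleaned

# Example values from the original question; the replacement numbers are illustrative
vocabulary = {"ten": 10, "eight": 8, "NaN": -999}
feature = ["2", "3", "NaN", "4", "ten", "eight", "22"]

print(sanitize_column(feature, vocabulary))
# [2.0, 3.0, -999.0, 4.0, 10.0, 8.0, 22.0]
```

Doing the vocabulary lookup before the numeric parse matters here, because Python's `float()` accepts the string "NaN" and would silently produce a NaN instead of the intended sentinel value.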

vruusmann commented 6 years ago

What about the size/length of those list-type features?

Can we make an assumption that feature values can contain up to n elements (say, n = 10), or do we need to support arbitrary size/length lists (say, n >= 100)?

asafbsimplex commented 6 years ago

If I understand the question correctly, the size/length of the feature lists can be arbitrary: n is simply the batch size used for training/prediction.

The vocabulary dict, on the other hand, could be limited to, let's say, 10 items if needed. Just to clarify, the type of the numeric values of a feature would be the same for all values except for the "word-like" ones. For example:

feature1 = [ "2", "3", "NaN", "4", "ten", "eight", "22" ]

feature2 = [ "1.3242", "3.22", "NaN", "4.554", "Missing_data", "eight", "22.9983" ]

asafbsimplex commented 6 years ago

Hi Vruusmann, Did you by chance take a look at the class I sent you?