jpmml / jpmml-sklearn

Java library and command-line application for converting Scikit-Learn pipelines to PMML
GNU Affero General Public License v3.0

Data types #152

Closed testlambda693 closed 3 years ago

testlambda693 commented 3 years ago

Hi, I'm using sklearn2pmml to create a simple feature. The data type should be str, and I am looking for a specific value. When I try to convert the PMML to Java and Scala, they try to work with a double variable rather than a str. I tried adding CategoricalDomain(dtype=str), and it worked; however, in the PMML I see all the city-name values, and it takes a long time to process. Is there any way to define it as str without making it categorical? This is my code:

recorder.features = recorder.features + [(
    [col_to_pmml_dict["citiy_name"]],
    [
        # CastTransformer(str),
        CategoricalDomain(dtype=str),
        SimpleImputer(missing_values=np.nan, strategy='constant', fill_value='Miss'),
        SubstringTransformer(15, 20),
        Alias(ExpressionTransformer("0 if X[0] == ':[]}' else 1"), name='citiy_name_in_none_1'),
        Alias(CastTransformer(int), name="citiy_name_in_none")
    ],
    {'alias': "citiy_name_in_none"}
)]

vruusmann commented 3 years ago

when trying to convert the PMML to java and Scala. they are trying to work with double variable and not with str

How do you convert PMML to Java? Any examples?

is there any way to define is as str without the need to have it as categorical?

Categorical/Ordinal/Continuous are operational types. The optype defines the set of operations that are valid for a particular value. For example, categorical values can only be used in equality check operations (== and !=), whereas continuous values can also be used in comparison operations (<, <=, etc).
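As a rough illustration of that rule, here is a minimal Python sketch that maps an operational type to the operators it permits. The names OpType, VALID_OPS and is_valid are hypothetical and not part of PMML or sklearn2pmml, and the ordinal type is left out because its rules are not spelled out above.

```python
from enum import Enum


class OpType(Enum):
    """Hypothetical stand-in for two of the PMML operational types."""
    CATEGORICAL = "categorical"
    CONTINUOUS = "continuous"


EQUALITY_OPS = {"==", "!="}
COMPARISON_OPS = {"<", "<=", ">", ">="}

# Categorical values support equality checks only;
# continuous values support comparison operations as well.
VALID_OPS = {
    OpType.CATEGORICAL: EQUALITY_OPS,
    OpType.CONTINUOUS: EQUALITY_OPS | COMPARISON_OPS,
}


def is_valid(optype: OpType, operator: str) -> bool:
    """Return True if the operator is meaningful for the given optype."""
    return operator in VALID_OPS[optype]


print(is_valid(OpType.CATEGORICAL, "=="))  # equality on a city name
print(is_valid(OpType.CATEGORICAL, "<="))  # ordering on a city name
```

A string column such as a city name only ever needs the first group of operators, which is why it pairs with the categorical type.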

It does not make sense to combine a string datatype with a continuous operational type. Ordering comparisons between free-form strings, such as "one" <= "two" or "two" > "zero", are either rejected outright or evaluated lexicographically, which is non-sensical for the quantities the strings name.

For equality checks, the data type shouldn't matter much. Essentially, it should take the same amount of work to compare two string values as to compare two numeric values.

[CategoricalDomain(dtype=str), CastTransformer(int)]

Casting a free-form string value to an int? What do you expect to accomplish here?

For example, Estonia's capital is called "Tallinn". What do you think the following expression evaluates to?

String cityName = "Tallinn";
int cityNameAsInt = (int)cityName;
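The Java snippet above is rejected by the compiler, because a String cannot be cast to an int. For comparison, here is a small Python sketch of the same idea: int() only accepts strings that spell out an integer, so a city name raises ValueError at run time. The helper name cast_city_to_int is hypothetical.

```python
from typing import Optional


def cast_city_to_int(city_name: str) -> Optional[int]:
    """Try to cast a string to int; return None when the cast is meaningless."""
    try:
        return int(city_name)
    except ValueError:
        # Free-form text such as a city name has no integer interpretation.
        return None


print(cast_city_to_int("Tallinn"))
print(cast_city_to_int("42"))
```

Only the second call yields a number, which is why a cast to int belongs after the expression that produces 0 or 1, not on the raw city name.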

vruusmann commented 3 years ago

Closing with the resolution - "the feature request of casting free-form text to integers does not make sense"

testlambda693 commented 3 years ago

Hi,

My use case is like this:

I want to create a feature that is 1 if the city name == 'unknown' and 0 in all other cases.

For that, I need to get my city name as a string and compare it. My output is a feature with a different name and int as the output data type. When I use the categorical str data type, I get many lines like this in my PMML file:

<DataField dataType="string" name="city_name" optype="categorical">
    <Value value="None" />
    <Value value="aquitaine@gironde@fr" />
    <Value value="attiki@attiki@gr" />
    <Value value="bretagne@ille-et-vilaine@fr" />
    <Value value="catalunya@barcelona@es" />
    <Value value="comunidad de madrid@madrid@es" />
    <Value value="east midlands@derby@gb" />
    <Value value="great lakes@illinois@us" />
......

So is there a way to avoid it?

vruusmann commented 3 years ago

I want to create a feature that if the city name is =='unknown' i get 1 and on other cases it is 0.

ExpressionTransformer("1 if city_name == 'unknown' else 0")
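The logic of that expression can be checked in plain Python, outside sklearn2pmml. This is only a sketch of what the expression computes per row; the function name unknown_city_flag and the sample city names are made up for illustration.

```python
def unknown_city_flag(city_name: str) -> int:
    """Return 1 when the city name is exactly 'unknown', otherwise 0."""
    return 1 if city_name == "unknown" else 0


# Hypothetical sample rows; the comparison is an exact, case-sensitive match.
cities = ["unknown", "Tallinn", "Unknown"]
flags = [unknown_city_flag(city) for city in cities]
print(flags)
```

The result is already an integer, so the string is only ever used in an equality check and never needs to be cast.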

when i use the categorical str datatype in my PMML file i am getting many lines like So is there a way to avoid it?

What kind of problems are the DataField/Value child elements causing you? Do they just not look beautiful, or is it something else? These elements definitely don't affect model scoring performance in a substantial way.

However, to suppress DataField/Value child elements:

  1. Don't use CategoricalDomain with that column.
  2. Use CategoricalDomain, but explicitly ask it to refrain from capturing category levels: CategoricalDomain(with_data = False)
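To picture what the second option changes in the PMML file, here is a standard-library sketch that builds a DataField element with and without its Value children. It imitates the shape of the output only; the data_field helper is hypothetical and is not how sklearn2pmml or JPMML-SkLearn generate PMML.

```python
import xml.etree.ElementTree as ET


def data_field(name, values, with_data=True):
    """Build a categorical string DataField, optionally listing category levels."""
    field = ET.Element(
        "DataField",
        {"dataType": "string", "name": name, "optype": "categorical"},
    )
    if with_data:
        # One Value child per distinct category level seen in the data.
        for value in sorted(set(values)):
            ET.SubElement(field, "Value", {"value": value})
    return field


cities = ["attiki@attiki@gr", "catalunya@barcelona@es", "attiki@attiki@gr"]
full = data_field("city_name", cities, with_data=True)
bare = data_field("city_name", cities, with_data=False)

print(ET.tostring(full, encoding="unicode"))
print(ET.tostring(bare, encoding="unicode"))
```

In both cases the field is still declared as a categorical string; only the enumeration of city names is dropped.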

testlambda693 commented 3 years ago

Thanks, CategoricalDomain(with_data = False) helped.