Why is datatype continuous and double when the data is a string?

wuangKKK commented 4 years ago

Why is datatype continuous and double when the data is a string?

wuangKKK commented 4 years ago

vruusmann commented 4 years ago

Example?

wuangKKK commented 4 years ago

I saved the model like this

wuangKKK commented 4 years ago

@vruusmann

wuangKKK commented 4 years ago

pmml like this @vruusmann

vruusmann commented 4 years ago

See the source code of the DictVectorizer converter: https://github.com/jpmml/jpmml-sklearn/blob/master/src/main/java/sklearn/feature_extraction/DictVectorizer.java

Your feature is treated as numeric because the DictVectorizer.separator attribute is not specified.
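For context, scikit-learn's DictVectorizer encodes a string-valued entry as a feature named `key<separator>value` (the separator defaults to `=`), while a numeric entry keeps the bare key as its feature name. The sketch below illustrates how a field type could be inferred from such names. The function `infer_field_types` and the sample feature names are hypothetical and only illustrate the idea; this is not the converter's actual logic, for which the linked Java source is authoritative.

```python
def infer_field_types(feature_names, separator="="):
    """Guess (optype, dataType) per field from DictVectorizer-style feature names.

    Hypothetical helper: a name containing the separator is read as
    "field<separator>category" and the field is marked categorical+string;
    any other name is marked continuous+double.
    """
    field_types = {}
    for feature_name in feature_names:
        if separator in feature_name:
            # Split only on the first separator, so category values may contain it too
            field, _category = feature_name.split(separator, 1)
            field_types[field] = ("categorical", "string")
        else:
            field_types.setdefault(feature_name, ("continuous", "double"))
    return field_types


# Names shaped like DictVectorizer.feature_names_ for records such as
# {"Age": 38, "Gender": "Male", "Income": 81838.0}
feature_names = ["Age", "Gender=Female", "Gender=Male", "Income"]
print(infer_field_types(feature_names))
```

Under this reading, a feature whose name carries no separator would end up as continuous+double regardless of what the original column held.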

vruusmann commented 4 years ago

Looking at the source code of the DictVectorizer converter again, the field type is determined differently for new fields (this issue) and existing fields (my integration tests). This needs to be unified.

vruusmann commented 4 years ago

My code:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline
from xgboost import XGBClassifier

import pandas

df = pandas.read_csv("Audit.csv")

# All columns except the last one are features; convert rows to a list of dicts
df_X = df[df.columns.values[0:-1]]
df_X = df_X.to_dict("records")

# The target column
df_y = df["Adjusted"]

pipeline = PMMLPipeline([
    ("mapper", DictVectorizer()),
    ("classifier", XGBClassifier())
])
pipeline.fit(df_X, df_y)

# Convert the fitted pipeline to a PMML file
sklearn2pmml(pipeline, "Audit.pmml")
```

All continuous and categorical features are correctly detected as continuous+double and categorical+string, respectively.
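One way to check this on a generated file is to list the `DataField` elements of the PMML `DataDictionary` together with their `optype` and `dataType` attributes. The following is a minimal sketch using only the Python standard library. The helper names and the inline sample document are assumptions made for illustration; the sample is not the actual content of `Audit.pmml`.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    # Drop the "{namespace}" prefix so the check works across PMML schema versions
    return tag.rsplit("}", 1)[-1]


def list_data_fields(pmml_xml):
    """Return (name, optype, dataType) for every DataField element in a PMML document."""
    root = ET.fromstring(pmml_xml)
    return [
        (element.get("name"), element.get("optype"), element.get("dataType"))
        for element in root.iter()
        if local_name(element.tag) == "DataField"
    ]


# Hypothetical sample, shaped like a PMML DataDictionary
SAMPLE = """<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
  <DataDictionary>
    <DataField name="Age" optype="continuous" dataType="double"/>
    <DataField name="Gender" optype="categorical" dataType="string">
      <Value value="Female"/>
      <Value value="Male"/>
    </DataField>
  </DataDictionary>
</PMML>"""

for name, optype, data_type in list_data_fields(SAMPLE):
    print(name, optype, data_type)
```

To inspect a real file, the same function can be given the file's bytes, for example `list_data_fields(open("Audit.pmml", "rb").read())`.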

Closing as "not reproducible". Whatever the problem, it must be related to your own application code, not the JPMML-SkLearn/SkLearn2PMML stack.

jpmml / jpmml-sklearn

Why is datatype continuous and double when the data is a string? #123