jpmml / jpmml-sklearn

Java library and command-line application for converting Scikit-Learn pipelines to PMML
GNU Affero General Public License v3.0

Why is datatype continuous and double when the data is a string? #123

Closed wuangKKK closed 4 years ago

wuangKKK commented 4 years ago
Why is the data type continuous and double when the data is a string?
vruusmann commented 4 years ago

Example?

wuangKKK commented 4 years ago

image

wuangKKK commented 4 years ago

I saved the model like this: [image]

wuangKKK commented 4 years ago

@vruusmann

wuangKKK commented 4 years ago

The PMML looks like this: [image] @vruusmann

vruusmann commented 4 years ago

See the source code of the DictVectorizer converter: https://github.com/jpmml/jpmml-sklearn/blob/master/src/main/java/sklearn/feature_extraction/DictVectorizer.java

Your feature is considered numeric because the DictVectorizer.separator attribute is not specified.
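
For illustration, a minimal sketch of the scikit-learn side of this behaviour (it does not reproduce the converter's Java logic). DictVectorizer emits a bare feature name for a numeric value and a "name=value" feature name for a string value, joined by its separator attribute, which defaults to "=". The two records below are made up for the example.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical records: "Age" holds numbers, "Gender" holds strings.
records = [
    {"Age": 38, "Gender": "Male"},
    {"Age": 35, "Gender": "Female"},
]

vectorizer = DictVectorizer(sparse=False)
matrix = vectorizer.fit_transform(records)

# Numeric values keep the bare key; string values become one-hot
# columns named "<key><separator><value>".
feature_names = vectorizer.feature_names_

# Names that contain the separator come from string (categorical) values.
categorical = sorted({name.split(vectorizer.separator)[0]
                      for name in feature_names
                      if vectorizer.separator in name})
continuous = [name for name in feature_names
              if vectorizer.separator not in name]

print(feature_names)
print("categorical:", categorical, "continuous:", continuous)
```

If a value that is conceptually a category reaches DictVectorizer as a Python number, it gets a bare feature name in this sketch, so it is worth checking the value types in the input dicts.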

vruusmann commented 4 years ago

Looking at the source code of the DictVectorizer converter again, the field type is determined differently for new fields (this issue) and existing fields (my integration tests). This needs to be unified.

vruusmann commented 4 years ago

My code:

import pandas

from sklearn.feature_extraction import DictVectorizer
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline
from xgboost import XGBClassifier

df = pandas.read_csv("Audit.csv")

# Features: every column except the last one, as a list of per-row dicts
df_X = df[df.columns.values[0:-1]]
df_X = df_X.to_dict("records")

# Target column
df_y = df["Adjusted"]

# DictVectorizer keeps numeric values as-is and one-hot encodes string values
pipeline = PMMLPipeline([
    ("mapper", DictVectorizer()),
    ("classifier", XGBClassifier())
])
pipeline.fit(df_X, df_y)

# Convert the fitted pipeline to a PMML file
sklearn2pmml(pipeline, "Audit.pmml")

All continuous and categorical features are correctly detected as continuous+double and categorical+string, respectively.
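
One way to check this kind of result is to list the DataField entries of the generated PMML. The following is a minimal sketch using only the Python standard library. The inline PMML fragment and the list_data_fields helper are hypothetical, made up for illustration, and are not taken from the actual Audit.pmml.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment standing in for the DataDictionary of a PMML file.
SAMPLE_PMML = """<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
  <DataDictionary>
    <DataField name="Age" optype="continuous" dataType="double"/>
    <DataField name="Gender" optype="categorical" dataType="string"/>
  </DataDictionary>
</PMML>"""


def list_data_fields(pmml_text):
    """Return (name, optype, dataType) for every DataField element."""
    root = ET.fromstring(pmml_text)
    fields = []
    for elem in root.iter():
        # Strip the "{namespace}" prefix so any PMML schema version matches.
        if elem.tag.rsplit("}", 1)[-1] == "DataField":
            fields.append((elem.get("name"),
                           elem.get("optype"),
                           elem.get("dataType")))
    return fields


fields = list_data_fields(SAMPLE_PMML)
for name, optype, data_type in fields:
    print(name, optype, data_type)
```

To inspect a real file, the same helper could be called on the text read from it, for example the contents of "Audit.pmml".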

Closing as "not reproducible". Whatever the problem, it must be related to your own application code, not the JPMML-SkLearn/SkLearn2PMML stack.