jpmml / jpmml-evaluator

Java Evaluator API for PMML
GNU Affero General Public License v3.0

The evaluate method does not support sparse vector #130

Closed by cuckootan 6 years ago

cuckootan commented 6 years ago

My code looks like this:

public class JpmmlService {

    private static Map<String, Object> lrHeartInputMap = new HashMap<String, Object>() {{
        put("sbp", 142);
        put("tobacco", 2);
        put("ldl", 3);
        put("adiposity", 30);
        put("famhist", "Present");
        put("typea", 83);
        put("obesity", 23);
        put("alcohol", 90);
        put("age", 30);
    }};

    public static void main(String[] args) throws FileNotFoundException {

        String pmmlDataDir = ResourceUtils.getFile("classpath:pmml").getPath();
        PMML pmml = null;
        try (InputStream is = new FileInputStream(new File(pmmlDataDir + "/pmml.xml"))) {

            pmml = PMMLUtil.unmarshal(is);
        } catch (IOException | JAXBException | SAXException e) {
            e.printStackTrace();
        }
        if (pmml == null) {
            return;
        }

        ModelEvaluatorFactory modelEvaluatorFactory = ModelEvaluatorFactory.newInstance();
        ModelEvaluator<?> modelEvaluator = modelEvaluatorFactory.newModelEvaluator(pmml);

        List<InputField> inputFields = ((Evaluator) modelEvaluator).getInputFields();
        // Iterate over the model's raw input features, fetch the values from the profile, and use them as model input
        Map<FieldName, FieldValue> arguments = new LinkedHashMap<>();
        for (InputField inputField : inputFields) {
            FieldName inputFieldName = inputField.getName();
            Object rawValue = lrHeartInputMap.get(inputFieldName.getValue());
            FieldValue inputFieldValue = inputField.prepare(rawValue);
            arguments.put(inputFieldName, inputFieldValue);
        }

        Map<FieldName, ?> results = ((Evaluator) modelEvaluator).evaluate(arguments);
        List<TargetField> targetFields = ((Evaluator) modelEvaluator).getTargetFields();
        // Get the result. This is a regression-style example with a single output; classification problems and the like have multiple outputs.
        for (TargetField targetField : targetFields) {
            FieldName targetFieldName = targetField.getName();
            Object targetFieldValue = results.get(targetFieldName);
            System.out.println("target: " + targetFieldName.getValue() + " value: " + targetFieldValue);
        }
    }
}

It seems that the evaluate method does not support sparse vectors. When I remove some key-value pairs from lrHeartInputMap, the predicted result is null.

vruusmann commented 6 years ago

When I remove some key-value pairs from lrHeartInputMap, the predicted result is null.

The JPMML-Evaluator library performs scoring as specified in the PMML file. Apparently, your model contains the following specification: "if some field value (aka key-value pair) is missing, return a missing prediction".

This is not a bug. It simply means that your model cannot perform the scoring when the input data record is incomplete.
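To illustrate the behaviour described above, here is a minimal pure-Python sketch of a scorer that returns a missing prediction as soon as one input field is missing. It is not JPMML-Evaluator code; the `score` function and the coefficients are hypothetical and chosen only for illustration.

```python
import math
from typing import Mapping, Optional

# Hypothetical coefficients for illustration only; not taken from the real model.
COEFFICIENTS = {"sbp": 0.01, "tobacco": 0.08, "ldl": 0.17}
INTERCEPT = -3.0


def score(record: Mapping[str, Optional[float]]) -> Optional[float]:
    """Return P(chd = 1), or None if any required input is missing."""
    total = INTERCEPT
    for name, weight in COEFFICIENTS.items():
        value = record.get(name)
        if value is None:
            # One missing input makes the whole prediction missing.
            return None
        total += weight * value
    return 1.0 / (1.0 + math.exp(-total))


complete = {"sbp": 142, "tobacco": 2, "ldl": 3}
incomplete = {"sbp": 142, "tobacco": 2}  # "ldl" left out

print(score(complete))    # a probability between 0 and 1
print(score(incomplete))  # None
```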

cuckootan commented 6 years ago

Thanks for your reply. I'm a newbie, and I don't know how to export a PMML file for my model that can perform scoring when the input data record is incomplete.

The code that exports my PMML file is this:

import pandas
import numpy as np
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import MinMaxScaler, LabelBinarizer, FunctionTransformer
from sklearn import linear_model
from sklearn2pmml import PMMLPipeline, sklearn2pmml

heart_data = pandas.read_csv("pmml_test.csv")
# Define the feature engineering with the mapper
mapper = DataFrameMapper([
    (['sbp'], MinMaxScaler()),
    (['tobacco'], MinMaxScaler()),
    ('ldl', None),
    ('adiposity', None),
    (['famhist'], LabelBinarizer()),
    ('typea', None),
    ('obesity', None),
    ('alcohol', None),
    (['age'], FunctionTransformer(np.log)),
], sparse=True)

# Define the model, feature engineering, etc. with the pipeline
pipeline = PMMLPipeline([
    ('mapper', mapper),
    ("classifier", linear_model.LogisticRegression())
])

pipeline.fit(heart_data[heart_data.columns.difference(["chd"])], heart_data["chd"])
# Export the model file
sklearn2pmml(pipeline, "pmml.xml", with_repr=True)

pmml_test.csv:

sbp,tobacco,ldl,adiposity,famhist,typea,obesity,alcohol,age,chd
160,12,5.73,23.11,Present,49,25.3,97.2,52,1
144,0.01,4.41,28.61,Absent,55,28.87,2.06,63,1
118,0.08,3.48,32.28,Present,52,29.14,3.81,46,0
170,7.5,6.41,38.03,Present,51,31.99,24.26,58,1
134,13.6,3.5,27.78,Present,60,25.99,57.34,49,1
132,6.2,6.47,36.21,Present,62,30.77,14.14,45,0
142,4.05,3.38,16.2,Absent,59,20.81,2.62,38,0
114,4.08,4.59,14.6,Present,62,23.11,6.72,58,1
114,0,3.83,19.4,Present,49,24.86,2.49,29,0
132,0,5.8,30.96,Present,69,30.11,0,53,1
206,6,2.95,32.27,Absent,72,26.81,56.06,60,1
134,14.1,4.44,22.39,Present,65,23.09,0,40,1

Please help me solve this problem, thanks!

vruusmann commented 6 years ago

All correct, what's the problem?

Your Scikit-Learn pipeline requires "dense" input vectors, meaning that all nine fields (sbp, tobacco, ldl, adiposity, famhist, typea, obesity, alcohol and age) must have non-missing values. If you call PMMLPipeline#predict(X) with an incomplete row (for example, one that leaves out the sbp field), then Scikit-Learn would also give you a missing prediction (or fail with some error).

In that sense (J)PMML is exactly reproducing Scikit-Learn behaviour, which is the goal.

If you want your pipeline to work with missing values, then you should add the Imputer transformation to it: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Imputer.html
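Conceptually, an imputation step learns a replacement value per column during training and fills it in wherever a value is missing at prediction time. The sketch below shows mean imputation in plain Python; it is not the Scikit-Learn implementation, and `fit_means` and `impute` are hypothetical helper names. In current Scikit-Learn releases the corresponding class is `sklearn.impute.SimpleImputer`.

```python
from statistics import mean
from typing import Dict, List, Optional

Row = Dict[str, Optional[float]]


def fit_means(rows: List[Row], columns: List[str]) -> Dict[str, float]:
    """Learn one replacement value per column from the non-missing training values."""
    return {
        col: mean(row[col] for row in rows if row.get(col) is not None)
        for col in columns
    }


def impute(row: Row, means: Dict[str, float]) -> Dict[str, float]:
    """Return a dense copy of the row, filling missing fields with the learned means."""
    return {
        col: row[col] if row.get(col) is not None else fill
        for col, fill in means.items()
    }


# First three rows of pmml_test.csv, restricted to two columns.
training = [
    {"sbp": 160, "tobacco": 12},
    {"sbp": 144, "tobacco": 0.01},
    {"sbp": 118, "tobacco": 0.08},
]
means = fit_means(training, ["sbp", "tobacco"])

# "sbp" is absent from the input, so it is filled with the training mean.
print(impute({"tobacco": 2}, means))
```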

cuckootan commented 6 years ago

All correct, what's the problem?

OK. I trained an LR model with sklearn and saved it in PMML format with sklearn2pmml. Then I want to use JPMML-Evaluator to compute the score for an input feature map with some empty dimensions, but the evaluated score is "null". That's my problem.

For example, the training dataset has 9 features ("sbp, tobacco, ldl, adiposity, famhist, typea, obesity, alcohol, age"), and now I want to predict the score of an input whose features just include sbp, tobacco, ldl.

My code looks like this:

public class JpmmlService {

    private static Map<String, Object> lrHeartInputMap = new HashMap<String, Object>() {{
        put("sbp", 142);
        put("tobacco", 2);
        put("ldl", 3);
    }};

    public static void main(String[] args) throws FileNotFoundException {

        String pmmlDataDir = ResourceUtils.getFile("classpath:pmml").getPath();
        PMML pmml = null;
        try (InputStream is = new FileInputStream(new File(pmmlDataDir + "/pmml.xml"))) {

            pmml = PMMLUtil.unmarshal(is);
        } catch (IOException | JAXBException | SAXException e) {
            e.printStackTrace();
        }
        if (pmml == null) {
            return;
        }

        ModelEvaluatorFactory modelEvaluatorFactory = ModelEvaluatorFactory.newInstance();
        ModelEvaluator<?> modelEvaluator = modelEvaluatorFactory.newModelEvaluator(pmml);

        List<InputField> inputFields = ((Evaluator) modelEvaluator).getInputFields();
        // Iterate over the model's raw input features, fetch the values from the profile, and use them as model input
        Map<FieldName, FieldValue> arguments = new LinkedHashMap<>();
        for (InputField inputField : inputFields) {
            FieldName inputFieldName = inputField.getName();
            Object rawValue = lrHeartInputMap.get(inputFieldName.getValue());
            FieldValue inputFieldValue = inputField.prepare(rawValue);
            arguments.put(inputFieldName, inputFieldValue);
        }

        Map<FieldName, ?> results = ((Evaluator) modelEvaluator).evaluate(arguments);
        List<TargetField> targetFields = ((Evaluator) modelEvaluator).getTargetFields();
        // Get the result. This is a regression-style example with a single output; classification problems and the like have multiple outputs.
        for (TargetField targetField : targetFields) {
            FieldName targetFieldName = targetField.getName();
            Object targetFieldValue = results.get(targetFieldName);
            System.out.println("target: " + targetFieldName.getValue() + " value: " + targetFieldValue);
        }
    }
}

However, it outputs "null". According to your first answer, it seems that I trained and saved the model in the wrong way. But I don't know where the problem is.

vruusmann commented 6 years ago

I want to predict the score of an input whose features just include sbp, tobacco, ldl.

Your Scikit-Learn pipeline does not support incomplete input data records (only three data fields out of nine are available). Why do you expect (J)PMML to support it?

cuckootan commented 6 years ago

I want to use JPMML-Evaluator in a production environment where millions of features exist. During online prediction, it may take millions of bytes of memory per prediction if I pass every feature to the model. Because the input features are sparse, I want to pass only the non-zero fields to the model.

vruusmann commented 6 years ago

I want to use JPMML-Evaluator in a production environment where millions of features exist.

A million features per model is unrealistic.

Anyway, I just remembered that you can specify a replacement for missing values using the sklearn2pmml.decoration.(Categorical|Continuous)Domain transformation:

from sklearn.preprocessing import MinMaxScaler
from sklearn_pandas import DataFrameMapper
from sklearn2pmml.decoration import ContinuousDomain

mapper = DataFrameMapper([
    (['sbp'], [ContinuousDomain(missing_value_replacement=0), MinMaxScaler()])
])

For example, the above instructs JPMML-Evaluator to replace a missing sbp value with 0.
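One consequence worth checking is what the replaced value looks like after scaling. The sketch below assumes the replacement is applied to the raw value before the scaler, as the step order in the mapper suggests, and uses the default min-max formula with the sbp range found in pmml_test.csv. `prepare_sbp` is a hypothetical helper, not part of sklearn2pmml or JPMML-Evaluator.

```python
from typing import Optional

# Observed range of "sbp" in pmml_test.csv above.
SBP_MIN, SBP_MAX = 114.0, 206.0
REPLACEMENT = 0.0  # the missing_value_replacement from the snippet above


def prepare_sbp(raw: Optional[float]) -> float:
    """Replace a missing value first, then min-max scale it."""
    value = REPLACEMENT if raw is None else raw
    return (value - SBP_MIN) / (SBP_MAX - SBP_MIN)


print(prepare_sbp(160))   # inside [0, 1]
print(prepare_sbp(None))  # negative, because 0 lies below the training minimum
```

Under these assumptions, a replacement of 0 lands well outside the range the model saw during training, so a value closer to the training data (for example the column mean or median) may be a better choice.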