Closed cuckootan closed 6 years ago
When I remove some key-value pair in lrHeartInputMap, the predicted result is null.
The JPMML-Evaluator library performs scoring as specified in the PMML file. Apparently, your model contains the following specification: "if some field value (aka key-value pair) is missing, return a missing prediction".
This is not a bug. It simply means that your model cannot perform the scoring when the input data record is incomplete.
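As a rough illustration of the rule quoted above, the plain-Python sketch below mimics a scorer that returns a missing result as soon as any required input is absent. This is not JPMML-Evaluator code: the function name `score`, the coefficients, and the three-field subset are made up for illustration.

```python
import math

# Hypothetical coefficients for illustration only; not taken from any real model.
COEFFICIENTS = {"sbp": 0.01, "tobacco": 0.08, "ldl": 0.17}
INTERCEPT = -3.0


def score(record):
    """Return P(target = 1), or None if any required field is missing."""
    total = INTERCEPT
    for field, weight in COEFFICIENTS.items():
        value = record.get(field)
        if value is None:
            # A missing input propagates to a missing (null) prediction.
            return None
        total += weight * value
    return 1.0 / (1.0 + math.exp(-total))


complete = {"sbp": 142, "tobacco": 2, "ldl": 3}
incomplete = {"sbp": 142, "tobacco": 2}  # "ldl" left out

print(score(complete))    # a probability between 0 and 1
print(score(incomplete))  # None
```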
Thanks for your reply. I'm a newbie, and I don't know how to export a PMML file for my model that can perform scoring when the input data record is incomplete.
The code that exports my PMML file is this:
import pandas
import numpy as np
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import MinMaxScaler, LabelBinarizer, FunctionTransformer
from sklearn import linear_model
from sklearn2pmml import PMMLPipeline, sklearn2pmml
heart_data = pandas.read_csv("pmml_test.csv")
# Define the feature engineering with the mapper
mapper = DataFrameMapper([
    (['sbp'], MinMaxScaler()),
    (['tobacco'], MinMaxScaler()),
    ('ldl', None),
    ('adiposity', None),
    (['famhist'], LabelBinarizer()),
    ('typea', None),
    ('obesity', None),
    ('alcohol', None),
    (['age'], FunctionTransformer(np.log)),
], sparse=True)
# Define the model, feature engineering, etc. with the pipeline
pipeline = PMMLPipeline([
    ('mapper', mapper),
    ("classifier", linear_model.LogisticRegression())
])
pipeline.fit(heart_data[heart_data.columns.difference(["chd"])], heart_data["chd"])
# Export the model file
sklearn2pmml(pipeline, "pmml.xml", with_repr=True)
pmml_test.csv:
sbp,tobacco,ldl,adiposity,famhist,typea,obesity,alcohol,age,chd
160,12,5.73,23.11,Present,49,25.3,97.2,52,1
144,0.01,4.41,28.61,Absent,55,28.87,2.06,63,1
118,0.08,3.48,32.28,Present,52,29.14,3.81,46,0
170,7.5,6.41,38.03,Present,51,31.99,24.26,58,1
134,13.6,3.5,27.78,Present,60,25.99,57.34,49,1
132,6.2,6.47,36.21,Present,62,30.77,14.14,45,0
142,4.05,3.38,16.2,Absent,59,20.81,2.62,38,0
114,4.08,4.59,14.6,Present,62,23.11,6.72,58,1
114,0,3.83,19.4,Present,49,24.86,2.49,29,0
132,0,5.8,30.96,Present,69,30.11,0,53,1
206,6,2.95,32.27,Absent,72,26.81,56.06,60,1
134,14.1,4.44,22.39,Present,65,23.09,0,40,1
Please help me solve this problem, thanks!
All correct, what's the problem?
Your Scikit-Learn pipeline requires "dense" input vectors, meaning that all nine sbp, tobacco, ldl, adiposity, famhist, typea, obesity, alcohol and age fields must have non-missing values. If you do PMMLPipeline#predict(X) with an incomplete row (for example, do leave out the sbp field), then Scikit-Learn would also give you a missing prediction (or fail with some error).
In that sense (J)PMML is exactly reproducing Scikit-Learn behaviour, which is the goal.
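A minimal sketch of the Scikit-Learn side of this, assuming a scikit-learn release in which LogisticRegression rejects NaN inputs with a ValueError. The toy data and variable names are made up, and only two of the nine fields are used.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy training data (two features only), for illustration.
X = np.array([[160.0, 12.0], [144.0, 0.01], [118.0, 0.08], [170.0, 7.5]])
y = np.array([1, 1, 0, 1])

clf = LogisticRegression().fit(X, y)

# An incomplete row: the second feature is missing (NaN).
incomplete_row = np.array([[142.0, np.nan]])

try:
    clf.predict(incomplete_row)
    failed = False
except ValueError as exc:
    # scikit-learn refuses to score rows that contain NaN.
    failed = True
    print("predict failed:", exc)
```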
If you want to make your pipeline work with missing values, then you should include the Imputer transformation into it:
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Imputer.html
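Note that newer scikit-learn releases no longer ship sklearn.preprocessing.Imputer; its replacement is sklearn.impute.SimpleImputer. Below is a minimal sketch of mean imputation on toy values. Whether a given sklearn2pmml version can convert this transformer is an assumption to verify against its documentation.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy "sbp" column with one missing value.
sbp = np.array([[160.0], [144.0], [np.nan], [170.0]])

# Replace each missing value with the mean of the observed values.
imputer = SimpleImputer(strategy="mean")
filled = imputer.fit_transform(sbp)

print(filled.ravel())  # the NaN is replaced by the mean of the other three values
```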
All correct, what's the problem?
OK, I trained an LR model with sklearn and saved it in PMML format with sklearn2pmml. Then I want to use JPMML-Evaluator to compute the score for an input feature map with some empty dimensions, but the evaluated score is "null". That's my problem.
For example, the training dataset has 9 features, "sbp, tobacco, ldl, adiposity, famhist, typea, obesity, alcohol, age". Now I want to predict the score of an input whose features just include sbp, tobacco, ldl.
My code is like this:
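// Imports assume JPMML-Evaluator 1.4.x and Spring's ResourceUtils.
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import javax.xml.bind.JAXBException;

import org.dmg.pmml.FieldName;
import org.dmg.pmml.PMML;
import org.jpmml.evaluator.Evaluator;
import org.jpmml.evaluator.FieldValue;
import org.jpmml.evaluator.InputField;
import org.jpmml.evaluator.ModelEvaluator;
import org.jpmml.evaluator.ModelEvaluatorFactory;
import org.jpmml.evaluator.TargetField;
import org.jpmml.model.PMMLUtil;
import org.springframework.util.ResourceUtils;
import org.xml.sax.SAXException;

public class JpmmlService {
    private static Map<String, Object> lrHeartInputMap = new HashMap<String, Object>() {{
        put("sbp", 142);
        put("tobacco", 2);
        put("ldl", 3);
    }};

    public static void main(String[] args) throws FileNotFoundException {
        String pmmlDataDir = ResourceUtils.getFile("classpath:pmml").getPath();
        PMML pmml = null;
        try (InputStream is = new FileInputStream(new File(pmmlDataDir + "/pmml.xml"))) {
            pmml = PMMLUtil.unmarshal(is);
        } catch (IOException | JAXBException | SAXException e) {
            e.printStackTrace();
        }
        if (pmml == null) {
            return;
        }
        ModelEvaluatorFactory modelEvaluatorFactory = ModelEvaluatorFactory.newInstance();
        ModelEvaluator<?> modelEvaluator = modelEvaluatorFactory.newModelEvaluator(pmml);
        List<InputField> inputFields = ((Evaluator) modelEvaluator).getInputFields();
        // Iterate over the model's raw input features, fetching the data from the profile to use as model input
        Map<FieldName, FieldValue> arguments = new LinkedHashMap<>();
        for (InputField inputField : inputFields) {
            FieldName inputFieldName = inputField.getName();
            Object rawValue = lrHeartInputMap.get(inputFieldName.getValue());
            FieldValue inputFieldValue = inputField.prepare(rawValue);
            arguments.put(inputFieldName, inputFieldValue);
        }
        Map<FieldName, ?> results = ((Evaluator) modelEvaluator).evaluate(arguments);
        List<TargetField> targetFields = ((Evaluator) modelEvaluator).getTargetFields();
        // Get the results. In a regression example there is only one output; classification problems etc. have multiple outputs.
        for (TargetField targetField : targetFields) {
            FieldName targetFieldName = targetField.getName();
            Object targetFieldValue = results.get(targetFieldName);
            System.out.println("target: " + targetFieldName.getValue() + " value: " + targetFieldValue);
        }
    }
}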
However, it outputs "null". According to your first answer, it seems that I trained and saved the model in the wrong way, but I don't know where the problem is.
I want to predict the score of an input whose features just include sbp, tobacco, ldl.
Your Scikit-Learn pipeline does not support incomplete input data records (only three data fields out of nine are available). Why do you expect (J)PMML to support it?
I want to use JPMML-Evaluator in a production environment where millions of features exist. During online prediction, it may take millions of bytes of memory per prediction if I input all features into the model. Because the input features are sparse, I just want to input the non-zero fields into the model.
I want to use JPMML-Evaluator in a production environment where millions of features exist.
A million features per model is unrealistic.
Anyway, I just remembered that you could specify a missing value replacement value using the sklearn2pmml.decoration.(Categorical|Continuous)Domain transformation:
from sklearn2pmml.decoration import ContinuousDomain
mapper = DataFrameMapper([
    (['sbp'], [ContinuousDomain(missing_value_replacement = 0), MinMaxScaler()])
])
For example, the above would instruct JPMML-Evaluator to replace a missing sbp value with 0.
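One thing worth checking with a replacement value of 0: assuming the replacement is applied to the raw field value before MinMaxScaler runs (the domain decorator is the first step in the list), a missing sbp is scaled as if it were 0, which falls well outside the training range. The plain-Python sketch below works through that arithmetic using the sbp values from the pmml_test.csv sample. The helpers `min_max_scale` and `prepare_sbp` are hypothetical, not library API.

```python
# sbp values from the pmml_test.csv sample.
SBP = [160, 144, 118, 170, 134, 132, 142, 114, 114, 132, 206, 134]


def min_max_scale(value, lo, hi):
    """Scale value to [0, 1] the way MinMaxScaler does with default settings."""
    return (value - lo) / (hi - lo)


def prepare_sbp(raw, replacement=0):
    """Replace a missing value first, then scale (assumed order of steps)."""
    value = replacement if raw is None else raw
    return min_max_scale(value, min(SBP), max(SBP))


print(prepare_sbp(142))   # inside [0, 1]
print(prepare_sbp(None))  # negative: 0 lies far below the smallest training value

mean_sbp = sum(SBP) / len(SBP)
print(prepare_sbp(None, replacement=mean_sbp))  # inside [0, 1]
```

If that matters for the model, a replacement closer to typical values (for example, the training mean) may be a safer choice than 0.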
My code is like this:
It seems that the evaluate method does not support sparse vectors. When I remove some key-value pairs from lrHeartInputMap, the predicted result is null.