jpmml / jpmml-evaluator-python

PMML evaluator library for Python
GNU Affero General Public License v3.0

Problems when inputting values for date/datetime fields #16

Closed liuhuanshuo closed 2 years ago

liuhuanshuo commented 2 years ago

Hello Villu

I'm sorry, but I need your help again to troubleshoot a problem with PMML prediction.

Last week, I successfully converted my Python model to pmml.

When I used pypmml to make predictions, I found that the predicted values were inaccurate. So I followed your instructions and installed JPMML-Evaluator-Python.

However, when I used JPMML-Evaluator-Python, it didn't work properly and raised an error.

Here is my code, written according to the README:

from jpmml_evaluator import make_evaluator
from jpmml_evaluator.py4j import launch_gateway, Py4JBackend

# Launch the gateway
gateway = launch_gateway()

# Construct a Py4J backend based on the newly launched gateway
backend = Py4JBackend(gateway)

evaluator = make_evaluator(backend, "pipeline_test.pmml")
evaluator.evaluateAll(x_oot_1)

Here is the error traceback:

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
~/.local/lib/python3.7/site-packages/jpmml_evaluator/__init__.py in evaluateAll(self, arguments_df, nan_as_missing)
    128                 try:
--> 129                         result_records = self.backend.staticInvoke("org.jpmml.evaluator.python.PythonUtil", "evaluateAll", self.javaEvaluator, argument_records)
    130                 except Exception as e:

~/.local/lib/python3.7/site-packages/jpmml_evaluator/py4j.py in staticInvoke(self, className, methodName, *args)
     24                 javaMember = javaClass.__getattr__(methodName)
---> 25                 return javaMember(*args)
     26 

~/.local/lib/python3.7/site-packages/py4j/java_gateway.py in __call__(self, *args)
   1322         return_value = get_return_value(
-> 1323             answer, self.gateway_client, self.target_id, self.name)
   1324 

~/.local/lib/python3.7/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    327                     "An error occurred while calling {0}{1}{2}.\n".
--> 328                     format(target_id, ".", name), value)
    329             else:

Py4JJavaError: An error occurred while calling z:org.jpmml.evaluator.python.PythonUtil.evaluateAll.
: org.jpmml.evaluator.TypeCheckException: Expected date value, got double value
    at org.jpmml.evaluator.TypeUtil.toDate(TypeUtil.java:784)
    at org.jpmml.evaluator.TypeUtil.cast(TypeUtil.java:508)
    at org.jpmml.evaluator.TypeUtil.parseOrCast(TypeUtil.java:69)
    at org.jpmml.evaluator.ScalarValue.<init>(ScalarValue.java:33)
    at org.jpmml.evaluator.DiscreteValue.<init>(DiscreteValue.java:30)
    at org.jpmml.evaluator.OrdinalValue.<init>(OrdinalValue.java:38)
    at org.jpmml.evaluator.OrdinalValue.create(OrdinalValue.java:122)
    at org.jpmml.evaluator.FieldValue.create(FieldValue.java:364)
    at org.jpmml.evaluator.FieldValue.cast(FieldValue.java:109)
    at org.jpmml.evaluator.ExpressionUtil.evaluateTypedExpressionContainer(ExpressionUtil.java:72)
    at org.jpmml.evaluator.ExpressionUtil.evaluate(ExpressionUtil.java:86)
    at org.jpmml.evaluator.ModelEvaluationContext.resolve(ModelEvaluationContext.java:100)
    at org.jpmml.evaluator.EvaluationContext.evaluate(EvaluationContext.java:94)
    at org.jpmml.evaluator.ExpressionUtil.evaluateFieldRef(ExpressionUtil.java:226)
    at org.jpmml.evaluator.ExpressionUtil.evaluateExpression(ExpressionUtil.java:143)
    at org.jpmml.evaluator.ExpressionUtil.evaluate(ExpressionUtil.java:129)
    at org.jpmml.evaluator.ExpressionUtil.evaluateApply(ExpressionUtil.java:405)
    at org.jpmml.evaluator.ExpressionUtil.evaluateExpression(ExpressionUtil.java:167)
    at org.jpmml.evaluator.ExpressionUtil.evaluate(ExpressionUtil.java:129)
    at org.jpmml.evaluator.ExpressionUtil.evaluateExpressionContainer(ExpressionUtil.java:61)
    at org.jpmml.evaluator.ExpressionUtil.evaluateTypedExpressionContainer(ExpressionUtil.java:66)
    at org.jpmml.evaluator.ExpressionUtil.evaluate(ExpressionUtil.java:86)
    at org.jpmml.evaluator.ModelEvaluationContext.resolve(ModelEvaluationContext.java:100)
    at org.jpmml.evaluator.EvaluationContext.evaluate(EvaluationContext.java:94)
    at org.jpmml.evaluator.ExpressionUtil.evaluateFieldRef(ExpressionUtil.java:226)
    at org.jpmml.evaluator.ExpressionUtil.evaluateExpression(ExpressionUtil.java:143)
    at org.jpmml.evaluator.ExpressionUtil.evaluate(ExpressionUtil.java:129)
    at org.jpmml.evaluator.ExpressionUtil.evaluateApply(ExpressionUtil.java:405)
    at org.jpmml.evaluator.ExpressionUtil.evaluateExpression(ExpressionUtil.java:167)
    at org.jpmml.evaluator.ExpressionUtil.evaluate(ExpressionUtil.java:129)
    at org.jpmml.evaluator.ExpressionUtil.evaluateApply(ExpressionUtil.java:345)
    at org.jpmml.evaluator.ExpressionUtil.evaluateExpression(ExpressionUtil.java:167)
    at org.jpmml.evaluator.ExpressionUtil.evaluate(ExpressionUtil.java:129)
    at org.jpmml.evaluator.ExpressionUtil.evaluateExpressionContainer(ExpressionUtil.java:61)
    at org.jpmml.evaluator.ExpressionUtil.evaluateTypedExpressionContainer(ExpressionUtil.java:66)
    at org.jpmml.evaluator.ExpressionUtil.evaluate(ExpressionUtil.java:86)
    at org.jpmml.evaluator.ModelEvaluationContext.resolve(ModelEvaluationContext.java:100)
    at org.jpmml.evaluator.EvaluationContext.evaluate(EvaluationContext.java:94)
    at org.jpmml.evaluator.ModelEvaluationContext.resolve(ModelEvaluationContext.java:142)
    at org.jpmml.evaluator.EvaluationContext.evaluate(EvaluationContext.java:94)
    at org.jpmml.evaluator.PredicateUtil.evaluateSimplePredicate(PredicateUtil.java:101)
    at org.jpmml.evaluator.PredicateUtil.evaluatePredicate(PredicateUtil.java:73)
    at org.jpmml.evaluator.PredicateUtil.evaluate(PredicateUtil.java:63)
    at org.jpmml.evaluator.PredicateUtil.evaluatePredicateContainer(PredicateUtil.java:53)
    at org.jpmml.evaluator.tree.SimpleTreeModelEvaluator.evaluateTree(SimpleTreeModelEvaluator.java:122)
    at org.jpmml.evaluator.tree.SimpleTreeModelEvaluator.evaluateAny(SimpleTreeModelEvaluator.java:90)
    at org.jpmml.evaluator.tree.SimpleTreeModelEvaluator.evaluateRegression(SimpleTreeModelEvaluator.java:77)
    at org.jpmml.evaluator.ModelEvaluator.evaluateInternal(ModelEvaluator.java:443)
    at org.jpmml.evaluator.mining.MiningModelEvaluator.evaluateSegmentation(MiningModelEvaluator.java:595)
    at org.jpmml.evaluator.mining.MiningModelEvaluator.evaluateRegression(MiningModelEvaluator.java:231)
    at org.jpmml.evaluator.ModelEvaluator.evaluateInternal(ModelEvaluator.java:443)
    at org.jpmml.evaluator.mining.MiningModelEvaluator.evaluateInternal(MiningModelEvaluator.java:224)
    at org.jpmml.evaluator.mining.MiningModelEvaluator.evaluateSegmentation(MiningModelEvaluator.java:595)
    at org.jpmml.evaluator.mining.MiningModelEvaluator.evaluateClassification(MiningModelEvaluator.java:303)
    at org.jpmml.evaluator.ModelEvaluator.evaluateInternal(ModelEvaluator.java:446)
    at org.jpmml.evaluator.mining.MiningModelEvaluator.evaluateInternal(MiningModelEvaluator.java:224)
    at org.jpmml.evaluator.ModelEvaluator.evaluate(ModelEvaluator.java:300)
    at org.jpmml.evaluator.python.PythonUtil.evaluate(PythonUtil.java:92)
    at org.jpmml.evaluator.python.PythonUtil.evaluateAll(PythonUtil.java:58)
    at org.jpmml.evaluator.python.PythonUtil.evaluateAll(PythonUtil.java:48)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)

During handling of the above exception, another exception occurred:

JavaError                                 Traceback (most recent call last)
<ipython-input-178-5a1bd5bd787f> in <module>
----> 1 evaluator.evaluateAll(x_oot_1)

~/.local/lib/python3.7/site-packages/jpmml_evaluator/__init__.py in evaluateAll(self, arguments_df, nan_as_missing)
    129                         result_records = self.backend.staticInvoke("org.jpmml.evaluator.python.PythonUtil", "evaluateAll", self.javaEvaluator, argument_records)
    130                 except Exception as e:
--> 131                         raise self.backend.toJavaError(e)
    132                 result_records = self.backend.loads(result_records)
    133                 return DataFrame.from_records(result_records)

JavaError: org.jpmml.evaluator.TypeCheckException: Expected date value, got double value

I tried to analyze the problem myself, and it seems that the data format is wrong:

JavaError: org.jpmml.evaluator.TypeCheckException: Expected date value, got double value

However, none of the columns in my input requires a date format, and the model itself does not need one either. Predicting with the same data through the pipeline worked fine (I don't know if you still remember, but I described the detailed requirements in sklearn2pmml #356: https://github.com/jpmml/sklearn2pmml/issues/356).

I also checked my PMML file and it looks correct as well; none of the 60 required features are date columns:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<PMML xmlns="http://www.dmg.org/PMML-4_4" xmlns:data="http://jpmml.org/jpmml-model/InlineTable" version="4.4">
    <Header>
        <Application name="SkLearn2PMML package" version="0.86.0"/>
        <Timestamp>2022-10-28T06:27:47Z</Timestamp>
    </Header>
    <DataDictionary>
        <DataField name="my_single_target" optype="categorical" dataType="integer">
            <Value value="0"/>
            <Value value="1"/>
        </DataField>
        <DataField name="d3_daytime_opst_phone_called_cnt_rt" optype="continuous" dataType="double"/>
        <DataField name="m1_accu_cnt" optype="continuous" dataType="double"/>
        <DataField name="d1_under15s_opst_phone_cnt_rt" optype="continuous" dataType="double"/>
        <DataField name="d15_creditcard_call_cnt_rt" optype="continuous" dataType="double"/>
        <DataField name="d1_daytime_voice_cnt_rt" optype="continuous" dataType="double"/>
        <DataField name="d15_under15s_called_opst_mbl_cnt" optype="continuous" dataType="double"/>
        <DataField name="d3_dur_under_10s_cnt_rt" optype="continuous" dataType="double"/>
        <DataField name="d1_accu_rm_cnt_rt" optype="continuous" dataType="double"/>
        <DataField name="d3_accu_rm_called_dur" optype="continuous" dataType="double"/>
        <DataField name="d1_once_called_opst_phone_cnt_rt" optype="continuous" dataType="double"/>
        <DataField name="d3_once_called_opst_mbl_cnt" optype="continuous" dataType="double"/>
        <DataField name="d1_once_called_opst_mbl_cnt" optype="continuous" dataType="double"/>
        <DataField name="d3_called_dur_under_30s_cnt_rt" optype="continuous" dataType="double"/>
        <DataField name="d15_nighttime_mbl_innet_cnt" optype="continuous" dataType="double"/>
        <DataField name="d3_daytime_voice_called_cnt_rt" optype="continuous" dataType="double"/>
        <DataField name="d3_called_dur_under_10s_cnt_rt" optype="continuous" dataType="double"/>
        <DataField name="d3_dur_over_180s_cnt" optype="continuous" dataType="double"/>
        <DataField name="m1_accu_called_cnt" optype="continuous" dataType="double"/>
        <DataField name="d1_daytime_voice_called_cnt_rt" optype="continuous" dataType="double"/>
        <DataField name="d7_loc_called_inact_day_cnt" optype="continuous" dataType="double"/>
        <DataField name="d3_under15s_called_opst_mbl_cnt_rt" optype="continuous" dataType="double"/>
        <DataField name="m3_accu_called_cnt" optype="continuous" dataType="double"/>
        <DataField name="d1_called_dur_under_30s_cnt_rt" optype="continuous" dataType="double"/>
        <DataField name="d15_rm_inact_day_cnt" optype="continuous" dataType="double"/>
        <DataField name="d3_daytime_opst_mbl_called_cnt_rt" optype="continuous" dataType="double"/>
        <DataField name="d3_called_opst_phone_cnt_rt" optype="continuous" dataType="double"/>
        <DataField name="d3_under15s_called_opst_phone_cnt_rt" optype="continuous" dataType="double"/>
        <DataField name="m3_accu_cnt" optype="continuous" dataType="double"/>
        <DataField name="d7_under15s_called_opst_phone_cnt_rt" optype="continuous" dataType="double"/>
        <DataField name="d3_under15s_opst_phone_cnt_rt" optype="continuous" dataType="double"/>
        <DataField name="m3_accu_loc_cnt" optype="continuous" dataType="double"/>
        <DataField name="d15_called_dur_over_180s_cnt" optype="continuous" dataType="double"/>
        <DataField name="d3_accu_loc_cnt_rt" optype="continuous" dataType="double"/>
        <DataField name="d1_once_opst_phone_cnt_rt" optype="continuous" dataType="double"/>
        <DataField name="d3_daytime_opst_phone_cnt_rt" optype="continuous" dataType="double"/>
        <DataField name="d1_dur_under_30s_cnt_rt" optype="continuous" dataType="double"/>
        <DataField name="d7_under15s_called_opst_mbl_cnt_rt" optype="continuous" dataType="double"/>
        <DataField name="d1_accu_loc_cnt_rt" optype="continuous" dataType="double"/>
        <DataField name="d1_daytime_opst_phone_cnt_rt" optype="continuous" dataType="double"/>
        <DataField name="d3_under15s_opst_mbl_cnt" optype="continuous" dataType="double"/>
        <DataField name="d1_accu_called_cnt_rt" optype="continuous" dataType="double"/>
        <DataField name="d15_bankloan_call_cnt_rt" optype="continuous" dataType="double"/>
        <DataField name="d1_dur_under_10s_cnt_rt" optype="continuous" dataType="double"/>
        <DataField name="d7_once_called_opst_mbl_cnt_rt" optype="continuous" dataType="double"/>
        <DataField name="d7_dur_over_180s_cnt" optype="continuous" dataType="double"/>
        <DataField name="modify_date" optype="continuous" dataType="double"/>
        <DataField name="day_id" optype="continuous" dataType="double"/>
        <DataField name="d7_once_called_opst_mbl_cnt" optype="continuous" dataType="double"/>
        <DataField name="d3_once_opst_phone_cnt_rt" optype="continuous" dataType="double"/>
        <DataField name="d1_daytime_opst_phone_called_cnt_rt" optype="continuous" dataType="double"/>
        <DataField name="d15_mbl_innet_day" optype="continuous" dataType="double"/>
        <DataField name="d3_once_called_opst_phone_cnt_rt" optype="continuous" dataType="double"/>
        <DataField name="d15_express_call_cnt_rt" optype="continuous" dataType="double"/>
        <DataField name="d15_insurance_call_cnt_rt" optype="continuous" dataType="double"/>
        <DataField name="d1_daytime_opst_mbl_called_cnt_rt" optype="continuous" dataType="double"/>
        <DataField name="d1_accu_loc_called_dur" optype="continuous" dataType="double"/>
        <DataField name="d1_called_opst_phone_cnt_rt" optype="continuous" dataType="double"/>
        <DataField name="m3_accu_loc_called_cnt" optype="continuous" dataType="double"/>
        <DataField name="d7_called_dur_over_180s_cnt" optype="continuous" dataType="double"/>
        <DataField name="d3_nighttime_mbl_innet_cnt" optype="continuous" dataType="double"/>
        <DataField name="channel_type_cd_3" optype="categorical" dataType="string"/>
    </DataDictionary>

So I can't tell what the problem is.

The only thing I can think of is that the problem may be not in the input but in the output.

To avoid an error (similar to ), I added this one line of code:

pipeline_test.target_fields = ["my_single_target"]

I don't know whether this is the cause of the problem. Could you help me with a quick analysis?

liuhuanshuo commented 2 years ago

I have been following the new issue (https://github.com/jpmml/sklearn2pmml/issues/357).

I used pipeline_test._final_estimator.n_outputs_ = 1 instead of pipeline_test.target_fields = ["my_single_target"], as you suggested in that thread.

Then I saved the PMML again and used JPMML-Evaluator-Python to load the model for prediction.

Now, instead of the previous error, I get a different one:

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
~/.local/lib/python3.7/site-packages/jpmml_evaluator/__init__.py in evaluateAll(self, arguments_df, nan_as_missing)
    128                 try:
--> 129                         result_records = self.backend.staticInvoke("org.jpmml.evaluator.python.PythonUtil", "evaluateAll", self.javaEvaluator, argument_records)
    130                 except Exception as e:

~/.local/lib/python3.7/site-packages/jpmml_evaluator/py4j.py in staticInvoke(self, className, methodName, *args)
     24                 javaMember = javaClass.__getattr__(methodName)
---> 25                 return javaMember(*args)
     26 

~/.local/lib/python3.7/site-packages/py4j/java_gateway.py in __call__(self, *args)
   1322         return_value = get_return_value(
-> 1323             answer, self.gateway_client, self.target_id, self.name)
   1324 

~/.local/lib/python3.7/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    327                     "An error occurred while calling {0}{1}{2}.\n".
--> 328                     format(target_id, ".", name), value)
    329             else:

Py4JJavaError: An error occurred while calling z:org.jpmml.evaluator.python.PythonUtil.evaluateAll.
: java.lang.IllegalArgumentException: 2.02-20-92
    at org.jpmml.model.temporals.DateTimeUtil.parseDate(DateTimeUtil.java:21)
    at org.jpmml.evaluator.TypeUtil.parse(TypeUtil.java:90)
    at org.jpmml.evaluator.TypeUtil.parseOrCast(TypeUtil.java:66)
    at org.jpmml.evaluator.ScalarValue.<init>(ScalarValue.java:33)
    at org.jpmml.evaluator.DiscreteValue.<init>(DiscreteValue.java:30)
    at org.jpmml.evaluator.OrdinalValue.<init>(OrdinalValue.java:38)
    at org.jpmml.evaluator.OrdinalValue.create(OrdinalValue.java:122)
    at org.jpmml.evaluator.FieldValue.create(FieldValue.java:364)
    at org.jpmml.evaluator.FieldValue.cast(FieldValue.java:109)
    at org.jpmml.evaluator.ExpressionUtil.evaluateTypedExpressionContainer(ExpressionUtil.java:72)
    at org.jpmml.evaluator.ExpressionUtil.evaluate(ExpressionUtil.java:86)
    at org.jpmml.evaluator.ModelEvaluationContext.resolve(ModelEvaluationContext.java:100)
    at org.jpmml.evaluator.EvaluationContext.evaluate(EvaluationContext.java:94)
    at org.jpmml.evaluator.ExpressionUtil.evaluateFieldRef(ExpressionUtil.java:226)
    at org.jpmml.evaluator.ExpressionUtil.evaluateExpression(ExpressionUtil.java:143)
    at org.jpmml.evaluator.ExpressionUtil.evaluate(ExpressionUtil.java:129)
    at org.jpmml.evaluator.ExpressionUtil.evaluateApply(ExpressionUtil.java:405)
    at org.jpmml.evaluator.ExpressionUtil.evaluateExpression(ExpressionUtil.java:167)
    at org.jpmml.evaluator.ExpressionUtil.evaluate(ExpressionUtil.java:129)
    at org.jpmml.evaluator.ExpressionUtil.evaluateExpressionContainer(ExpressionUtil.java:61)
    at org.jpmml.evaluator.ExpressionUtil.evaluateTypedExpressionContainer(ExpressionUtil.java:66)
    at org.jpmml.evaluator.ExpressionUtil.evaluate(ExpressionUtil.java:86)
    at org.jpmml.evaluator.ModelEvaluationContext.resolve(ModelEvaluationContext.java:100)
    at org.jpmml.evaluator.EvaluationContext.evaluate(EvaluationContext.java:94)
    at org.jpmml.evaluator.ExpressionUtil.evaluateFieldRef(ExpressionUtil.java:226)
    at org.jpmml.evaluator.ExpressionUtil.evaluateExpression(ExpressionUtil.java:143)
    at org.jpmml.evaluator.ExpressionUtil.evaluate(ExpressionUtil.java:129)
    at org.jpmml.evaluator.ExpressionUtil.evaluateApply(ExpressionUtil.java:405)
    at org.jpmml.evaluator.ExpressionUtil.evaluateExpression(ExpressionUtil.java:167)
    at org.jpmml.evaluator.ExpressionUtil.evaluate(ExpressionUtil.java:129)
    at org.jpmml.evaluator.ExpressionUtil.evaluateApply(ExpressionUtil.java:345)
    at org.jpmml.evaluator.ExpressionUtil.evaluateExpression(ExpressionUtil.java:167)
    at org.jpmml.evaluator.ExpressionUtil.evaluate(ExpressionUtil.java:129)
    at org.jpmml.evaluator.ExpressionUtil.evaluateExpressionContainer(ExpressionUtil.java:61)
    at org.jpmml.evaluator.ExpressionUtil.evaluateTypedExpressionContainer(ExpressionUtil.java:66)
    at org.jpmml.evaluator.ExpressionUtil.evaluate(ExpressionUtil.java:86)
    at org.jpmml.evaluator.ModelEvaluationContext.resolve(ModelEvaluationContext.java:100)
    at org.jpmml.evaluator.EvaluationContext.evaluate(EvaluationContext.java:94)
    at org.jpmml.evaluator.ModelEvaluationContext.resolve(ModelEvaluationContext.java:142)
    at org.jpmml.evaluator.EvaluationContext.evaluate(EvaluationContext.java:94)
    at org.jpmml.evaluator.PredicateUtil.evaluateSimplePredicate(PredicateUtil.java:101)
    at org.jpmml.evaluator.PredicateUtil.evaluatePredicate(PredicateUtil.java:73)
    at org.jpmml.evaluator.PredicateUtil.evaluate(PredicateUtil.java:63)
    at org.jpmml.evaluator.PredicateUtil.evaluatePredicateContainer(PredicateUtil.java:53)
    at org.jpmml.evaluator.tree.SimpleTreeModelEvaluator.evaluateTree(SimpleTreeModelEvaluator.java:122)
    at org.jpmml.evaluator.tree.SimpleTreeModelEvaluator.evaluateAny(SimpleTreeModelEvaluator.java:90)
    at org.jpmml.evaluator.tree.SimpleTreeModelEvaluator.evaluateRegression(SimpleTreeModelEvaluator.java:77)
    at org.jpmml.evaluator.ModelEvaluator.evaluateInternal(ModelEvaluator.java:443)
    at org.jpmml.evaluator.mining.MiningModelEvaluator.evaluateSegmentation(MiningModelEvaluator.java:595)
    at org.jpmml.evaluator.mining.MiningModelEvaluator.evaluateRegression(MiningModelEvaluator.java:231)
    at org.jpmml.evaluator.ModelEvaluator.evaluateInternal(ModelEvaluator.java:443)
    at org.jpmml.evaluator.mining.MiningModelEvaluator.evaluateInternal(MiningModelEvaluator.java:224)
    at org.jpmml.evaluator.mining.MiningModelEvaluator.evaluateSegmentation(MiningModelEvaluator.java:595)
    at org.jpmml.evaluator.mining.MiningModelEvaluator.evaluateClassification(MiningModelEvaluator.java:303)
    at org.jpmml.evaluator.ModelEvaluator.evaluateInternal(ModelEvaluator.java:446)
    at org.jpmml.evaluator.mining.MiningModelEvaluator.evaluateInternal(MiningModelEvaluator.java:224)
    at org.jpmml.evaluator.ModelEvaluator.evaluate(ModelEvaluator.java:300)
    at org.jpmml.evaluator.python.PythonUtil.evaluate(PythonUtil.java:92)
    at org.jpmml.evaluator.python.PythonUtil.evaluateAll(PythonUtil.java:58)
    at org.jpmml.evaluator.python.PythonUtil.evaluateAll(PythonUtil.java:48)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.time.format.DateTimeParseException: Text '2.02-20-92' could not be parsed at index 0
    at java.time.format.DateTimeFormatter.parseResolved0(DateTimeFormatter.java:1949)
    at java.time.format.DateTimeFormatter.parse(DateTimeFormatter.java:1851)
    at java.time.LocalDate.parse(LocalDate.java:400)
    at java.time.LocalDate.parse(LocalDate.java:385)
    at org.jpmml.model.temporals.Date.parse(Date.java:86)
    at org.jpmml.model.temporals.DateTimeUtil.parseDate(DateTimeUtil.java:19)
    ... 70 more

During handling of the above exception, another exception occurred:

JavaError                                 Traceback (most recent call last)
<ipython-input-201-5a1bd5bd787f> in <module>
----> 1 evaluator.evaluateAll(x_oot_1)

~/.local/lib/python3.7/site-packages/jpmml_evaluator/__init__.py in evaluateAll(self, arguments_df, nan_as_missing)
    129                         result_records = self.backend.staticInvoke("org.jpmml.evaluator.python.PythonUtil", "evaluateAll", self.javaEvaluator, argument_records)
    130                 except Exception as e:
--> 131                         raise self.backend.toJavaError(e)
    132                 result_records = self.backend.loads(result_records)
    133                 return DataFrame.from_records(result_records)

JavaError: java.lang.IllegalArgumentException: 2.02-20-92

I think I roughly understand what this error means: presumably there is a problem with the string conversion? Could something be wrong with the following code? I ask because 2.02-20-92 looks like 2022092x.

def make_modify_date_pipeline():
    return make_pipeline(ExpressionTransformer("X[0][:4] + '-' + X[0][4:6] + '-' + X[0][6:8] if len(X[0]) > 0 and X[0][0:8] < '20221230' else '2022-12-30'"), CastTransformer(dtype = "datetime64[D]"), DaysSinceYearTransformer(year = 2022))

def make_day_id_pipeline():
    return make_pipeline(ExpressionTransformer("X[1][:4] + '-'+ X[1][4:6] + '-' + X[1][6:8]"), CastTransformer(dtype = "datetime64[D]"), DaysSinceYearTransformer(year = 2022))

def make_feature_union():
    return FeatureUnion([
        ("modify_date", make_modify_date_pipeline()),
        ("day_id", make_day_id_pipeline())])

But I need to emphasize that the custom functions above work well in the pipeline; my pipeline is completely correct and predicts the correct results.

It seems we are back to the previous problem: "My pipeline works fine, but once I convert it to a PMML file, it doesn't work!"

So I don't know whether this is a problem with sklearn2pmml or with JPMML-Evaluator-Python. Could you please help me look into it?

vruusmann commented 2 years ago

Your PMML declares that all 60 input fields are of double data type. The problem is that there is no implicit (i.e. automatic) conversion possible from double value space to date (or datetime) value space.

You have to re-declare the relevant input fields so that implicit value conversion would be possible. Alternatively, you may implement custom conversion using some DerivedField element-based business logic.
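
To illustrate what "no implicit conversion" means here, the following is a minimal Python sketch, not the JPMML-Evaluator implementation. The helper name to_date_strict is hypothetical; it accepts date objects and ISO-formatted strings and rejects floats, loosely mirroring the "Expected date value, got double value" check from the stack trace above.

```python
from datetime import date

def to_date_strict(value):
    """Hypothetical strict converter: no silent float-to-date casts."""
    if isinstance(value, date):
        return value
    if isinstance(value, str):
        # Only ISO "YYYY-MM-DD" text is accepted; anything else raises ValueError
        return date.fromisoformat(value)
    raise TypeError("Expected date value, got {0} value".format(type(value).__name__))

# A well-formed date string converts cleanly
print(to_date_strict("2022-09-29"))

# A float is rejected instead of being guessed into a date
try:
    to_date_strict(20220929.0)
except TypeError as e:
    print(e)
```

A strict evaluator behaves like the second call: it refuses the value rather than guessing how a floating-point number should map onto a calendar date.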

vruusmann commented 2 years ago

TLDR: You cannot represent (prospective-) date (or datetime) values as Python's float or numpy.float64 values. You should convert them to int or numpy.int64 values!

JavaError: java.lang.IllegalArgumentException: 2.02-20-92

Your input values are something like 20221031. If you store this value as double, it becomes 2.0221031E7.

Do you now see where this 2.02 prefix came from?
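
The arithmetic behind that prefix can be reproduced in a few lines. This is a sketch under two assumptions: that the double is rendered as text in scientific notation (for example 2.0220929E7 for an input of 20220929, chosen here because it reproduces the reported error text), and that the text is then sliced in the same way as in the ExpressionTransformer expressions shown earlier. The helper name slice_as_date is made up for illustration.

```python
def slice_as_date(text):
    # Same slicing as the pipeline expression: X[:4] + '-' + X[4:6] + '-' + X[6:8]
    return text[:4] + "-" + text[4:6] + "-" + text[6:8]

# Assumed text form of 20220929 after it has been stored as a double
double_text = "2.0220929E7"
# Text form of the same value when it is kept as an integer
int_text = str(int(20220929.0))

print(slice_as_date(double_text))  # 2.02-20-92
print(slice_as_date(int_text))     # 2022-09-29
```

The first result matches the text in the IllegalArgumentException, while the integer form yields a string that a date parser can accept.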

vruusmann commented 2 years ago

I have been following the new issue (https://github.com/jpmml/sklearn2pmml/issues/357).

No, this issue is totally unrelated to that.

Your pipeline works in Python, because Python performs very liberal type casts. Your pipeline would not work in any strict and statically typed programming language (such as PMML), because the necessary type casts could possibly add or remove precision pretty much randomly.

In other words, this is legal in Python, but not in other languages:

# A float magically becomes a date, WTAF?
day_id = asdate(2.0221031E7)

The SkLearn2PMML package provides so-called domain decorator classes (inside the sklearn2pmml.decoration module) for pre-declaring input field type information.

The following might help:

mapper = DataFrameMapper([
    # THIS: First specify 'modify_date', then specify 'day_id'
    (['modify_date','day_id'], [MultiDomain(ContinuousDomain(dtype = numpy.int64), DateDomain())]),
    (['modify_date','day_id'], [make_feature_union(), ExpressionTransformer("X[1] - X[0]")])
])

vruusmann commented 2 years ago

In other words, this is legal in Python, but not in other languages:

day_id = asdate(2.0221031E7)

In other words, Python is like Microsoft Excel, which auto-converts everything into a date/datetime.

vruusmann commented 2 years ago

The following might help:

You should actually combine these two lines into one:

mapper = DataFrameMapper([
    (['modify_date','day_id'], [MultiDomain(ContinuousDomain(dtype = numpy.int64), DateDomain()), make_feature_union(), ExpressionTransformer("X[1] - X[0]")]),
])

liuhuanshuo commented 2 years ago

Your pipeline works in Python, because Python performs very liberal type casts. Your pipeline would not work in any strict and statically typed programming language (such as PMML), because the necessary type casts could possibly add or remove precision pretty much randomly.

Thank you very much for your answer. I think I understand the reason now (although I am not yet clear on how to solve it).

As an algorithm engineer, I haven't paid much attention to these underlying data type issues. I learned a lot from your reply.

I have heard a lot about Python's dynamic typing, and about how not specifying types can be a disaster; I think this might be such a case.

The following might help:

I will handle this as you suggested. It seems that all columns with values like '20200909' need a type declaration?

Anyway, I'm going to try it for myself first!

liuhuanshuo commented 2 years ago

(['modify_date','day_id'], [MultiDomain(ContinuousDomain(dtype = numpy.int64), DateDomain())])

I modified the code as follows. Unfortunately, now even the pipeline itself no longer works:

(['modify_date','day_id'],[MultiDomain(ContinuousDomain(dtype = np.int64)), DateDomain(), make_feature_union(), ExpressionTransformer("X[1] - X[0] if X[1]>X[0] else -1")]),

def make_modify_date_pipeline():
    return make_pipeline(ExpressionTransformer("X[0][:4] + '-' + X[0][4:6] + '-' + X[0][6:8] if len(X[0]) > 0 and X[0][0:8] < '20221230' else '2022-12-30'"), CastTransformer(dtype = "datetime64[D]"), DaysSinceYearTransformer(year = 2022))

def make_day_id_pipeline():
    return make_pipeline(ExpressionTransformer("X[1][:4] + '-'+ X[1][4:6] + '-' + X[1][6:8]"), CastTransformer(dtype = "datetime64[D]"), DaysSinceYearTransformer(year = 2022))

def make_feature_union():
    return FeatureUnion([
        ("modify_date", make_modify_date_pipeline()),
        ("day_id", make_day_id_pipeline())])

Here is the error traceback:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-238-2b9604b4aa7d> in <module>
----> 1 pipeline_test.predict_proba(x_oot_1)

~/.local/lib/python3.7/site-packages/sklearn2pmml/pipeline/__init__.py in predict_proba(self, X, **predict_proba_params)
     82 
     83         def predict_proba(self, X, **predict_proba_params):
---> 84                 Xt = self._transform(X)
     85                 return self.steps[-1][-1].predict_proba(Xt, **predict_proba_params)
     86 

~/.local/lib/python3.7/site-packages/sklearn2pmml/pipeline/__init__.py in _transform(self, X)
     74                 if hasattr(self, "_iter"):
     75                         for _, name, transform in self._iter(with_final = False):
---> 76                                 Xt = transform.transform(Xt)
     77                 else:
     78                         for name, transform in self.steps[:-1]:

~/.local/lib/python3.7/site-packages/sklearn_pandas/dataframe_mapper.py in transform(self, X)
    217             Xt = self._get_col_subset(X, columns)
    218             if transformers is not None:
--> 219                 Xt = transformers.transform(Xt)
    220             extracted.append(_handle_feature(Xt))
    221 

~/.local/lib/python3.7/site-packages/sklearn/pipeline.py in _transform(self, X)
    553         Xt = X
    554         for _, _, transform in self._iter():
--> 555             Xt = transform.transform(Xt)
    556         return Xt
    557 

~/.local/lib/python3.7/site-packages/sklearn2pmml/decoration/__init__.py in transform(self, X)
    299         def transform(self, X):
    300                 rows, columns = X.shape
--> 301                 if len(self.domains) != columns:
    302                         raise ValueError("The number of columns {0} is not equal to the number of domain objects {1}".format(columns, len(self.domains)))
    303                 if isinstance(X, DataFrame):

TypeError: object of type 'ContinuousDomain' has no len()
liuhuanshuo commented 2 years ago

The following might help:

You should actually combine these two lines into one:

mapper = DataFrameMapper([
  (['modify_date','day_id'], [MultiDomain(ContinuousDomain(dtype = numpy.int64), DateDomain()), make_feature_union(), ExpressionTransformer("X[1] - X[0]")]),
])

I tried to insert code in various places, but nothing worked.

MultiDomain(ContinuousDomain(dtype = np.int64)), DateDomain(), 

The following error is always displayed

TypeError: object of type 'ContinuousDomain' has no len()
liuhuanshuo commented 2 years ago

I think I may have found the problem.

In my opinion, modify_date and day_id should not be converted to np.int, but to string format because these two columns will be split in the function

I don't know what transformer would convert these two columns to string format, though.

But I tried to format these two columns in the pmml file in the same format as the other columns

<DataField name="modify_date" optype="categorical" dataType="string"/>
<DataField name="day_id" optype="categorical" dataType="string"/>

Now, loading the PMML file for prediction raises a different error:

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
~/.local/lib/python3.7/site-packages/jpmml_evaluator/__init__.py in evaluateAll(self, arguments_df, nan_as_missing)
    128                 try:
--> 129                         result_records = self.backend.staticInvoke("org.jpmml.evaluator.python.PythonUtil", "evaluateAll", self.javaEvaluator, argument_records)
    130                 except Exception as e:

~/.local/lib/python3.7/site-packages/jpmml_evaluator/py4j.py in staticInvoke(self, className, methodName, *args)
     24                 javaMember = javaClass.__getattr__(methodName)
---> 25                 return javaMember(*args)
     26 

~/.local/lib/python3.7/site-packages/py4j/java_gateway.py in __call__(self, *args)
   1322         return_value = get_return_value(
-> 1323             answer, self.gateway_client, self.target_id, self.name)
   1324 

~/.local/lib/python3.7/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    327                     "An error occurred while calling {0}{1}{2}.\n".
--> 328                     format(target_id, ".", name), value)
    329             else:

Py4JJavaError: An error occurred while calling z:org.jpmml.evaluator.python.PythonUtil.evaluateAll.
: org.jpmml.evaluator.EvaluationException: Categorical value cannot be used in comparison operations
    at org.jpmml.evaluator.CategoricalValue.compareToValue(CategoricalValue.java:47)
    at org.jpmml.evaluator.functions.ComparisonFunction.evaluate(ComparisonFunction.java:37)
    at org.jpmml.evaluator.functions.BinaryFunction.evaluate(BinaryFunction.java:43)
    at org.jpmml.evaluator.ExpressionUtil.evaluateFunction(ExpressionUtil.java:463)
    at org.jpmml.evaluator.ExpressionUtil.evaluateApply(ExpressionUtil.java:426)
    at org.jpmml.evaluator.ExpressionUtil.evaluateExpression(ExpressionUtil.java:167)
    at org.jpmml.evaluator.ExpressionUtil.evaluate(ExpressionUtil.java:129)
    at org.jpmml.evaluator.ExpressionUtil.evaluateApply(ExpressionUtil.java:405)
    at org.jpmml.evaluator.ExpressionUtil.evaluateExpression(ExpressionUtil.java:167)
    at org.jpmml.evaluator.ExpressionUtil.evaluate(ExpressionUtil.java:129)
    at org.jpmml.evaluator.ExpressionUtil.evaluateApply(ExpressionUtil.java:345)
    at org.jpmml.evaluator.ExpressionUtil.evaluateExpression(ExpressionUtil.java:167)
    at org.jpmml.evaluator.ExpressionUtil.evaluate(ExpressionUtil.java:129)
    at org.jpmml.evaluator.ExpressionUtil.evaluateExpressionContainer(ExpressionUtil.java:61)
    at org.jpmml.evaluator.ExpressionUtil.evaluateTypedExpressionContainer(ExpressionUtil.java:66)
    at org.jpmml.evaluator.ExpressionUtil.evaluate(ExpressionUtil.java:86)
    at org.jpmml.evaluator.ModelEvaluationContext.resolve(ModelEvaluationContext.java:100)
    at org.jpmml.evaluator.EvaluationContext.evaluate(EvaluationContext.java:94)
    at org.jpmml.evaluator.ExpressionUtil.evaluateFieldRef(ExpressionUtil.java:226)
    at org.jpmml.evaluator.ExpressionUtil.evaluateExpression(ExpressionUtil.java:143)
    at org.jpmml.evaluator.ExpressionUtil.evaluate(ExpressionUtil.java:129)
    at org.jpmml.evaluator.ExpressionUtil.evaluateExpressionContainer(ExpressionUtil.java:61)
    at org.jpmml.evaluator.ExpressionUtil.evaluateTypedExpressionContainer(ExpressionUtil.java:66)
    at org.jpmml.evaluator.ExpressionUtil.evaluate(ExpressionUtil.java:86)
    at org.jpmml.evaluator.ModelEvaluationContext.resolve(ModelEvaluationContext.java:100)
    at org.jpmml.evaluator.EvaluationContext.evaluate(EvaluationContext.java:94)
    at org.jpmml.evaluator.ExpressionUtil.evaluateFieldRef(ExpressionUtil.java:226)
    at org.jpmml.evaluator.ExpressionUtil.evaluateExpression(ExpressionUtil.java:143)
    at org.jpmml.evaluator.ExpressionUtil.evaluate(ExpressionUtil.java:129)
    at org.jpmml.evaluator.ExpressionUtil.evaluateApply(ExpressionUtil.java:405)
    at org.jpmml.evaluator.ExpressionUtil.evaluateExpression(ExpressionUtil.java:167)
    at org.jpmml.evaluator.ExpressionUtil.evaluate(ExpressionUtil.java:129)
    at org.jpmml.evaluator.ExpressionUtil.evaluateExpressionContainer(ExpressionUtil.java:61)
    at org.jpmml.evaluator.ExpressionUtil.evaluateTypedExpressionContainer(ExpressionUtil.java:66)
    at org.jpmml.evaluator.ExpressionUtil.evaluate(ExpressionUtil.java:86)
    at org.jpmml.evaluator.ModelEvaluationContext.resolve(ModelEvaluationContext.java:100)
    at org.jpmml.evaluator.EvaluationContext.evaluate(EvaluationContext.java:94)
    at org.jpmml.evaluator.ExpressionUtil.evaluateFieldRef(ExpressionUtil.java:226)
    at org.jpmml.evaluator.ExpressionUtil.evaluateExpression(ExpressionUtil.java:143)
    at org.jpmml.evaluator.ExpressionUtil.evaluate(ExpressionUtil.java:129)
    at org.jpmml.evaluator.ExpressionUtil.evaluateApply(ExpressionUtil.java:405)
    at org.jpmml.evaluator.ExpressionUtil.evaluateExpression(ExpressionUtil.java:167)
    at org.jpmml.evaluator.ExpressionUtil.evaluate(ExpressionUtil.java:129)
    at org.jpmml.evaluator.ExpressionUtil.evaluateApply(ExpressionUtil.java:345)
    at org.jpmml.evaluator.ExpressionUtil.evaluateExpression(ExpressionUtil.java:167)
    at org.jpmml.evaluator.ExpressionUtil.evaluate(ExpressionUtil.java:129)
    at org.jpmml.evaluator.ExpressionUtil.evaluateExpressionContainer(ExpressionUtil.java:61)
    at org.jpmml.evaluator.ExpressionUtil.evaluateTypedExpressionContainer(ExpressionUtil.java:66)
    at org.jpmml.evaluator.ExpressionUtil.evaluate(ExpressionUtil.java:86)
    at org.jpmml.evaluator.ModelEvaluationContext.resolve(ModelEvaluationContext.java:100)
    at org.jpmml.evaluator.EvaluationContext.evaluate(EvaluationContext.java:94)
    at org.jpmml.evaluator.ModelEvaluationContext.resolve(ModelEvaluationContext.java:142)
    at org.jpmml.evaluator.EvaluationContext.evaluate(EvaluationContext.java:94)
    at org.jpmml.evaluator.PredicateUtil.evaluateSimplePredicate(PredicateUtil.java:101)
    at org.jpmml.evaluator.PredicateUtil.evaluatePredicate(PredicateUtil.java:73)
    at org.jpmml.evaluator.PredicateUtil.evaluate(PredicateUtil.java:63)
    at org.jpmml.evaluator.PredicateUtil.evaluatePredicateContainer(PredicateUtil.java:53)
    at org.jpmml.evaluator.tree.SimpleTreeModelEvaluator.evaluateTree(SimpleTreeModelEvaluator.java:122)
    at org.jpmml.evaluator.tree.SimpleTreeModelEvaluator.evaluateAny(SimpleTreeModelEvaluator.java:90)
    at org.jpmml.evaluator.tree.SimpleTreeModelEvaluator.evaluateRegression(SimpleTreeModelEvaluator.java:77)
    at org.jpmml.evaluator.ModelEvaluator.evaluateInternal(ModelEvaluator.java:443)
    at org.jpmml.evaluator.mining.MiningModelEvaluator.evaluateSegmentation(MiningModelEvaluator.java:595)
    at org.jpmml.evaluator.mining.MiningModelEvaluator.evaluateRegression(MiningModelEvaluator.java:231)
    at org.jpmml.evaluator.ModelEvaluator.evaluateInternal(ModelEvaluator.java:443)
    at org.jpmml.evaluator.mining.MiningModelEvaluator.evaluateInternal(MiningModelEvaluator.java:224)
    at org.jpmml.evaluator.mining.MiningModelEvaluator.evaluateSegmentation(MiningModelEvaluator.java:595)
    at org.jpmml.evaluator.mining.MiningModelEvaluator.evaluateClassification(MiningModelEvaluator.java:303)
    at org.jpmml.evaluator.ModelEvaluator.evaluateInternal(ModelEvaluator.java:446)
    at org.jpmml.evaluator.mining.MiningModelEvaluator.evaluateInternal(MiningModelEvaluator.java:224)
    at org.jpmml.evaluator.ModelEvaluator.evaluate(ModelEvaluator.java:300)
    at org.jpmml.evaluator.python.PythonUtil.evaluate(PythonUtil.java:92)
    at org.jpmml.evaluator.python.PythonUtil.evaluateAll(PythonUtil.java:58)
    at org.jpmml.evaluator.python.PythonUtil.evaluateAll(PythonUtil.java:48)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)

During handling of the above exception, another exception occurred:

JavaError                                 Traceback (most recent call last)
<ipython-input-292-5a1bd5bd787f> in <module>
----> 1 evaluator.evaluateAll(x_oot_1)

~/.local/lib/python3.7/site-packages/jpmml_evaluator/__init__.py in evaluateAll(self, arguments_df, nan_as_missing)
    129                         result_records = self.backend.staticInvoke("org.jpmml.evaluator.python.PythonUtil", "evaluateAll", self.javaEvaluator, argument_records)
    130                 except Exception as e:
--> 131                         raise self.backend.toJavaError(e)
    132                 result_records = self.backend.loads(result_records)
    133                 return DataFrame.from_records(result_records)

JavaError: org.jpmml.evaluator.EvaluationException: Categorical value cannot be used in comparison operations

Could you tell me how to do it?

I want to repeat my requirements again.

The modify_date and day_id inputs are both like '20220909' and '20220101'

I just need to calculate their time difference (modify_date-day_id)

Of course, there are some other restrictions on modify_date: it cannot be empty and it cannot be greater than 20221231, which is why the following conditional exists:

if len(X[0]) > 0 and X[0][0:8] < '20221230' else '2022-12-30'

I really need your help!

vruusmann commented 2 years ago

The following error is always displayed

TypeError: object of type 'ContinuousDomain' has no len()

The MultiDomain constructor expects a Python list of child decorators: https://github.com/jpmml/sklearn2pmml/blob/0.87.0/sklearn2pmml/decoration/__init__.py#L288-L289

So, the correct syntax would be like this (one child decorator per column - one for modify_date and another for day_id):

decorator = MultiDomain([ContinuousDomain(), DateDomain()])
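
To illustrate why the traceback complains about len(), here is a minimal sketch built on stand-in classes. FakeDomain and FakeMultiDomain are hypothetical names that only mimic the length check quoted in the traceback; they are not the sklearn2pmml implementation.

```python
class FakeDomain:
    """Hypothetical stand-in for a single-column decorator."""


class FakeMultiDomain:
    """Hypothetical stand-in that stores its argument unchanged."""

    def __init__(self, domains):
        self.domains = domains

    def check(self, n_columns):
        # Mimics the `len(self.domains) != columns` check from the traceback
        if len(self.domains) != n_columns:
            raise ValueError("column count does not match domain count")
        return True


# A bare decorator object has no length, so the check itself fails
try:
    FakeMultiDomain(FakeDomain()).check(1)
    bare_error = None
except TypeError as exc:
    bare_error = str(exc)

# A list with one decorator per column passes the check
list_ok = FakeMultiDomain([FakeDomain(), FakeDomain()]).check(2)
```

In this sketch the first call fails with the same kind of TypeError, while the list form passes.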
vruusmann commented 2 years ago

org.jpmml.evaluator.EvaluationException: Categorical value cannot be used in comparison operations

We've discussed this situation before - comparing one string with another using comparison operators like <, <=, >= and > does not make sense:

my_date = "20221031"

if my_date < "20221101":
  print("Date is earlier than 1st of November, 2022")

I remember commenting that I would expect to see a type check error being thrown... I can't find my comment, but this is exactly the kind of exception that I was hoping to see.
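
For reference, plain Python does evaluate such a comparison, but it compares the strings character by character. That agrees with chronological order only while every value is a zero-padded, fixed-width YYYYMMDD string; an empty string, for example, sorts before every date. A minimal sketch that parses first and then compares real dates (parse_yyyymmdd is a hypothetical helper, not part of any library mentioned here):

```python
from datetime import date, datetime


def parse_yyyymmdd(text):
    """Parse the first 8 characters of a YYYYMMDD... string into a date."""
    return datetime.strptime(text[:8], "%Y%m%d").date()


my_date = "20221031"

# Comparing date objects expresses the intent explicitly
is_earlier = parse_yyyymmdd(my_date) < date(2022, 11, 1)

# Lexicographic comparison: the empty string sorts before any date string
empty_sorts_first = "" < "20221101"
```

Here is_earlier is True because the comparison is chronological, whereas empty_sorts_first is True only as a side effect of string ordering.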

vruusmann commented 2 years ago

The modify_date and day_id inputs are both like '20220909' and '20220101'

They are both strings that match the pattern "YYYYMMDD". You need to reformat them to the ISO 8601 date pattern, which is YYYY-MM-DD.

We can use ExpressionTransformer for this:

string_reformatter = ExpressionTransformer("X[0][:4] + '-' + X[0][4:6] + '-' + X[0][6:8]")

However, it is possible that modify_date is either an empty string, or a date string that is greater than some "upper limit" date.

When working with strings, you can only implement the first part of the above clause (ie. string is empty/not empty). You cannot do the second part, because the comparison operator <= does not work with strings.

modify_date_reformatter = ExpressionTransformer("X[0][:4] + '-' + X[0][4:6] + '-' + X[0][6:8] if len(X[0]) > 0 else '2022-12-30'")
day_id_reformatter = ExpressionTransformer("X[0][:4] + '-' + X[0][4:6] + '-' + X[0][6:8]")

After reformatting, you can cast them to date data type using CastTransformer(dtype = "datetime64[D]"), and then transform them to a numeric value "number of days since some reference date" using DaysSinceYearTransformer(year = 2022).

The final exercise is about sanitizing modify_date values that are "in the future". This is very simple, because your threshold date is 2022-12-30, which is 363 days after 2022-01-01. In other words, any pre-processed modify_date value that is greater than 363 should be capped at 363.

Doing the final arithmetic:

days_difference = ExpressionTransformer("(X[1] - X[0]) if X[0] <= 363 else (X[1] - 363)")

Can probably be rearranged into the following (using the element-wise numpy.minimum, because numpy.min would treat the second argument as an axis):

days_difference = ExpressionTransformer("X[1] - numpy.minimum(X[0], 363)")
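
The day offsets used in this arithmetic can be double-checked with the standard library. This is only a sketch: days_since_year and days_difference are hypothetical helpers, and it assumes that "days since year" means the number of days elapsed since 1 January of the reference year, so that 1 January itself is day 0.

```python
from datetime import date


def days_since_year(d, year=2022):
    """Days elapsed between 1 January of `year` and `d` (assumed definition)."""
    return (d - date(year, 1, 1)).days


def days_difference(modify_days, day_id_days, cap):
    """day_id minus modify_date, with modify_date capped at `cap` days."""
    return day_id_days - min(modify_days, cap)


cap = days_since_year(date(2022, 12, 30))
print(cap)                                  # 363
print(days_since_year(date(2022, 12, 31)))  # 364

modify_days = days_since_year(date(2022, 6, 26))
day_id_days = days_since_year(date(2022, 7, 14))
print(days_difference(modify_days, day_id_days, cap))  # 18
```

Under that assumption the threshold date 2022-12-30 corresponds to 363, so the cap constant in the expression is worth verifying against the transformer's actual output before relying on it.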
liuhuanshuo commented 2 years ago

I remember commenting that I would expect to see a type check error being thrown... I can't find my comment, but this is exactly the kind of exception that I was hoping to see.

Thank you very much. I think I understand exactly what you mean

I used the code you provided recently and it works very well on part of the dataset, thank you very much

However, it also reports an error when a date lies too far in the future.

Let me get straight to the point and model the following data

def make_modify_date_pipeline():
    return make_pipeline(ExpressionTransformer("X[0][:4] + '-' + X[0][4:6] + '-' + X[0][6:8] if len(X[0]) > 0 else '2022-12-30'"), CastTransformer(dtype = "datetime64[D]"), DaysSinceYearTransformer(year = 2022))

def make_day_id_pipeline():
    return make_pipeline(ExpressionTransformer("X[1][:4] + '-' + X[1][4:6] + '-' + X[1][6:8]"), CastTransformer(dtype = "datetime64[D]"), DaysSinceYearTransformer(year = 2022))

def make_feature_union():
    return FeatureUnion([
        ("modify_date", make_modify_date_pipeline()),
        ("day_id", make_day_id_pipeline())])

mapper_encode = [(['modify_date','day_id'],[make_feature_union(), ExpressionTransformer("(X[1] - X[0]) if (X[0] <= 365 and X[1]>X[0])  else -1")],{'alias':'modify_days'})]

mapper = DataFrameMapper(mapper_encode, input_df=True,df_out=True)

data_test = pd.DataFrame({
    'modify_date':['20220626223702','20220629204300','20220602000000'],
    'day_id':['20220714','20220715','20220914']
})

Now, with mapper on data_test, it works fine

mapper.fit_transform(data_test)

    modify_days
0   18
1   16
2   104

However, if you change a day_id to 2999, you will get an error

data_test_new = pd.DataFrame({
    'modify_date':['20220626223702','20220629204300','20220602000000'],
    'day_id':['20220714','29991231','20221231']
})

mapper.fit_transform(data_test_new)

Here is the error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/data1/anaconda3/lib/python3.7/site-packages/pandas/core/arrays/datetimes.py in objects_to_datetime64ns(data, dayfirst, yearfirst, utc, errors, require_iso8601, allow_object)
   1978         try:
-> 1979             values, tz_parsed = conversion.datetime_to_datetime64(data)
   1980             # If tzaware, these values represent unix timestamps, so we

pandas/_libs/tslibs/conversion.pyx in pandas._libs.tslibs.conversion.datetime_to_datetime64()

TypeError: Unrecognized value type: <class 'str'>

During handling of the above exception, another exception occurred:

OutOfBoundsDatetime                       Traceback (most recent call last)
<ipython-input-386-02535a430e61> in <module>
----> 1 mapper.fit_transform(data_test_new)

~/.local/lib/python3.7/site-packages/sklearn/base.py in fit_transform(self, X, y, **fit_params)
    569         if y is None:
    570             # fit method of arity 1 (unsupervised transformation)
--> 571             return self.fit(X, **fit_params).transform(X)
    572         else:
    573             # fit method of arity 2 (supervised transformation)

~/.local/lib/python3.7/site-packages/sklearn_pandas/dataframe_mapper.py in fit(self, X, y)
    167             if transformers is not None:
    168                 _call_fit(transformers.fit,
--> 169                           self._get_col_subset(X, columns), y)
    170 
    171         # handle features not explicitly selected

~/.local/lib/python3.7/site-packages/sklearn_pandas/pipeline.py in _call_fit(fit_method, X, y, **kwargs)
     22     """
     23     try:
---> 24         return fit_method(X, y, **kwargs)
     25     except TypeError:
     26         # fit takes only one argument

~/.local/lib/python3.7/site-packages/sklearn_pandas/pipeline.py in fit(self, X, y, **fit_params)
     74 
     75     def fit(self, X, y=None, **fit_params):
---> 76         Xt, fit_params = self._pre_transform(X, y, **fit_params)
     77         _call_fit(self.steps[-1][-1].fit, Xt, y, **fit_params)
     78         return self

~/.local/lib/python3.7/site-packages/sklearn_pandas/pipeline.py in _pre_transform(self, X, y, **fit_params)
     67             if hasattr(transform, "fit_transform"):
     68                 Xt = _call_fit(transform.fit_transform,
---> 69                                Xt, y, **fit_params_steps[name])
     70             else:
     71                 Xt = _call_fit(transform.fit,

~/.local/lib/python3.7/site-packages/sklearn_pandas/pipeline.py in _call_fit(fit_method, X, y, **kwargs)
     22     """
     23     try:
---> 24         return fit_method(X, y, **kwargs)
     25     except TypeError:
     26         # fit takes only one argument

~/.local/lib/python3.7/site-packages/sklearn/pipeline.py in fit_transform(self, X, y, **fit_params)
    932             sum of n_components (output dimension) over transformers.
    933         """
--> 934         results = self._parallel_func(X, y, fit_params, _fit_transform_one)
    935         if not results:
    936             # All transformers are None

~/.local/lib/python3.7/site-packages/sklearn/pipeline.py in _parallel_func(self, X, y, fit_params, func)
    962             message=self._log_message(name, idx, len(transformers)),
    963             **fit_params) for idx, (name, transformer,
--> 964                                     weight) in enumerate(transformers, 1))
    965 
    966     def transform(self, X):

/data1/anaconda3/lib/python3.7/site-packages/joblib/parallel.py in __call__(self, iterable)
    922                 self._iterating = self._original_iterator is not None
    923 
--> 924             while self.dispatch_one_batch(iterator):
    925                 pass
    926 

/data1/anaconda3/lib/python3.7/site-packages/joblib/parallel.py in dispatch_one_batch(self, iterator)
    757                 return False
    758             else:
--> 759                 self._dispatch(tasks)
    760                 return True
    761 

/data1/anaconda3/lib/python3.7/site-packages/joblib/parallel.py in _dispatch(self, batch)
    714         with self._lock:
    715             job_idx = len(self._jobs)
--> 716             job = self._backend.apply_async(batch, callback=cb)
    717             # A job can complete so quickly than its callback is
    718             # called before we get here, causing self._jobs to

/data1/anaconda3/lib/python3.7/site-packages/joblib/_parallel_backends.py in apply_async(self, func, callback)
    180     def apply_async(self, func, callback=None):
    181         """Schedule a func to be run"""
--> 182         result = ImmediateResult(func)
    183         if callback:
    184             callback(result)

/data1/anaconda3/lib/python3.7/site-packages/joblib/_parallel_backends.py in __init__(self, batch)
    547         # Don't delay the application, to avoid keeping the input
    548         # arguments in memory
--> 549         self.results = batch()
    550 
    551     def get(self):

/data1/anaconda3/lib/python3.7/site-packages/joblib/parallel.py in __call__(self)
    223         with parallel_backend(self._backend, n_jobs=self._n_jobs):
    224             return [func(*args, **kwargs)
--> 225                     for func, args, kwargs in self.items]
    226 
    227     def __len__(self):

/data1/anaconda3/lib/python3.7/site-packages/joblib/parallel.py in <listcomp>(.0)
    223         with parallel_backend(self._backend, n_jobs=self._n_jobs):
    224             return [func(*args, **kwargs)
--> 225                     for func, args, kwargs in self.items]
    226 
    227     def __len__(self):

~/.local/lib/python3.7/site-packages/sklearn/pipeline.py in _fit_transform_one(transformer, X, y, weight, message_clsname, message, **fit_params)
    724     with _print_elapsed_time(message_clsname, message):
    725         if hasattr(transformer, 'fit_transform'):
--> 726             res = transformer.fit_transform(X, y, **fit_params)
    727         else:
    728             res = transformer.fit(X, y, **fit_params).transform(X)

~/.local/lib/python3.7/site-packages/sklearn/pipeline.py in fit_transform(self, X, y, **fit_params)
    381         """
    382         last_step = self._final_estimator
--> 383         Xt, fit_params = self._fit(X, y, **fit_params)
    384         with _print_elapsed_time('Pipeline',
    385                                  self._log_message(len(self.steps) - 1)):

~/.local/lib/python3.7/site-packages/sklearn/pipeline.py in _fit(self, X, y, **fit_params)
    311                 message_clsname='Pipeline',
    312                 message=self._log_message(step_idx),
--> 313                 **fit_params_steps[name])
    314             # Replace the transformer of the step with the fitted
    315             # transformer. This is necessary when loading the transformer

/data1/anaconda3/lib/python3.7/site-packages/joblib/memory.py in __call__(self, *args, **kwargs)
    353 
    354     def __call__(self, *args, **kwargs):
--> 355         return self.func(*args, **kwargs)
    356 
    357     def call_and_shelve(self, *args, **kwargs):

~/.local/lib/python3.7/site-packages/sklearn/pipeline.py in _fit_transform_one(transformer, X, y, weight, message_clsname, message, **fit_params)
    724     with _print_elapsed_time(message_clsname, message):
    725         if hasattr(transformer, 'fit_transform'):
--> 726             res = transformer.fit_transform(X, y, **fit_params)
    727         else:
    728             res = transformer.fit(X, y, **fit_params).transform(X)

~/.local/lib/python3.7/site-packages/sklearn/base.py in fit_transform(self, X, y, **fit_params)
    569         if y is None:
    570             # fit method of arity 1 (unsupervised transformation)
--> 571             return self.fit(X, **fit_params).transform(X)
    572         else:
    573             # fit method of arity 2 (supervised transformation)

~/.local/lib/python3.7/site-packages/sklearn2pmml/preprocessing/__init__.py in transform(self, X)
     95 
     96         def transform(self, X):
---> 97                 return cast(X, self.dtype)
     98 
     99 class CutTransformer(BaseEstimator, TransformerMixin):

~/.local/lib/python3.7/site-packages/sklearn2pmml/util/__init__.py in cast(X, dtype)
      8         if isinstance(dtype, str) and dtype.startswith("datetime64"):
      9                 func = lambda x: to_pydatetime(x, dtype)
---> 10                 return dt_transform(X, func)
     11         else:
     12                 if not hasattr(X, "astype"):

~/.local/lib/python3.7/site-packages/sklearn2pmml/util/__init__.py in dt_transform(X, func)
     58         if len(shape) > 1:
     59                 X = X.ravel()
---> 60         Xt = func(X)
     61         if isinstance(Xt, Index):
     62                 Xt = Xt.values

~/.local/lib/python3.7/site-packages/sklearn2pmml/util/__init__.py in <lambda>(x)
      7 def cast(X, dtype):
      8         if isinstance(dtype, str) and dtype.startswith("datetime64"):
----> 9                 func = lambda x: to_pydatetime(x, dtype)
     10                 return dt_transform(X, func)
     11         else:

~/.local/lib/python3.7/site-packages/sklearn2pmml/util/__init__.py in to_pydatetime(X, dtype)
     66 
     67 def to_pydatetime(X, dtype):
---> 68         Xt = pandas.to_datetime(X, yearfirst = True, origin = "unix")
     69         if hasattr(Xt, "dt"):
     70                 Xt = Xt.dt

/data1/anaconda3/lib/python3.7/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    206                 else:
    207                     kwargs[new_arg_name] = new_arg_value
--> 208             return func(*args, **kwargs)
    209 
    210         return wrapper

/data1/anaconda3/lib/python3.7/site-packages/pandas/core/tools/datetimes.py in to_datetime(arg, errors, dayfirst, yearfirst, utc, box, format, exact, unit, infer_datetime_format, origin, cache)
    792             result = _convert_and_box_cache(arg, cache_array, box)
    793         else:
--> 794             result = convert_listlike(arg, box, format)
    795     else:
    796         result = convert_listlike(np.array([arg]), box, format)[0]

/data1/anaconda3/lib/python3.7/site-packages/pandas/core/tools/datetimes.py in _convert_listlike_datetimes(arg, box, format, name, tz, unit, errors, infer_datetime_format, dayfirst, yearfirst, exact)
    461             errors=errors,
    462             require_iso8601=require_iso8601,
--> 463             allow_object=True,
    464         )
    465 

/data1/anaconda3/lib/python3.7/site-packages/pandas/core/arrays/datetimes.py in objects_to_datetime64ns(data, dayfirst, yearfirst, utc, errors, require_iso8601, allow_object)
   1982             return values.view("i8"), tz_parsed
   1983         except (ValueError, TypeError):
-> 1984             raise e
   1985 
   1986     if tz_parsed is not None:

/data1/anaconda3/lib/python3.7/site-packages/pandas/core/arrays/datetimes.py in objects_to_datetime64ns(data, dayfirst, yearfirst, utc, errors, require_iso8601, allow_object)
   1973             dayfirst=dayfirst,
   1974             yearfirst=yearfirst,
-> 1975             require_iso8601=require_iso8601,
   1976         )
   1977     except ValueError as e:

pandas/_libs/tslib.pyx in pandas._libs.tslib.array_to_datetime()

pandas/_libs/tslib.pyx in pandas._libs.tslib.array_to_datetime()

pandas/_libs/tslib.pyx in pandas._libs.tslib.array_to_datetime()

pandas/_libs/tslib.pyx in pandas._libs.tslib.array_to_datetime()

pandas/_libs/tslibs/np_datetime.pyx in pandas._libs.tslibs.np_datetime.check_dts_bounds()

OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 2999-12-31 00:00:00

I definitely know that the error is caused by this 2999, but I don't know how to deal with it.

In fact, I can understand the error and I searched the error code and found many solutions, but they are all based on the pandas function. Based on my previous experience, I don't know whether these methods can be supported or not.

Since no relevant posts have such problems when using sklearn2pmml, I need your help.

I wonder if CastTransformer caused the problem and if CastTransformer has a parameter that can change a value like 2099 to a specified value.

vruusmann commented 2 years ago

pandas/_libs/tslibs/conversion.pyx in pandas._libs.tslibs.conversion.datetime_to_datetime64()

TypeError: Unrecognized value type: <class 'str'>

This error happens on the Python side, inside the Pandas library. It refuses to accept string values as datetime_to_datetime64(..) arguments.

29991231

Does the Pandas parse succeed when you omit this obviously incorrect value element?

Perhaps Pandas also contains some data sanitization code that accepts 20220714 (looks like a reasonable date) but rejects 29991231 (doesn't look like a reasonable date).

Perhaps Pandas would try harder if it were given an ISO 8601-like date string like 2999-12-31.

vruusmann commented 2 years ago

I wonder if CastTransformer caused the problem and if CastTransformer has a parameter that can change a value like 2099 to a specified value.

Sanitize both your modify_date and day_id values into ISO 8601 date strings YYYY-MM-DD before feeding them to CastTransformer(dtype = "datetime64[D]").

If the Pandas library refuses to parse 2999-12-31, you should look into Pandas source code and possibly open a new issue with the Pandas project.

Write a unit test for all possible combinations that you have tried. Right now you seem to be struggling with code pieces that were working OK before.
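
As a starting point for such a test, here is a minimal sketch that mirrors the string reformatting step in plain Python. The function reformat_modify_date and the FALLBACK constant are hypothetical names; the function restates the reformatting expression from this thread, not the code that sklearn2pmml generates.

```python
import unittest

FALLBACK = "2022-12-30"


def reformat_modify_date(raw):
    """Plain-Python mirror of the expression: YYYYMMDD... -> YYYY-MM-DD."""
    if len(raw) > 0:
        return raw[:4] + "-" + raw[4:6] + "-" + raw[6:8]
    return FALLBACK


class ReformatTest(unittest.TestCase):
    def test_plain_date(self):
        self.assertEqual(reformat_modify_date("20220909"), "2022-09-09")

    def test_timestamp_is_truncated(self):
        self.assertEqual(reformat_modify_date("20220626223702"), "2022-06-26")

    def test_empty_falls_back(self):
        self.assertEqual(reformat_modify_date(""), FALLBACK)

    def test_far_future_passes_through(self):
        # The string step does not sanitize out-of-range years
        self.assertEqual(reformat_modify_date("29991231"), "2999-12-31")
```

Saved to a file, the cases can be run with python -m unittest. The last case documents that far-future values survive the string step and still have to be handled later.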

liuhuanshuo commented 2 years ago

Does the Pandas parse succeed when you omit this obviously incorrect value element?

It looks like pandas can do the conversion, because the following code executes correctly:

pd.to_datetime(pd.DataFrame(['20991231'])[0], errors = 'coerce')
--------------
0   2099-12-31
Name: 0, dtype: datetime64[ns]
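
A likely explanation for why 2099-12-31 parses while 2999-12-31 does not, going by the "Out of bounds nanosecond timestamp" message: the datetime64[ns] type stores nanoseconds since 1970-01-01 in a signed 64-bit integer, which limits it to dates up to the year 2262. The sketch below derives that upper bound with the standard library only; fits_in_datetime64_ns is a hypothetical helper and it ignores the lower bound.

```python
from datetime import date, datetime, timedelta

# Largest nanosecond count that fits in a signed 64-bit integer
max_ns = 2**63 - 1

# Python's datetime has microsecond resolution, so drop the last three digits
upper_bound = datetime(1970, 1, 1) + timedelta(microseconds=max_ns // 1000)
print(upper_bound)  # 2262-04-11 23:47:16.854775


def fits_in_datetime64_ns(d):
    """True if the date does not exceed the nanosecond-resolution upper bound."""
    return d <= upper_bound.date()


print(fits_in_datetime64_ns(date(2099, 12, 31)))  # True
print(fits_in_datetime64_ns(date(2999, 12, 31)))  # False
```

If this is the cause, any placeholder date beyond April 2262 has to be replaced before the value reaches the datetime cast.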

I think this goes back to the fact that int can't be used.

The code below works fine because I used numpy.array(X[0]).astype('int')<20221230 to convert all dates like 20991231 to 20221230

data_test_new = pd.DataFrame({
    'modify_date':['20220626','29991231','20220602'],
    'day_id':['20220714','20220715','20220914']
})

def make_modify_date_pipeline():
    return make_pipeline(ExpressionTransformer("X[0][:4] + '-' + X[0][4:6] + '-' + X[0][6:8] if (len(X[0]) > 0 and numpy.array(X[0]).astype('int')<20221230) else '2022-12-30'"), CastTransformer(dtype = "datetime64[D]"), DaysSinceYearTransformer(year = 2022))

def make_day_id_pipeline():
    return make_pipeline(ExpressionTransformer("X[1][:4] + '-' + X[1][4:6] + '-' + X[1][6:8]"), CastTransformer(dtype = "datetime64[D]"), DaysSinceYearTransformer(year = 2022))

def make_feature_union():
    return FeatureUnion([
        ("modify_date", make_modify_date_pipeline()),
        ("day_id", make_day_id_pipeline())])

mapper_encode = [(['modify_date','day_id'],[make_feature_union(), ExpressionTransformer("(X[1] - X[0]) if (X[0] <= 365 and X[1]>X[0])  else -1")],{'alias':'modify_days'})]

mapper = DataFrameMapper(mapper_encode, input_df=True,df_out=True)

mapper.fit_transform(data_test_new)

Unfortunately, an error occurred while converting to pmml, prompting

Exception in thread "main" java.lang.IllegalArgumentException: Function 'numpy.array' is not supported

I am about to collapse. I thought this was a very simple task, but I really have not been able to complete it!

liuhuanshuo commented 2 years ago

Actually, my idea is simple.

All I need to do is add a condition somewhere in the code below (which should be the original location) to change 29991231 to 20221230. But no matter how I tried, I couldn't succeed. Even if successful, it cannot be converted to pmml.

I am in the process of converting the company related algorithm model to pmml and I almost crashed in this small place!

def make_modify_date_pipeline():
    return make_pipeline(ExpressionTransformer("X[0][:4] + '-' + X[0][4:6] + '-' + X[0][6:8] if len(X[0]) > 0  else '2022-12-30'"), CastTransformer(dtype = "datetime64[D]"), DaysSinceYearTransformer(year = 2022))

liuhuanshuo commented 2 years ago

I'm using a very stupid method now.

It is to add ExpressionTransformer("X[0] if X[0] != '2999-12-31' else '2022-12-30'") as an extra step:

def make_modify_date_pipeline():
    return make_pipeline(ExpressionTransformer("X[0][:4] + '-' + X[0][4:6] + '-' + X[0][6:8] if len(X[0]) > 0 else '2022-12-30'"), ExpressionTransformer("X[0] if X[0] != '2999-12-31' else '2022-12-30'"),CastTransformer(dtype = "datetime64[D]"), DaysSinceYearTransformer(year = 2022))
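
For reference, the chained expressions can be traced in plain Python against the sample rows from the earlier comment. This is only a sketch: the helper names are made up for illustration, and `days_since_year` is a stand-in that assumes CastTransformer plus DaysSinceYearTransformer(year = 2022) yields the number of days since 2022-01-01. The final conditional is the one from the mapper definition.

```python
from datetime import date

def slice_date(text):
    # Reformat 'YYYYMMDD' as 'YYYY-MM-DD' (the slicing used in both pipelines)
    return text[:4] + "-" + text[4:6] + "-" + text[6:8]

def modify_date_iso(text):
    # First expression: default empty input; second expression: swap the sentinel date
    iso = slice_date(text) if len(text) > 0 else "2022-12-30"
    return iso if iso != "2999-12-31" else "2022-12-30"

def days_since_year(iso_text, year=2022):
    # Stand-in (assumption): days elapsed since January 1 of `year`
    return (date.fromisoformat(iso_text) - date(year, 1, 1)).days

def modify_days(modify_date, day_id):
    x0 = days_since_year(modify_date_iso(modify_date))
    x1 = days_since_year(slice_date(day_id))
    return (x1 - x0) if (x0 <= 365 and x1 > x0) else -1

rows = [("20220626", "20220714"), ("29991231", "20220715"), ("20220602", "20220914")]
results = [modify_days(m, d) for m, d in rows]
print(results)  # [18, -1, 104]
```

This only works when the inputs arrive as strings; the slicing produces garbage if the value has been formatted from a floating-point number first.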

Now it's finally working

fit_transform no problem!

No problem converting pmml!

However, when invoked, the following error is still displayed

JavaError: java.lang.IllegalArgumentException: 2.02-20-40

I have clearly followed your instructions to solve the problem, so why is it still like this?

I'm falling apart!

liuhuanshuo commented 2 years ago

The following may help:

You should actually combine these two lines into one:

mapper = DataFrameMapper([
  (['modify_date','day_id'], [MultiDomain(ContinuousDomain(dtype = numpy.int64), DateDomain()), make_feature_union(), ExpressionTransformer("X[1] - X[0]")]),
])

This will cause an error, I have upgraded to the latest version


---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [10], in <cell line: 14>()
8 def make_feature_union():
9     return FeatureUnion([
10         ("modify_date", make_modify_date_pipeline()),
11         ("day_id", make_day_id_pipeline())])
---> 14 mapper_encode = [(['modify_date','day_id'],[MultiDomain(ContinuousDomain(dtype = numpy.int64), DateDomain()),make_feature_union(), ExpressionTransformer("(X[1] - X[0]) if (X[0] <= 365 and X[1]>X[0])  else -1")],{'alias':'modify_days'})]
16 mapper = DataFrameMapper(mapper_encode, input_df=True,df_out=True)

TypeError: __init__() takes 2 positional arguments but 3 were given

liuhuanshuo commented 2 years ago

I've tried every transformer in sklearn2pmml.decoration.

Anyway, I finally found a method that allowed me to convert pmml files successfully and also work with java calls.

Just add the following code at the beginning

StringNormalizer(function = None)

So that's it

(['modify_date','day_id'],[StringNormalizer(function = None),make_feature_union(),  ExpressionTransformer("(X[1] - X[0]) if (X[0] <= 365 and X[1]>X[0])  else -1")],{'alias':'modify_days'}),

I don't know why it works. I've spent so much time on it that I don't have the energy to figure out why it works.

But with the addition of this one piece of code, my system worked.

Anyway, I want to thank you! Thank you for developing such a great package!

vruusmann commented 2 years ago

It looks like you can use pandas to convert, because the following code executes correctly:

PMML operates similarly to pandas.to_datetime(.., errors = "raise"). Therefore, it doesn't matter if Pandas is able to do some clever heuristics in errors = "coerce" mode, because it's inaccessible.

Here's my unit test:

# Fails with pandas._libs.tslibs.np_datetime.OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 2999-12-31 00:00:00 present at position 0
pandas.to_datetime("29991231", errors = "raise")

# Succeeds, kind of. The result is NaT
pandas.to_datetime("29991231", errors = "coerce")
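
The "nanosecond timestamp" wording in the error hints at the likely cause: a timestamp stored as a signed 64-bit count of nanoseconds since 1970-01-01 cannot reach the year 2999. The following standard-library sketch works out that upper bound without Pandas; `fits_in_ns_timestamp` is a hypothetical helper name, and the bound is truncated to microseconds because `datetime` has no finer resolution.

```python
from datetime import datetime, timedelta

INT64_MAX = 2**63 - 1          # largest signed 64-bit nanosecond count
EPOCH = datetime(1970, 1, 1)

# Latest instant representable as int64 nanoseconds since the epoch
latest = EPOCH + timedelta(microseconds=INT64_MAX // 1000)

def fits_in_ns_timestamp(text):
    # Hypothetical helper: does a 'YYYYMMDD' date fall below the bound?
    return datetime.strptime(text, "%Y%m%d") <= latest

print(latest.date())                     # 2262-04-11
print(fits_in_ns_timestamp("20991231"))  # True
print(fits_in_ns_timestamp("29991231"))  # False
```

This is consistent with 20991231 parsing cleanly while 29991231 raises or is coerced to NaT.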

vruusmann commented 2 years ago

Unfortunately, an error occurred while converting to pmml, prompting

Exception in thread "main" java.lang.IllegalArgumentException: Function 'numpy.array' is not supported

It's impossible to use inline cast functions such as builtins.int(..) and builtins.str(..) inside ExpressionTransformer expression due to https://github.com/jpmml/jpmml-python/issues/20.

That's a clever "hack", trying to replace int(..) with numpy.array(..).astype(int), but it runs into exactly the same technical limitation - this function cannot be expressed without creating a standalone DerivedField element (which isn't currently supported).

The inline cast is blocked because of this: http://mantis.dmg.org/view.php?id=169

vruusmann commented 2 years ago

This will cause an error, I have upgraded to the latest version

mapper_encode = [(['modify_date','day_id'],[MultiDomain(ContinuousDomain(dtype = numpy.int64), DateDomain()),make_feature_union(), ExpressionTransformer("(X[1] - X[0]) if (X[0] <= 365 and X[1]>X[0]) else -1")],{'alias':'modify_days'})]

Did you see https://github.com/jpmml/jpmml-evaluator-python/issues/16#issuecomment-1297504622?

I told you that the MultiDomain constructor takes a single argument, which is a list of child decorators.

You're passing two child decorators, without wrapping them into a list. Of course it won't work.
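
The argument counting behind the TypeError can be reproduced with a tiny stand-in class. `FakeMultiDomain` below is hypothetical and is not the real sklearn2pmml class; it only assumes what is stated above, namely a constructor that takes one list argument. The strings are placeholders for the child decorators.

```python
class FakeMultiDomain:
    # Hypothetical stand-in: the constructor accepts a single list of child decorators
    def __init__(self, domains):
        self.domains = list(domains)

first, second = "continuous_domain", "date_domain"  # placeholders for child decorators

# Passing the children as two separate positional arguments fails:
# `self` plus two values makes 3 arguments, but only 2 are accepted.
try:
    FakeMultiDomain(first, second)
    error_message = None
except TypeError as e:
    error_message = str(e)

# Wrapping the children into one list matches the single-argument signature.
wrapped = FakeMultiDomain([first, second])

print(error_message)
print(wrapped.domains)
```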

vruusmann commented 2 years ago

Just add the following code at the beginning

StringNormalizer(function = None)

You could use CastTransformer(dtype = str) with exactly the same effect (convert any value to string, aka "format as string").

vruusmann commented 2 years ago

However, when invoked, the following error is still displayed

JavaError: java.lang.IllegalArgumentException: 2.02-20-40

Did you see https://github.com/jpmml/jpmml-evaluator-python/issues/16#issuecomment-1296589624?

If you format float(20220714) as string, you get 2.0220714E7 (floating-point value, in scientific notation). And the [0:4] substring of this value is 2.02. Everything works just as expected.

Now, if you format int(20220714) as string, you get 20220714 (integer value). The [0:4] substring of it is 2022.
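
The effect can be traced in plain Python. Python's own str(float) does not use this notation for such values, so `java_style_double_string` below is a rough, hypothetical emulation of the Java-side formatting described above, valid only for positive finite values of 1e7 or more. With an input of 20220406 (an assumed example value, not taken from the original data) the slicing produces exactly the string seen in the error message.

```python
from decimal import Decimal

def java_style_double_string(value):
    # Rough emulation (assumption): positive, finite values >= 1e7 only
    parts = Decimal(repr(float(value))).normalize().as_tuple()
    digits = "".join(str(d) for d in parts.digits)
    exponent = len(digits) - 1 + parts.exponent
    fraction = digits[1:] or "0"
    return digits[0] + "." + fraction + "E" + str(exponent)

def slice_date(text):
    # The slicing used inside the ExpressionTransformer expression
    return text[:4] + "-" + text[4:6] + "-" + text[6:8]

float_text = java_style_double_string(20220406)
int_text = str(int(20220406))

print(float_text)              # 2.0220406E7
print(slice_date(float_text))  # 2.02-20-40
print(slice_date(int_text))    # 2022-04-06
```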

vruusmann commented 2 years ago

Marking as "resolved".

The troubled user still doesn't appear to grasp the functional difference between integer and floating-point value spaces (one of them is suitable for emulating dates/datetimes, the other is not), but it's beyond my capacity to provide the necessary education here.

I'm sure life will teach him well!