jpmml / jpmml-lightgbm

Java library and command-line application for converting LightGBM models to PMML
GNU Affero General Public License v3.0

A model that was trained on a dense dataset makes incorrect predictions for sparse datasets #51

Open SamWqc opened 3 years ago

SamWqc commented 3 years ago

Hi, I found that the prediction results produced by the Python LightGBM model and by the PMML file are different. It happens when the training data did not contain any missing values, but the data being predicted does.

Here is an example that shows this case.

vruusmann commented 3 years ago

from pypmml import Model

@SamWqc The JPMML software project is not the place to complain about third-party projects. Your reported results have no relevance here.

If you keep spamming the JPMML software project, you will be blocked.

SamWqc commented 3 years ago

@vruusmann
I am sorry to bother you; I think there may be some misunderstanding. I did not mean to spam the project at all. But I still found the same problem when using jpmml_evaluator.
I hope you could take a look. Thanks!

#######
import lightgbm as lgb
import pandas as pd
import numpy as np
from sklearn2pmml.pipeline import PMMLPipeline
from sklearn_pandas import DataFrameMapper
from sklearn2pmml import sklearn2pmml
from jpmml_evaluator.py4j import launch_gateway, Py4JBackend
from jpmml_evaluator import make_evaluator

np.random.seed(1)
n_feature = 20
fea_name = ['Fea'+str(i+1) for i in range(n_feature)]
#### training data without missing values
X = 10 * np.random.randn(1000, n_feature)
X = X.astype(np.float32)
Y = np.random.randint(0, 2, 1000)

my_model = lgb.LGBMClassifier(n_estimators=100)
my_model.fit(X, Y, feature_name=fea_name)

mapper = DataFrameMapper([([i], None) for i in fea_name])  

pipeline = PMMLPipeline([
    ('mapper', mapper), 
    ("classifier", my_model)
])

sklearn2pmml(pipeline, "lgb.pmml")

##### load the PMML file #####
gateway = launch_gateway()
backend = Py4JBackend(gateway)

evaluator = make_evaluator(backend, "lgb.pmml") \
    .verify()
# evaluate with missing values
np.random.seed(9999)
X = 10 * np.random.randn(1000, n_feature)
X = X.astype(np.float32)
X[X < 0] = np.nan

X = pd.DataFrame(X, columns=fea_name).replace({np.nan: None})

results_df = evaluator.evaluateAll(X)
Jpmml_model_pred = results_df.to_numpy()[:,2]
my_model_pred = my_model.predict_proba(X.to_numpy())[:,1]

res_df = pd.DataFrame({
    'my_model_pred':my_model_pred,
    'Jpmml_model_pred':Jpmml_model_pred
})
res_df['pred_diff'] = abs(res_df['my_model_pred'] - res_df['Jpmml_model_pred'])

print(res_df.sort_values('pred_diff',ascending=False).head(10))
     my_model_pred  Jpmml_model_pred  pred_diff
321       0.869994          0.049991   0.820004
628       0.873887          0.056304   0.817583
704       0.974523          0.169809   0.804715
984       0.924378          0.131011   0.793367
893       0.822017          0.029407   0.792610
682       0.044943          0.826341   0.781398
921       0.903011          0.128266   0.774745
995       0.155294          0.925298   0.770004
844       0.856560          0.089665   0.766896
963       0.938739          0.173073   0.765666
# evaluate with missing values replaced by 0
np.random.seed(999)
X = 10 * np.random.randn(1000, n_feature)
X = X.astype(np.float32)
X[X < 0] = np.nan

X = pd.DataFrame(X, columns=fea_name).replace({np.nan: 0})

results_df = evaluator.evaluateAll(X)
Jpmml_model_pred = results_df.to_numpy()[:,2]
my_model_pred = my_model.predict_proba(X.to_numpy())[:,1]

res_df = pd.DataFrame({
    'my_model_pred':my_model_pred,
    'Jpmml_model_pred':Jpmml_model_pred
})
res_df['pred_diff'] = abs(res_df['my_model_pred'] - res_df['Jpmml_model_pred'])

print(res_df.sort_values('pred_diff',ascending=False).head(10))
     my_model_pred  Jpmml_model_pred  pred_diff
0         0.242943          0.242943        0.0
671       0.458703          0.458703        0.0
658       0.807326          0.807326        0.0
659       0.748976          0.748976        0.0
660       0.690734          0.690734        0.0
661       0.608443          0.608443        0.0
662       0.625638          0.625638        0.0
663       0.706605          0.706605        0.0
664       0.855556          0.855556        0.0
665       0.259897          0.259897        0.0
vruusmann commented 3 years ago

@SamWqc But I still found the same problem when using jpmml_evaluator.

That's the correct way of doing things!

I moved this issue to the JPMML-LightGBM project, because it looks like a LGBM-to-PMML conversion issue. Specifically, the "default child" instruction is wrong - it is "send missing values to the left", but it should be "send missing values to the right".

This issue manifests itself when the LGBM model was trained on a dataset that DID NOT contain any missing values.

See for yourself: if you insert some missing values into the training dataset, then the JPMML-Evaluator predictions will be correct in both cases:

np.random.seed(1)
n_feature = 20
fea_name = ['Fea'+str(i+1) for i in range(n_feature)]
#### training data, now with missing values inserted
X = 10 * np.random.randn(1000, n_feature)
X = X.astype(np.float32)
# THIS!
X[X < 5] = np.nan
Y = np.random.randint(0, 2, 1000)

SamWqc commented 3 years ago

For LGBM, when predicting with a missing value for a feature that had no missing values during training, it will treat the missing value as 0. But in PMML, it seems to return the last prediction (returnLastPrediction). I would also like to know how PMML handles missing values for a feature when the training data does contain missing values for that feature.

<TreeModel functionName="regression" noTrueChildStrategy="returnLastPrediction">
  <MiningSchema>
    <MiningField name="Fea1"/>
    <MiningField name="Fea2"/>
    <MiningField name="Fea3"/>
    <MiningField name="Fea4"/>
    <MiningField name="Fea6"/>
    <MiningField name="Fea7"/>
    <MiningField name="Fea8"/>
    <MiningField name="Fea9"/>
    <MiningField name="Fea10"/>
    <MiningField name="Fea12"/>
    <MiningField name="Fea14"/>
    <MiningField name="Fea15"/>
    <MiningField name="Fea16"/>
    <MiningField name="Fea18"/>
    <MiningField name="Fea19"/>
    <MiningField name="Fea20"/>
  </MiningSchema>
  <Node score="-0.09482225077425536">
    <True/>
    <Node score="0.1452248008167614">
      <SimplePredicate field="Fea3" operator="greaterThan" value="-21.8774881362915"/>
      <Node score="0.09188101157431317">
        <SimplePredicate field="Fea6" operator="greaterThan" value="-19.934649467468258"/>
        <Node score="-0.10302898758078581">
          <SimplePredicate field="Fea7" operator="greaterThan" value="-12.244786739349363"/>
          <Node score="-0.06281597722878647">
            <SimplePredicate field="Fea4" operator="greaterThan" value="-12.073745250701903"/>
            <Node score="-0.10815819808486733">
              <SimplePredicate field="Fea4" operator="greaterThan" value="2.551533937454224"/>
              <Node score="-0.04641770127647833">
                <SimplePredicate field="Fea8" operator="greaterThan" value="8.414263725280763"/>
              </Node>
              <Node score="-0.07481832980833732">
                <SimplePredicate field="Fea8" operator="greaterThan" value="-7.9984328746795645"/>
                <Node score="0.1636899586314549">
                  <SimplePredicate field="Fea15" operator="greaterThan" value="-14.471662521362303"/>
                  <Node score="0.07521107743604809">
                    <SimplePredicate field="Fea2" operator="greaterThan" value="-4.377085924148559"/>
                    <Node score="0.14189081398910836">
                      <SimplePredicate field="Fea15" operator="greaterThan" value="10.133028984069826"/>
                    </Node>
                    <Node score="0.09664384989953179">
                      <SimplePredicate field="Fea16" operator="greaterThan" value="7.464995622634889"/>
                    </Node>
                    <Node score="0.09794280580640959">
                      <SimplePredicate field="Fea9" operator="greaterThan" value="5.7242188453674325"/>
                    </Node>
                    <Node score="-0.15578658133705325">
                      <SimplePredicate field="Fea9" operator="greaterThan" value="1.9439340829849245"/>
                    </Node>
                    <Node score="-0.06815035615303128">
                      <SimplePredicate field="Fea14" operator="greaterThan" value="0.9443753361701966"/>
                    </Node>
                  </Node>
                  <Node score="-0.022427108230932868">
                    <SimplePredicate field="Fea1" operator="greaterThan" value="-2.777882814407348"/>
                    <Node score="0.1252208798508433">
                      <SimplePredicate field="Fea4" operator="greaterThan" value="7.320906639099122"/>
                    </Node>
                  </Node>
                </Node>
              </Node>
              <Node score="0.019794809895329203">
                <SimplePredicate field="Fea10" operator="greaterThan" value="-1.9369573593139646"/>
              </Node>
            </Node>
            <Node score="0.0652091169530891">
              <SimplePredicate field="Fea9" operator="greaterThan" value="11.225318908691408"/>
              <Node score="-0.08833449262314681">
                <SimplePredicate field="Fea20" operator="greaterThan" value="-0.8956793844699859"/>
              </Node>
            </Node>
            <Node score="-0.03568022357067151">
              <SimplePredicate field="Fea2" operator="greaterThan" value="-15.326576232910154"/>
              <Node score="-0.030809703683317567">
                <SimplePredicate field="Fea18" operator="greaterThan" value="-5.7649521827697745"/>
                <Node score="0.12866983174151886">
                  <SimplePredicate field="Fea14" operator="greaterThan" value="5.901069641113282"/>
                  <Node score="-0.05868613548098403">
                    <SimplePredicate field="Fea18" operator="greaterThan" value="2.781560301780701"/>
                  </Node>
                </Node>
                <Node score="0.1842068006477812">
                  <SimplePredicate field="Fea15" operator="greaterThan" value="8.100379943847658"/>
                </Node>
                <Node score="0.18886971928785534">
                  <SimplePredicate field="Fea16" operator="greaterThan" value="8.023637294769289"/>
                </Node>
                <Node score="0.10299430099982321">
                  <SimplePredicate field="Fea10" operator="greaterThan" value="-6.957179784774779"/>
                </Node>
              </Node>
              <Node score="0.0954853216582624">
                <SimplePredicate field="Fea19" operator="greaterThan" value="0.7611226439476014"/>
              </Node>
            </Node>
          </Node>
          <Node score="0.011865327710640965">
            <SimplePredicate field="Fea12" operator="greaterThan" value="-1.6984871029853819"/>
          </Node>
        </Node>
        <Node score="0.1680864247778106">
          <SimplePredicate field="Fea9" operator="greaterThan" value="6.78233814239502"/>
        </Node>
        <Node score="-0.018285509687264518">
          <SimplePredicate field="Fea9" operator="greaterThan" value="0.5637701749801637"/>
        </Node>
      </Node>
    </Node>
  </Node>
</TreeModel>
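
To make the LightGBM side of this claim easy to check, here is a minimal sketch (mine, not from the original report) that reuses the my_model classifier and n_feature from the example above and probes a single row. The assumption being tested is that, with missing type None, LightGBM substitutes 0 for NaN at prediction time, so both probes should return (approximately) the same probability:

# Hedged sketch: probe the dense-trained LightGBM model from the example above.
probe = 10 * np.random.randn(1, n_feature).astype(np.float32)

probe_nan = probe.copy()
probe_nan[0, 0] = np.nan    # Fea1 never had missing values during training

probe_zero = probe.copy()
probe_zero[0, 0] = 0.0      # the suspected fallback value

print(my_model.predict_proba(probe_nan)[:, 1],
      my_model.predict_proba(probe_zero)[:, 1])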
vruusmann commented 3 years ago

For LGBM, when predicting with a missing value for a feature that had no missing values during training, it will treat the missing value as 0. But in PMML, it seems to return the last prediction (returnLastPrediction).

You can choose between different PMML representations when converting by toggling the compact flag:

pipeline = PMMLPipeline(..)
pipeline.fit(X, y)
# THIS
pipeline.configure(compact = False)
sklearn2pmml(pipeline, "lgbm.pmml")

Both the compacted and the non-compacted PMML representations suffer from the issue stated above.

I would also like to know how PMML handles missing values for a feature when the training data does contain missing values for that feature.

Missing values are sent to the left or right child node depending on the MASK_DEFAULT_LEFT flag: https://github.com/jpmml/jpmml-lightgbm/blob/1.3.11/src/main/java/org/jpmml/lightgbm/Tree.java#L136

The question is why LightGBM sets the MASK_DEFAULT_LEFT value differently for dense vs. sparse training datasets. Or perhaps there is some super-flag that overrides the MASK_DEFAULT_LEFT value in special cases.
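
One way to see what LightGBM actually recorded is to decode the decision_type values from the booster's text dump. The sketch below is my own inspection code (not part of JPMML-LightGBM) and assumes LightGBM's bit layout for decision_type: bit 0 = categorical split, bit 1 = default-left, bits 2-3 = missing type (0 = None, 1 = Zero, 2 = NaN); verify against your LightGBM version.

# Hedged sketch: decode the decision_type flags of the first tree from the
# trained booster's text dump (my_model is the classifier from the example above).
model_str = my_model.booster_.model_to_string()
missing_names = {0: "None", 1: "Zero", 2: "NaN"}

for line in model_str.splitlines():
    if line.startswith("decision_type="):
        for dt in map(int, line.split("=", 1)[1].split()):
            default_left = bool(dt & 2)                 # default-left mask
            missing_type = missing_names[(dt >> 2) & 3]
            print(dt, "default_left:", default_left, "missing_type:", missing_type)
        break  # the first tree is enough to see the pattern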

vruusmann commented 3 years ago

@SamWqc TLDR: If your testing dataset contains missing values, then your training dataset should also contain missing values.

It seems to me a flawed assumption that you can train on dense data only, and then test on both dense AND sparse data. No algorithm is guaranteed to have such generalization powers.

SamWqc commented 3 years ago

Yes. I think MASK_DEFAULT_LEFT alone is not enough. LGBM will first look at the missing type. If the missing type is None, the missing value is converted to 0 and the missing direction does not apply. Missing value handling in LGBM: https://github.com/microsoft/LightGBM/issues/2921#issuecomment-607556348
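
To see this effect end to end, the following sketch (mine, not from the thread) trains the same model twice, once on dense data and once with NaNs injected as suggested above, and reads the per-split missing_type and default_left fields from LightGBM's JSON dump (Booster.dump_model()):

# Hedged sketch: compare the missing-value handling recorded by LightGBM for a
# dense vs. a sparse training dataset, using the JSON dump of the booster.
import numpy as np
import lightgbm as lgb

def split_flags(node, out):
    # Recursively collect (missing_type, default_left) pairs from a dumped tree.
    if "split_index" in node:
        out.add((node["missing_type"], node["default_left"]))
        split_flags(node["left_child"], out)
        split_flags(node["right_child"], out)
    return out

np.random.seed(1)
X = 10 * np.random.randn(1000, 20).astype(np.float32)
y = np.random.randint(0, 2, 1000)

X_sparse = X.copy()
X_sparse[X_sparse < 5] = np.nan   # inject missing values into the training data

for name, data in [("dense", X), ("sparse", X_sparse)]:
    clf = lgb.LGBMClassifier(n_estimators=10).fit(data, y)
    dump = clf.booster_.dump_model()
    flags = split_flags(dump["tree_info"][0]["tree_structure"], set())
    print(name, "->", flags)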