Closed lev112 closed 2 years ago
Transferred this issue from the JPMML-SkLearn project to the JPMML-XGBoost project, because the rounding logic, resides here.
my xgboost model creates a split on x < -9.53674316e-07 (1/(2**20))
How do you know that -9.53674316e-07
is the correct value? Did you dump your XGBoost model in some text data format, and observed the split threshold value manually?
Also, what is the native data type of your feature (integer, float32 or float64)? Is it being subjected to some pre-processing activities that might change its data type?
when that leaf is encoded into the pmml, it is rounded to 0 (the condition is x >= 0, instead of x >= -9.53674316e-07)
The JPMML-XGBoost library reads the split value from XGBoost dump file (using the Float#intBitsToFloat(int)
method), and definitely does not add or lose any precision in this point.
The loaded float threshold value is then rounded using exactly the same logic as in XGBoost C(++) code: https://github.com/jpmml/jpmml-xgboost/blob/1.7.0/pmml-xgboost/src/main/java/org/jpmml/xgboost/RegTree.java#L372-L391
Please note that the rounding is only applied to integer
data type features. All float
data type values should pass through unchanged there.
TLDR: Please report back some details about your features. Might be a data pre-processing issue instead.
The loaded float threshold value is then rounded using exactly the same logic as in XGBoost C(++) code
Important point - the XGBoost library had changed its integer rounding behaviour some time ago, and some JPMML-XGBoost 1.5.X and 1.6.X library versions were not aware of this.
However, the latest JPMML-XGBoost 1.7.X development branch should be all good, thanks to he following code change: https://github.com/jpmml/jpmml-xgboost/commit/04601610352a0548b60e9310c01c535fd1e0c1e5
thank you for the clarification @vruusmann
I was getting the splits using the dump_model
function of the booster
after your comment I realized that I had the feature names confused, and the feature was actually OHE, and because it was treated as categorical in the pmml, I miss read it
thanks again for the fast response
I was getting the splits using the dump_model function of the booster
IIRC, this function accepts a "feature map" argument, and changes its formatting behaviour according to feature definitions contained therein.
For example, if you supply a feature map where the feature data type is integer
, then it should be rounded down to 0
in dump_model
function output too. If you supply it as float
(or don't supply anything at all), then it will be preserved as -9.53674316e-07
.
the feature was actually OHE, and because it was treated as categorical in the pmml
OHE produces instances of org.jpmml.converter.BinaryFeature
, which can be safely formatted using a simplified algorithm:
https://github.com/jpmml/jpmml-xgboost/blob/1.7.0/pmml-xgboost/src/main/java/org/jpmml/xgboost/RegTree.java#L316-L323
Anyway, as of today, you should upgrade your XGBoost library to 1.5.X or newer, and stop using external OHE at all!
my xgboost model creates a split on
x < -9.53674316e-07
(1/(2**20)
) , but when that leaf is encoded into the pmml, it is rounded to0
(the condition isx >= 0
, instead ofx >= -9.53674316e-07
).for most values of
x
this is fine, but whenx=0
(which is a very common value in my data set) this changes the behavior of the model vs the pmml.versions used: