jpmml / jpmml-xgboost

Java library and command-line application for converting XGBoost models to PMML
GNU Affero General Public License v3.0
128 stars 44 forks source link

[bug] incorrect rounding of small negative float #65

Closed lev112 closed 2 years ago

lev112 commented 2 years ago

my xgboost model creates a split on x < -9.53674316e-07 (1/(2**20)) , but when that leaf is encoded into the pmml, it is rounded to 0 (the condition is x >= 0, instead of x >= -9.53674316e-07).

for most values of x this is fine, but when x=0 (which is a very common value in my data set) this changes the behavior of the model vs the pmml.

versions used:

vruusmann commented 2 years ago

Transferred this issue from the JPMML-SkLearn project to the JPMML-XGBoost project, because the rounding logic, resides here.

my xgboost model creates a split on x < -9.53674316e-07 (1/(2**20))

How do you know that -9.53674316e-07 is the correct value? Did you dump your XGBoost model in some text data format, and observed the split threshold value manually?

Also, what is the native data type of your feature (integer, float32 or float64)? Is it being subjected to some pre-processing activities that might change its data type?

when that leaf is encoded into the pmml, it is rounded to 0 (the condition is x >= 0, instead of x >= -9.53674316e-07)

The JPMML-XGBoost library reads the split value from XGBoost dump file (using the Float#intBitsToFloat(int) method), and definitely does not add or lose any precision in this point.

The loaded float threshold value is then rounded using exactly the same logic as in XGBoost C(++) code: https://github.com/jpmml/jpmml-xgboost/blob/1.7.0/pmml-xgboost/src/main/java/org/jpmml/xgboost/RegTree.java#L372-L391

Please note that the rounding is only applied to integer data type features. All float data type values should pass through unchanged there.

TLDR: Please report back some details about your features. Might be a data pre-processing issue instead.

vruusmann commented 2 years ago

The loaded float threshold value is then rounded using exactly the same logic as in XGBoost C(++) code

Important point - the XGBoost library had changed its integer rounding behaviour some time ago, and some JPMML-XGBoost 1.5.X and 1.6.X library versions were not aware of this.

However, the latest JPMML-XGBoost 1.7.X development branch should be all good, thanks to he following code change: https://github.com/jpmml/jpmml-xgboost/commit/04601610352a0548b60e9310c01c535fd1e0c1e5

lev112 commented 2 years ago

thank you for the clarification @vruusmann

I was getting the splits using the dump_model function of the booster

after your comment I realized that I had the feature names confused, and the feature was actually OHE, and because it was treated as categorical in the pmml, I miss read it

thanks again for the fast response

vruusmann commented 2 years ago

I was getting the splits using the dump_model function of the booster

IIRC, this function accepts a "feature map" argument, and changes its formatting behaviour according to feature definitions contained therein.

For example, if you supply a feature map where the feature data type is integer, then it should be rounded down to 0 in dump_model function output too. If you supply it as float (or don't supply anything at all), then it will be preserved as -9.53674316e-07.

the feature was actually OHE, and because it was treated as categorical in the pmml

OHE produces instances of org.jpmml.converter.BinaryFeature, which can be safely formatted using a simplified algorithm: https://github.com/jpmml/jpmml-xgboost/blob/1.7.0/pmml-xgboost/src/main/java/org/jpmml/xgboost/RegTree.java#L316-L323

Anyway, as of today, you should upgrade your XGBoost library to 1.5.X or newer, and stop using external OHE at all!