jpmml / jpmml-sparkml

Java library and command-line application for converting Apache Spark ML pipelines to PMML
GNU Affero General Public License v3.0

Support for 32-bit float type? #46

Status: Closed (USCYuandaDu closed this issue 1 year ago)

USCYuandaDu commented 6 years ago

Hi @vruusmann,

I get java.lang.IllegalArgumentException: Expected string, integral, double or boolean type, got float type when the training data's schema contains a float column. I think we should add floatType to cover more general cases. Maybe the code could be changed here? https://github.com/jpmml/jpmml-sparkml/blob/f6af69be7ccc2cdffe5d37c1790b04d1d10f9c23/src/main/java/org/jpmml/sparkml/SparkMLEncoder.java#L82-L95

Thank you!

Best,
Yuanda

vruusmann commented 6 years ago

Rejecting float columns is a defensive decision.

All Apache Spark ML algorithms operate on double inputs only. Even if your dataset contains float values, they will be automatically cast to double values (e.g. by VectorAssembler) before any algorithm sees them.
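
To illustrate what the float-to-double cast described above does to the values, here is a minimal sketch in plain Java with no Spark involved. The class name FloatWidening and the method widen are hypothetical, chosen only for this illustration. It shows that widening a float to a double is exact (casting back returns the original float), but that the widened value is generally not the same as the double written with the same decimal digits.

```java
// Hypothetical illustration class; not part of JPMML-SparkML or Apache Spark.
class FloatWidening {

    // Widen every float to a double, element by element.
    static double[] widen(float[] values) {
        double[] result = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            // Widening primitive conversion: every float is exactly representable as a double.
            result[i] = values[i];
        }
        return result;
    }

    public static void main(String[] args) {
        float[] floats = {0.1f, 1.5f, 3.14159f};
        double[] doubles = widen(floats);

        for (int i = 0; i < floats.length; i++) {
            // Narrowing the widened value back to float recovers the original value.
            boolean roundTrips = ((float) doubles[i]) == floats[i];
            System.out.println(floats[i] + " -> " + doubles[i] + " (round-trips: " + roundTrips + ")");
        }

        // The widened float is not the nearest double to the decimal 0.1.
        System.out.println("(double) 0.1f == 0.1 ? " + (doubles[0] == 0.1));
    }
}
```

So a manual cast of float columns to double columns should not lose information, but values such as 0.1f will show up with extra digits once they are viewed as doubles.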

The XGBoost algorithm is an "external" algorithm, and is possibly able to operate on float values directly. However, you would need to train the XGBoost model using the RDD API, not the DataFrame/Dataset API, to ensure that all float values survive.

vruusmann commented 6 years ago

Long story short, when using Apache Spark ML standard APIs (e.g. Pipeline), all your numeric values are converted to double values sooner or later. This exception encourages you to make this conversion explicit (i.e. cast float columns manually to double columns).

However, let's keep this issue open, and check regularly whether Apache Spark ML standard APIs become more flexible here. For example, if VectorAssembler becomes able to produce float[] vectors in addition to double[] vectors.

USCYuandaDu commented 6 years ago

@vruusmann Sounds great. What about making a small change here? https://github.com/jpmml/jpmml-sparkml-xgboost/issues/6