Closed. USCYuandaDu closed this issue 1 year ago.
Rejecting `float` columns is a defensive decision. All Apache Spark ML algorithms operate on `double` inputs only. Even if your dataset contains `float` values, they will be automatically cast to `double` values (e.g. by `VectorAssembler`) before any algorithm sees them.
The XGBoost algorithm is an "external" algorithm, and may be able to operate on `float` values directly. However, you would need to train the XGBoost model using the RDD API, not the DataFrame/Dataset API, to ensure that all `float` values survive intact.
Long story short, when using Apache Spark ML standard APIs (e.g. `Pipeline`), all your numeric values are converted to `double` values sooner or later. This exception encourages you to make that conversion explicit (i.e. to cast `float` columns manually to `double` columns).
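As an aside (a general IEEE 754 fact, not something stated in this thread): the `float`-to-`double` cast is lossless, so making it explicit costs nothing in precision. A minimal Python sketch:

```python
import struct

def to_float32(x: float) -> float:
    """Round a Python double to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

# 0.1 is not exactly representable in binary; float32 and float64 round it
# differently, so the single-precision value differs from the literal 0.1 ...
f32 = to_float32(0.1)
assert f32 != 0.1

# ... but every float32 value IS exactly representable as a float64, so the
# float -> double widening cast (whether Spark does it implicitly or you do
# it manually) never loses information.
assert struct.unpack("d", struct.pack("d", f32))[0] == f32
```

In PySpark, the explicit cast itself is typically written as `df.withColumn("x", df["x"].cast("double"))` (column name `"x"` is hypothetical here).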
However, let's keep this issue open, and check regularly whether Apache Spark ML standard APIs become more flexible in this area. For example, `VectorAssembler` might some day be able to produce `float[]` vectors in addition to `double[]` vectors.
@vruusmann Sounds great. What about making a small change here? https://github.com/jpmml/jpmml-sparkml-xgboost/issues/6
Hi @vruusmann, I get

```
java.lang.IllegalArgumentException: Expected string, integral, double or boolean type, got float type
```

when the training data's schema contains a `float` column. I think we should accept `FloatType` as well, to cover more general circumstances. Maybe the code here could be changed? https://github.com/jpmml/jpmml-sparkml/blob/f6af69be7ccc2cdffe5d37c1790b04d1d10f9c23/src/main/java/org/jpmml/sparkml/SparkMLEncoder.java#L82-L95

Thank you!

Best,
Yuanda
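The requested change can be sketched roughly as follows. This is a hypothetical Python mock-up of the type gate, not the actual Java code in `SparkMLEncoder`; the function name, the set of accepted type names, and the column name are all illustrative:

```python
# Hypothetical re-creation of the schema type check: "float" is added to the
# set of accepted type names, so float columns no longer raise.
ACCEPTED_TYPES = {"string", "integer", "long", "double", "boolean", "float"}

def check_column_type(name: str, dtype: str) -> None:
    """Raise if the column's data type is not supported (mock of the gate)."""
    if dtype not in ACCEPTED_TYPES:
        raise ValueError(
            f"Expected string, integral, double or boolean type, got {dtype} type"
        )

check_column_type("feature", "float")  # passes once "float" is accepted
```

In the real Java code this would mean handling `org.apache.spark.sql.types.FloatType` in the same place where `DoubleType` is handled today.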