jpmml / jpmml-sparkml

Java library and command-line application for converting Apache Spark ML pipelines to PMML
GNU Affero General Public License v3.0

LightGBM integration #107

Closed Archmagegck closed 2 years ago

Archmagegck commented 3 years ago

My dataset contains some string features, and I used StringIndexers to encode them. When I put the StringIndexers, VectorAssembler, and LightGBM in a pipeline and fit the pipeline, everything works.

But when I tried to save the pipeline as PMML, an error occurred. The error log is:

Py4JJavaError: An error occurred while calling o16875.buildFile.
: java.lang.IllegalArgumentException: Field devinfov3_general_data_locale_iso_3_country has data type string

devinfov3_general_data_locale_iso_3_country is one of the string features.

vruusmann commented 3 years ago

You can probably work around this specific error message by "downgrading" categorical features to the one-hot-encoded representation (StringIndexer + OneHotEncoder).
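For illustration, the StringIndexer + OneHotEncoder encoding could be wired up roughly as follows in PySpark. This is only a sketch under assumptions: the `_idx` / `_ohe` column suffixes, the `features` output column, and the helper names `encoded_columns` and `build_pipeline` are hypothetical, and the LightGBM estimator is passed in as a parameter because its import path depends on the installed LightGBM-on-Spark package. It shows the encoding only and is not a confirmed fix for the error above.

```python
def encoded_columns(categorical_cols):
    """Return (source, indexed, one-hot) column-name triples (hypothetical naming)."""
    return [(c, f"{c}_idx", f"{c}_ohe") for c in categorical_cols]


def build_pipeline(categorical_cols, numeric_cols, estimator):
    """Assemble StringIndexer -> OneHotEncoder -> VectorAssembler -> estimator.

    `estimator` is assumed to read its input from a column named "features".
    """
    # Imported lazily so encoded_columns() can be used without a Spark installation.
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

    triples = encoded_columns(categorical_cols)
    stages = []
    for src, idx, ohe in triples:
        # Map each string value to a numeric category index.
        stages.append(StringIndexer(inputCol=src, outputCol=idx))
        # Expand the category index into a sparse binary vector.
        stages.append(OneHotEncoder(inputCol=idx, outputCol=ohe))

    # Combine the one-hot vectors and the numeric columns into one feature vector.
    feature_cols = [ohe for _, _, ohe in triples] + list(numeric_cols)
    stages.append(VectorAssembler(inputCols=feature_cols, outputCol="features"))
    stages.append(estimator)
    return Pipeline(stages=stages)
```

With an active Spark session, something like `build_pipeline(["devinfov3_general_data_locale_iso_3_country"], numeric_cols, lightgbm_estimator).fit(df)` would then produce the fitted pipeline to pass to the PMML export, where `numeric_cols`, `lightgbm_estimator`, and `df` stand in for the user's own objects.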

In the long term, the JPMML-SparkML library should include a specialized LightGBM integration (that knows about LightGBM's ability to accept categorical features as-is).

Archmagegck commented 3 years ago

Sorry, I opened this issue in the wrong place. This is actually an issue for jpmml-lightgbm. I tried StringIndexer + OneHotEncoder to encode the string features, but the problem persists. I have opened a new issue at jpmml-lightgbm (https://github.com/jpmml/jpmml-lightgbm/issues/47).

vruusmann commented 2 years ago

Fixed in https://github.com/jpmml/jpmml-sparkml/commit/7635defa820eb4a7484adcf200d9ba2c41f57ff5