sllynn / spark-xgboost

A Python wrapper for XGBoost4J-Spark classes.
https://xgboost.readthedocs.io/en/latest/jvm/xgboost4j_spark_tutorial.html

Cannot Load model using PySpark xgboost4j #4

Open WillSmisi opened 3 years ago

WillSmisi commented 3 years ago

Background

I have a small PySpark program that uses xgboost4j and xgboost4j-spark to train a model on a given dataset in Spark DataFrame form.

Training and saving complete successfully, but it seems I cannot load the model back.

Current library versions:

The main process is as follows:

trainingData, testData = data.randomSplit([0.7, 0.3])
vectorAssembler = (
    VectorAssembler()
        .setInputCols(numeric_features_new)
        .setOutputCol(FEATURES)
)
scaler = MinMaxScaler(inputCol=FEATURES,
                      outputCol=FEATURES + '_scaler')
assemblerInputCols = FEATURES + '_scaler'

xgb_params = dict(
        eta=0.1,
        maxDepth=2,
        missing=0.0,
        objective="binary:logistic",
        numRound=5,
        numWorkers=1
    )

xgb = (
      XGBoostClassifier(**xgb_params)
          .setFeaturesCol(assemblerInputCols)
          .setLabelCol(LABEL)
  )

pipeline = Pipeline(stages=[
             vectorAssembler,
             scaler,
             xgb
           ])
print("training model")
pipline_model = pipeline.fit(trainingData)
print("saving model to S3")
pipline_model.write().overwrite().save(modelOssDir)
print("saved model to S3")
print("Loading model...")
pipline_model = PipelineModel.load(modelOssDir)

The error I get:

Traceback (most recent call last):
  File "xgboost.py", line 95, in <module>
    pipline_model = PipelineModel.load(modelOssDir)
  File "/home/admin/1610603211241401722_0/pyspark.zip/pyspark/ml/util.py", line 362, in load
  File "/home/admin/1610603211241401722_0/pyspark.zip/pyspark/ml/pipeline.py", line 242, in load
  File "/home/admin/1610603211241401722_0/pyspark.zip/pyspark/ml/util.py", line 304, in load
  File "/home/admin/1610603211241401722_0/pyspark.zip/pyspark/ml/pipeline.py", line 299, in _from_java
  File "/home/admin/1610603211241401722_0/pyspark.zip/pyspark/ml/wrapper.py", line 227, in _from_java
  File "/home/admin/1610603211241401722_0/pyspark.zip/pyspark/ml/wrapper.py", line 221, in __get_class
ImportError: No module named ml.dmlc.xgboost4j.scala.spark
at com.aliyun.odps.cupid.CupidUtil.errMsg2SparkException(CupidUtil.java:50)
    at com.aliyun.odps.cupid.CupidUtil.getResult(CupidUtil.java:131)
    at com.aliyun.odps.cupid.requestcupid.YarnClientImplUtil.pollAMStatus(YarnClientImplUtil.java:108)
    at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.applicationReportTransform(YarnClientImpl.java:377)
    ... 12 more
21/01/22 11:39:21 ERROR Client: Application diagnostics message: Failed to contact YARN for application application_1611286494541_745555769.
Exception in thread "main" org.apache.spark.SparkException: Application application_1611286494541_745555769 finished with failed status
    at org.apache.spark.deploy.yarn.Client.run(Client.scala:1166)
    at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1543)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
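For context on the `ImportError` above: when PySpark deserializes a persisted pipeline stage, it takes the stage's Java class name, rewrites the `org.apache.spark` package prefix to `pyspark`, and imports the result as a Python module. A third-party stage such as `ml.dmlc.xgboost4j.scala.spark.XGBoostClassificationModel` has no such prefix, so PySpark ends up trying to import a Python module literally named `ml.dmlc.xgboost4j.scala.spark`, which does not exist unless a matching Python wrapper is on the `PYTHONPATH`. A simplified sketch of that mapping (loosely modelled on `pyspark/ml/wrapper.py`, not the exact implementation):

```python
import importlib


def python_class_for_java_stage(java_class_name: str):
    """Simplified sketch of how PySpark resolves the Python class
    for a persisted Java pipeline stage during PipelineModel.load."""
    # Spark's own stages map cleanly: org.apache.spark.* -> pyspark.*
    py_name = java_class_name.replace("org.apache.spark", "pyspark")
    module_name, cls_name = py_name.rsplit(".", 1)
    # Third-party stages keep their JVM package name, so this import
    # fails unless an identically named Python module exists.
    module = importlib.import_module(module_name)
    return getattr(module, cls_name)


# An xgboost4j stage reproduces the ImportError seen in the traceback:
try:
    python_class_for_java_stage(
        "ml.dmlc.xgboost4j.scala.spark.XGBoostClassificationModel")
except ImportError as exc:
    print("reproduced:", exc)
```

This is why loading fails even though saving works: the write path goes through the JVM only, while the read path also needs a Python-side class to wrap each stage.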

I have been searching online for a long time, but with no luck. Please help, or suggest some ideas for how to achieve this.

Thanks in advance.

sllynn commented 3 years ago

Do you get an error with v0.9 of xgboost4j / xgboost4j-spark?

WillSmisi commented 3 years ago

Do you get an error with v0.9 of xgboost4j / xgboost4j-spark?

I guess so. I succeeded in training the xgboost model and uploading it, but failed to load it.

WillSmisi commented 3 years ago

Do you get an error with v0.9 of xgboost4j / xgboost4j-spark?

Thanks for your reply. Have you tried saving a model and then loading it?

akshayparanjape commented 3 years ago

You can save the model as below:

pipe = Pipeline(stages=stages + [xgb])
model = pipe.fit(data)
model.write().overwrite().save(modelpath)

and load it later as:

from pyspark.ml import PipelineModel
model = PipelineModel.load(modelpath)

This worked for me.

You can also directly save and load XGBoostClassifier or XGBRegressor, since they have JavaWriter as their parent class. One point to note here: if you are training on a distributed system, you will have to save the model to a distributed storage system like HDFS or Amazon S3.
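A short sketch of that direct save/load path. This is illustrative only: it assumes the Python wrapper from this repo exposes `XGBoostClassifier` and `XGBoostClassificationModel` under a `sparkxgb` package, that `xgb_params`, `FEATURES`, `LABEL`, and `trainingData` are defined as in the snippets above, and the HDFS path is hypothetical.

```python
# Sketch: saving/loading the XGBoost stage directly rather than the
# whole PipelineModel. Adjust imports to whichever wrapper you use;
# the class names here are assumptions, not verified against the repo.
from sparkxgb import XGBoostClassifier, XGBoostClassificationModel

model_path = "hdfs:///models/xgb-direct"  # hypothetical distributed path

xgb = (
    XGBoostClassifier(**xgb_params)
        .setFeaturesCol(FEATURES)
        .setLabelCol(LABEL)
)
xgb_model = xgb.fit(trainingData)

# JavaWriter / JavaReader supply the persistence machinery:
xgb_model.write().overwrite().save(model_path)
reloaded = XGBoostClassificationModel.load(model_path)
```

Because loading goes through the same wrapper classes, this only works when the `sparkxgb` Python package (or equivalent) is available on the driver at load time, which is also what the original traceback was missing.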