ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[air] Error while loading xgboost model in BatchPredictor #34307

Open shashwat-nks opened 1 year ago

shashwat-nks commented 1 year ago

What happened + What you expected to happen

We save and load a model trained with XGBoostTrainer as shown below, but loading fails most of the time (it occasionally works and produces predictions). This prevents us from predicting with a saved model.

Versions / Dependencies

Ray 2.3.1

Reproduction script

Save:

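import pickle
from ray.train.xgboost import XGBoostCheckpoint

# Save the model fit using the trainer (body of a method on our class).
model_name = self.model_cfg.get("model_name")
# Pickle the full Result object; the context manager closes the file.
with open(model_name, "wb") as f:
    pickle.dump(self.result, f)
# Also export the native XGBoost booster to a separate file.
ckp = XGBoostCheckpoint.from_checkpoint(self.result.checkpoint)
ckp.get_model().save_model(model_name + ".xgb")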

Load and Predict:

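import xgboost as xgb
from ray.train.batch_predictor import BatchPredictor
from ray.train.xgboost import XGBoostCheckpoint, XGBoostPredictor

# Load the native booster saved above and wrap it in an AIR checkpoint.
model = xgb.Booster()
model_name = self.model_cfg.get("model_name")
print("========== Loading model ===========", model_name)
model.load_model(model_name + ".xgb")
ckpt = XGBoostCheckpoint.from_model(model)
batch_predictor = BatchPredictor.from_checkpoint(ckpt, XGBoostPredictor)
predicted_labels = batch_predictor.predict(test_ds)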

Error being faced: split_1679035020608/work/src/tree/tree_model.cc:837: Check failed: fi->Read(dmlc::BeginPtr(nodes_), sizeof(Node) * nodes_.size()) == sizeof(Node) * nodes_.size() (980 vs. 10220) :

Issue Severity

High: It blocks me from completing my task.

xwjiang2010 commented 1 year ago

What happens if you do the following?

batch_predictor = BatchPredictor.from_checkpoint(
    self.result.checkpoint, XGBoostPredictor
)
predicted_labels = batch_predictor.predict(test_ds)

shashwat-nks commented 1 year ago

The snippet above behaves as expected: when we train and predict in the same run, we get the predictions.

xwjiang2010 commented 1 year ago

I mean: what if you skip the intermediate conversion to a native XGBoost model and just feed self.result.checkpoint into the batch predictor, as I showed above? Have you tried that?
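
For the save-then-load case, one way to follow this suggestion might be to persist the AIR checkpoint itself rather than the exported booster. The following is a minimal sketch, assuming the Ray 2.3.x AIR API (`Checkpoint.to_directory`, `Checkpoint.from_directory`, `BatchPredictor.from_checkpoint`). The helper names `save_air_checkpoint` and `load_batch_predictor` and the directory path are hypothetical, and nothing in this thread confirms that it avoids the error above.

```python
from typing import Any


def save_air_checkpoint(result: Any, directory: str) -> str:
    """Write result.checkpoint (a Ray AIR Checkpoint) to `directory`.

    `result` is assumed to be the Result returned by XGBoostTrainer.fit().
    """
    # Checkpoint.to_directory returns the path it wrote to.
    return result.checkpoint.to_directory(directory)


def load_batch_predictor(directory: str) -> Any:
    """Rebuild a BatchPredictor from a checkpoint directory written above."""
    # Imports are deferred so the helpers can be defined before Ray is loaded.
    from ray.train.batch_predictor import BatchPredictor
    from ray.train.xgboost import XGBoostCheckpoint, XGBoostPredictor

    ckpt = XGBoostCheckpoint.from_directory(directory)
    return BatchPredictor.from_checkpoint(ckpt, XGBoostPredictor)


# Hypothetical usage (the path is a placeholder, not taken from the report):
#   save_air_checkpoint(self.result, "/tmp/xgb_air_ckpt")
#   predicted_labels = load_batch_predictor("/tmp/xgb_air_ckpt").predict(test_ds)
```

This keeps the checkpoint in the format that XGBoostPredictor reads, so no manual round trip through xgb.Booster is involved.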