ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

Saving XGBoost model with json extension #41374

Open farridav opened 10 months ago

farridav commented 10 months ago

Description

I need to use categorical encoding in my XGBoost model, but when I do, the checkpoints that the trainer saves fail, because the model is not saved in JSON/UBJSON format. Here's what I get:

xgboost.core.XGBoostError: [18:26:58] ../src/tree/tree_model.cc:869: Check failed: !HasCategoricalSplit(): Please use JSON/UBJSON for saving models with categorical splits.

Unfortunately, I'm not able to change the model filename that is saved with the checkpoint, as it is hardcoded via ray.air.constants.MODEL_KEY: https://github.com/ray-project/ray/blob/master/python/ray/air/constants.py#L5C6-L5C6

Is there a way for me to save the model checkpoints with a .json extension, or to override this somehow?

Here's an excerpt from my implementation:

trainer = XGBoostTrainer(
    dmatrix_params={"train": {"enable_categorical": True}},
    scaling_config=ScalingConfig(
        num_workers=workers,
        use_gpu=False,
        trainer_resources={"CPU": 0},
        resources_per_worker={"CPU": cpus_per_workers},
    ),
    run_config=RunConfig(
        checkpoint_config=CheckpointConfig(num_to_keep=3, checkpoint_at_end=True),
        sync_config=SyncConfig(upload_dir=f"{bucket}/sync_config"),
        name="model.json",
    ),
    label_column="units",
    num_boost_round=2,
    params={
        "objective": "reg:squarederror",
        "eval_metric": ["rmse", "mae"],
        "tree_method": "hist",
        "max_depth": 12,
        "eta": 0.01,
        "subsample": 0.8,
        "colsample_bytree": 0.8,
    },
    datasets={"train": datasets[DataSplit.TRAIN]},
)
result: Result = trainer.fit()
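
For context, the error above is raised by XGBoost itself (xgboost.core / tree_model.cc) rather than by Ray: a booster that contains categorical splits can only be serialized to JSON or UBJSON, and XGBoost picks the format from the file extension. A minimal sketch of that behaviour outside of Ray, with toy data and illustrative filenames:

import pandas as pd
import xgboost as xgb

# Toy training data with one pandas-categorical feature (illustrative only).
X = pd.DataFrame({
    "color": pd.Categorical(["red", "blue", "red", "green"]),
    "size": [1.0, 2.0, 3.0, 4.0],
})
y = [10.0, 20.0, 15.0, 30.0]

dtrain = xgb.DMatrix(X, label=y, enable_categorical=True)
booster = xgb.train(
    {"objective": "reg:squarederror", "tree_method": "hist"},
    dtrain,
    num_boost_round=2,
)

booster.save_model("model.json")  # OK: JSON supports categorical splits
booster.save_model("model.ubj")   # OK: UBJSON supports categorical splits
# booster.save_model("model.bin") # old binary format: raises the XGBoostError above
#                                 # whenever the model contains categorical splits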

Use case

Training an XGBoost model that uses categorically encoded data, saving it, and then running batch predictions from it in a separate step.

farridav commented 10 months ago

For the benefit of others, I've managed to solve this problem with the following implementation:

class MyXGBoostTrainer(XGBoostTrainer):
    def _save_model(self, model: xgboost.Booster, path: str) -> None:
        model.save_model(path + ".ubj")

I'm then using that class instead. I'll leave this ticket open in case there is a cleaner, more config-driven approach to this. Thanks!
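
The saved artifact can then be read back in the separate prediction step with plain XGBoost. A minimal sketch, assuming (as noted above) that the filename the trainer passes in is the hardcoded MODEL_KEY and that checkpoint_dir below is a placeholder for wherever the checkpoint landed:

import os
import xgboost

from ray.air.constants import MODEL_KEY

# Placeholder path to a synced trial checkpoint directory.
checkpoint_dir = "/path/to/checkpoint_000002"

booster = xgboost.Booster()
# The override above appends ".ubj" to the path built from MODEL_KEY.
booster.load_model(os.path.join(checkpoint_dir, MODEL_KEY + ".ubj"))

# Categorical splits survive the UBJSON round trip, so prediction works as usual:
# dmatrix = xgboost.DMatrix(batch_df, enable_categorical=True)
# preds = booster.predict(dmatrix)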

matthewdeng commented 10 months ago

Hey @farridav, can you share which version of Ray you are using?

If you are using the latest version, I believe it should already be saving it with the .json extension.

https://github.com/ray-project/ray/blob/8919bf0d1f12b6fbf515b6364c873544cf0ca25b/python/ray/train/xgboost/xgboost_trainer.py#L109

https://github.com/ray-project/ray/blob/8919bf0d1f12b6fbf515b6364c873544cf0ca25b/python/ray/train/xgboost/xgboost_checkpoint.py#L18
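
On those versions, the booster can also be pulled back out through the XGBoostCheckpoint class defined in the second link. A rough sketch, assuming result is the Result returned by trainer.fit() and that the installed release still exposes get_model() (the exact checkpoint helpers vary a little across Ray versions):

from ray.train.xgboost import XGBoostCheckpoint

# result.checkpoint is the checkpoint from the training run above.
ckpt = XGBoostCheckpoint.from_directory(result.checkpoint.to_directory())
booster = ckpt.get_model()  # an xgboost.Booster, ready for booster.predict(...)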

farridav commented 10 months ago

I'm currently pinned to 2.4, though I'll see if my vendor can help us get upgraded.

Even when we do, though, I imagine we'll run into the same difficulty when trying to utilise the .ubj format for model saving.

Are there any plans to make this property configurable? We also hit the same constraints within the BatchPredictor.

Thanks for looking into this

matthewdeng commented 10 months ago

Hm, given that UBJ is now the default format for XGBoost, would it be satisfactory if we just updated the checkpoint to use .ubj, or is there still a need for configurability?

matthewdeng commented 10 months ago

Also, as an FYI, the BatchPredictor interface is now deprecated; see https://github.com/ray-project/ray/issues/37489. The new recommended pattern should give you the flexibility to define how you load the checkpoint with your own custom behavior.
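
For reference, a rough sketch of that pattern with Ray Data, using a callable class with map_batches so each actor loads the model once. The class name, paths, and pool size below are illustrative, and the actor-pool argument differs slightly across Ray versions (compute= vs. concurrency=):

import ray
import xgboost


class XGBoostPredictor:
    """Stateful batch-inference worker: loads the booster once per actor."""

    def __init__(self, model_path: str):
        self.booster = xgboost.Booster()
        self.booster.load_model(model_path)  # e.g. the .ubj/.json file from the checkpoint

    def __call__(self, batch):
        # batch is a pandas DataFrame because batch_format="pandas" below.
        dmatrix = xgboost.DMatrix(batch, enable_categorical=True)
        batch["predictions"] = self.booster.predict(dmatrix)
        return batch


ds = ray.data.read_parquet("s3://my-bucket/features/")  # illustrative input path
predictions = ds.map_batches(
    XGBoostPredictor,
    fn_constructor_kwargs={"model_path": "/path/to/checkpoint/model.ubj"},
    batch_format="pandas",
    compute=ray.data.ActorPoolStrategy(size=2),
)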

farridav commented 10 months ago

Defaulting to .ubj or JSON satisfies my use case, though I can't speak for other use cases.

Thanks for the heads up on BatchPredictor, I'll move towards that pattern.