sb-ai-lab / Py-Boost

Python based GBDT implementation on GPU. Efficient multioutput (multiclass/multilabel/multitask) training
Apache License 2.0

Warm start in Py-Boost #14

Closed. GradOpt closed this issue 8 months ago.

GradOpt commented 1 year ago

Hello, thank you for your efficient implementation of the GBDT algorithm and its handling of multioutput problems. I have a question about how to perform warm-start training with Py-Boost (SketchBoost), i.e., how to continue fitting an already fitted model, similar to warm_start in sklearn or init_model in xgboost and lightgbm. Are there any related parameters, or how could I implement this myself? Furthermore, does Py-Boost support, or how could I modify it to support, training as DART (Dropouts meet Multiple Additive Regression Trees), available in xgboost and lightgbm (dropping some trees during training and refitting the model, thereby balancing the contribution of each tree in the ensemble)?
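For concreteness, this is the kind of continued training I mean, shown here with lightgbm's init_model (a minimal sketch; params, train_set, and the round counts are placeholders):

import lightgbm as lgb

# train the first part of the ensemble
booster = lgb.train(params, train_set, num_boost_round=100)

# ...later, continue boosting on top of the already fitted booster
booster = lgb.train(params, train_set, num_boost_round=100, init_model=booster)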

btbpanda commented 1 year ago

Hello @OswaldHongyu. Thanks for your feedback, and my apologies for the late reply. Neither feature is supported by default, but warm_start behavior is easy to imitate with callbacks; see the example below. I may add this feature as a parameter in a later release.

import cupy as cp
from copy import copy

from py_boost import GradientBoosting, Callback


class WarmStart(Callback):
    """Initialize a new booster from an already fitted model."""

    def __init__(self, model):
        # keep a copy of the fitted model and disable its postprocessing,
        # so that it returns raw ensemble scores instead of transformed predictions
        model.to_cpu()
        self.model = copy(model)
        self.model.postprocess_fn = lambda x: x

    def before_train(self, build_info):
        # reuse the base score of the fitted model
        build_info['model'].base_score = cp.asarray(self.model.base_score)

        # initialize the train ensemble state with the old model's raw predictions
        train = build_info['data']['train']
        train['ensemble'] = cp.asarray(self.model.predict(train['features_cpu']))

        # do the same for each validation set
        valid = build_info['data']['valid']
        valid['ensemble'] = [cp.asarray(self.model.predict(x)) for x in valid['features_cpu']]

        # move the old model back to CPU to free GPU memory
        self.model.to_cpu()

        return

    def after_train(self, build_info):
        # prepend the old trees to the newly trained ones
        build_info['model'].models = self.model.models + build_info['model'].models
        # update the actual iteration count
        build_info['num_iter'] = build_info['num_iter'] + len(self.model.models)
        # update the actual best round (early stopping is the last callback)
        early_stop = build_info['model'].callbacks.callbacks[-1]
        early_stop.best_round = early_stop.best_round + len(self.model.models)

        # drop the reference so the old trees are not stored twice
        self.model = None

        return


# first, train a part of the model
model = GradientBoosting(...)
model.fit(...)
# do something in between
pass
# continue training on top of the fitted model
model = GradientBoosting(..., callbacks=[WarmStart(model)])
model.fit(...)
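To sanity-check that the warm start worked, you can inspect the combined ensemble after the second fit (a small sketch; X_test is a placeholder for your own test features):

# the ensemble now holds the trees from both training stages
print(len(model.models))        # old trees + new trees
preds = model.predict(X_test)   # predictions reflect the combined ensemble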

As for the DART booster mode, there are some problems. In theory, I could write a similar (maybe slightly more complex) callback to imitate DART, but I expect it would not be efficient. The reason is that implementing it requires keeping in memory the predictions of every tree trained so far, whereas currently only the full ensemble state and the last tree's predictions are stored at each step. For an extremely multioutput task this would require a lot of memory and would probably not fit on GPU, so the only choice is to store it in RAM. But that means a lot of CPU -> GPU data transfers at each step, which would likely be very slow. The other way is not to store the per-tree predictions at all, but to run inference for the dropped trees at each stage and subtract them from the ensemble state; that also looks inefficient. If it were a killer feature, I would implement it anyway, but whenever I tried DART it never worked for me, so I decided to ignore it. If you still want it, no matter how slow it turns out, I can try to do it sometime soon.
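To give a rough sense of the memory cost of the first option (storing every tree's raw predictions), here is a back-of-the-envelope estimate; the sizes are purely illustrative:

# memory needed to cache per-tree raw predictions in float32
n_trees, n_samples, n_outputs = 1000, 1_000_000, 100

bytes_per_value = 4  # float32
total_gb = n_trees * n_samples * n_outputs * bytes_per_value / 1024 ** 3
print(f'{total_gb:.0f} GB')  # ~373 GB, far beyond typical GPU memory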

btbpanda commented 1 year ago

@OswaldHongyu sorry again, I checked the code and found out that the WarmStart example did not account for early stopping. I have updated my comment above, and now it should work better.