
Feature Request: add timeout parameter to the .fit() method #6596


fingoldo commented 1 month ago

Adding a timeout parameter to the .fit() method, which would force the library to return the best solution found so far as soon as the given number of seconds has elapsed since the start of training, would make it possible to satisfy training SLAs when a user has only a limited time budget for model training. It would also enable fair comparisons between different hyperparameter combinations.

Reaching the timeout should have the same effect as reaching the maximum number of iterations, perhaps with an additional warning and/or an attribute set so that the reason training finished is clear to the end user.
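
For illustration, usage could look something like this (purely hypothetical; no such parameter exists today):

import lightgbm as lgb
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=100_000, n_features=50)

model = lgb.LGBMRegressor(n_estimators=10_000)
# hypothetical 'timeout' parameter: return the best model found within 60 seconds
model.fit(X, y, timeout=60)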

jameslamb commented 1 month ago

Thanks for using LightGBM and taking the time to open this.

I'm -1 on adding this to LightGBM. I understand why this might be useful, but I don't think LightGBM is the right place for this logic. This would introduce some non-trivial maintenance burden and complexity.

This would be better handled outside of LightGBM, in code that you use to invoke it.

Since you mentioned .fit(), I assume you're specifically talking about lightgbm (the Python package for LightGBM). You could, for example, use asyncio's built-in support for timing out Python function calls: https://docs.python.org/3/library/asyncio-task.html#timeouts.
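
A rough, untested sketch of that approach (it needs Python 3.9+ for asyncio.to_thread; one important caveat is that the timeout only abandons the result, it cannot interrupt the native training loop, so the worker thread keeps running until train() finishes on its own):

import asyncio

import lightgbm as lgb
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=10_000, n_features=20)
dtrain = lgb.Dataset(X, label=y)

async def main() -> None:
    try:
        # run the blocking train() call in a worker thread so the event loop
        # can enforce a wall-clock timeout on it
        bst = await asyncio.wait_for(
            asyncio.to_thread(
                lgb.train,
                params={"objective": "regression"},
                train_set=dtrain,
                num_boost_round=1000,
            ),
            timeout=2.0,
        )
    except asyncio.TimeoutError:
        # note: the worker thread is not killed; training continues in the
        # background until the current train() call completes
        print("training timed out")

asyncio.run(main())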

Alternatively, you could use a lightgbm callback for this purpose. Something like the following:

import lightgbm as lgb
from datetime import datetime
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=10_000, n_features=20)
dtrain = lgb.Dataset(X, label=y)

class TimeoutCallback:
    def __init__(self, timeout_seconds: int):
        # run after each boosting iteration, not before it
        self.before_iteration = False
        self.timeout_seconds = timeout_seconds
        self._start = datetime.utcnow()

    def __call__(self, *args, **kwargs) -> None:
        # abort training once elapsed wall-clock time exceeds the budget
        if (datetime.utcnow() - self._start).total_seconds() > self.timeout_seconds:
            raise RuntimeError(f"timing out: elapsed time has exceeded {self.timeout_seconds} seconds")

bst = lgb.train(
    params={
        "objective": "regression",
        "num_leaves": 100
    },
    train_set=dtrain,
    num_boost_round=1000,
    callbacks=[TimeoutCallback(2)]
)

I just tested that with LightGBM 4.5.0 and saw the following:

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001736 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 5100
[LightGBM] [Info] Number of data points in the train set: 10000, number of used features: 20
[LightGBM] [Info] Start training from score 0.256686
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/jlamb/miniforge3/envs/lgb-dev/lib/python3.11/site-packages/lightgbm/engine.py", line 317, in train
    cb(
  File "<stdin>", line 8, in __call__
RuntimeError: timing out: elapsed time has exceeded 2 seconds

That's not perfect, as it only runs after each iteration and individual iterations could run for much longer on a realistic dataset. But hopefully that imperfection also shows one example of how complex this would be to implement in LightGBM.

I'm only one vote here though, maybe other maintainers will have a different perspective.

fingoldo commented 1 month ago

I did not think of this approach! If I'm using early stopping, are the best "weights" applied to the model after this exception is thrown? In other words, is best_iter set correctly? The goal would be to stay within the time budget without losing the training progress made up to that point.

jameslamb commented 1 month ago

Oh interesting! It wasn't clear to me that you would want to see training time out but also keep that model.

No. In the Python package, best_iter and the rest of the early-stopping behavior are only set once early stopping is explicitly triggered, not continuously as training proceeds.

A Python exception is used to tell the training process that early stopping has been triggered, and to carry forward details like best iteration and evaluation results.

https://github.com/microsoft/LightGBM/blob/e7edb6cb1894f0d3847c3eaf3df9fc5a6b2414f9/python-package/lightgbm/callback.py#L436

https://github.com/microsoft/LightGBM/blob/e7edb6cb1894f0d3847c3eaf3df9fc5a6b2414f9/python-package/lightgbm/callback.py#L40-L44

https://github.com/microsoft/LightGBM/blob/e7edb6cb1894f0d3847c3eaf3df9fc5a6b2414f9/python-package/lightgbm/engine.py#L327-L330

You could rely on that behavior in your own callback, and have it raise a lightgbm.EarlyStopException instead of a RuntimeError like in my example. That'd allow you to treat "training has been running for too long" as a triggering condition for early stopping.
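
Adapting my earlier example, that could look roughly like this (untested sketch; env is the CallbackEnv object that lightgbm passes to callbacks, and reporting the current iteration as the best one is an approximation unless you also track the best score yourself):

import lightgbm as lgb
from datetime import datetime

class EarlyStoppingTimeoutCallback:
    def __init__(self, timeout_seconds: int):
        self.before_iteration = False
        self.timeout_seconds = timeout_seconds
        self._start = datetime.utcnow()

    def __call__(self, env) -> None:
        if (datetime.utcnow() - self._start).total_seconds() > self.timeout_seconds:
            # the training loop catches EarlyStopException and uses it to set
            # best_iteration and keep the evaluation results on the Booster
            raise lgb.EarlyStopException(env.iteration, env.evaluation_result_list)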

Alternatively... have you tried optuna? I haven't used this particular feature of it, but it looks like they directly offer a time_budget: https://optuna.readthedocs.io/en/v2.0.0/reference/generated/optuna.integration.lightgbm.LightGBMTuner.html

time_budget – A time budget for parameter tuning in seconds.

(that might be for the entire experiment though, not per-trial... I'm not sure)
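
I haven't tried it myself, but based on those docs usage would look roughly like this (dtrain and dvalid would be lgb.Dataset objects):

import optuna.integration.lightgbm as opt_lgb

tuner = opt_lgb.LightGBMTuner(
    params={"objective": "regression", "metric": "l2"},
    train_set=dtrain,
    valid_sets=[dvalid],
    time_budget=600,  # seconds; appears to cap the whole tuning run
)
tuner.run()
print(tuner.best_params)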

fingoldo commented 1 month ago

Hah! :) I'm planning to create my own hyperparameter tuner; that's one of the reasons I'm interested in this functionality. I can easily see how to do time budgeting at the level of the tuner, i.e. checking the clock in the hyperparameter-search loop after each combination has been tried, but the underlying estimator has to finish its training gracefully before that check runs, which for some combinations can take an extremely long time.

Writing a good hyperparameter optimizer is one more use case for this timeout feature. Now I think it's the early-stopping callback I should build on (I can hardly imagine training without early stopping), as in the sketch below.
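
Something like this composition is what I have in mind (a rough, untested sketch that wraps the public lgb.early_stopping callback with a wall-clock check rather than actually subclassing it; reporting the current iteration as best when the timeout fires is an approximation):

import lightgbm as lgb
from datetime import datetime

def early_stopping_with_timeout(stopping_rounds: int, timeout_seconds: float):
    # whichever condition fires first raises EarlyStopException, so
    # best_iteration gets recorded on the Booster either way
    inner = lgb.early_stopping(stopping_rounds=stopping_rounds)
    start = datetime.utcnow()

    def _callback(env) -> None:
        inner(env)  # may raise EarlyStopException on no improvement
        if (datetime.utcnow() - start).total_seconds() > timeout_seconds:
            raise lgb.EarlyStopException(env.iteration, env.evaluation_result_list)

    _callback.before_iteration = False
    return _callback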

Does it make sense to prepare a PR that adds a timeout parameter to the early-stopping callback?

That said, it still seems more natural to me to be able to specify a timeout directly in the estimator's fit or init methods, the same way we do with n_iters; in this case we are just interested in a maximum number of seconds rather than trees.