ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[bug] cannot run with too many parameters #29586

Closed · Rane90 closed this issue 2 years ago

Rane90 commented 2 years ago

What happened + What you expected to happen

While running a simple LightGBM CV HPO script taken from your code repo, Ray crashed with the following error:

Microsoft Windows [Version 10.0.22000.1098]
(c) Microsoft Corporation. All rights reserved.

C:\work\ray>C:/Users/Ran/miniconda3/Scripts/activate

(base) C:\work\ray>conda activate C:\Users\Ran\anaconda3\envs\ray

(ray) C:\work\ray>C:/Users/Ran/anaconda3/envs/ray/python.exe c:/work/ray/lightgbm_with_cv.py
2022-10-23 11:54:50,377 INFO worker.py:1509 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265 
2022-10-23 11:54:52,463 WARNING function_trainable.py:619 -- Function checkpointing is disabled. This may result in unexpected behavior when using checkpointing features or certain schedulers. To enable, set the train function arguments to be `func(config, checkpoint_dir=None)`.
Traceback (most recent call last):
  File "C:\Users\Ran\anaconda3\envs\ray\lib\site-packages\tensorboardX\record_writer.py", line 58, in open_file
    factory = REGISTERED_FACTORIES[prefix]
KeyError: 'C'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\Ran\anaconda3\envs\ray\lib\site-packages\ray\tune\execution\trial_runner.py", line 819, in _wait_and_handle_event
    self._on_pg_ready(next_trial)
  File "C:\Users\Ran\anaconda3\envs\ray\lib\site-packages\ray\tune\execution\trial_runner.py", line 909, in _on_pg_ready
    if not _start_trial(next_trial) and next_trial.status != Trial.ERROR:
  File "C:\Users\Ran\anaconda3\envs\ray\lib\site-packages\ray\tune\execution\trial_runner.py", line 901, in _start_trial
    self._callbacks.on_trial_start(
  File "C:\Users\Ran\anaconda3\envs\ray\lib\site-packages\ray\tune\callback.py", line 317, in on_trial_start
    callback.on_trial_start(**info)
  File "C:\Users\Ran\anaconda3\envs\ray\lib\site-packages\ray\tune\logger\logger.py", line 135, in on_trial_start
    self.log_trial_start(trial)
  File "C:\Users\Ran\anaconda3\envs\ray\lib\site-packages\ray\tune\logger\tensorboardx.py", line 179, in log_trial_start
    self._trial_writer[trial] = self._summary_writer_cls(
  File "C:\Users\Ran\anaconda3\envs\ray\lib\site-packages\tensorboardX\writer.py", line 301, in __init__
    self._get_file_writer()
  File "C:\Users\Ran\anaconda3\envs\ray\lib\site-packages\tensorboardX\writer.py", line 349, in _get_file_writer
    self.file_writer = FileWriter(logdir=self.logdir,
  File "C:\Users\Ran\anaconda3\envs\ray\lib\site-packages\tensorboardX\writer.py", line 105, in __init__
    self.event_writer = EventFileWriter(
  File "C:\Users\Ran\anaconda3\envs\ray\lib\site-packages\tensorboardX\event_file_writer.py", line 106, in __init__
    self._ev_writer = EventsWriter(os.path.join(
  File "C:\Users\Ran\anaconda3\envs\ray\lib\site-packages\tensorboardX\event_file_writer.py", line 43, in __init__
    self._py_recordio_writer = RecordWriter(self._file_name)
  File "C:\Users\Ran\anaconda3\envs\ray\lib\site-packages\tensorboardX\record_writer.py", line 179, in __init__
    self._writer = open_file(path)
  File "C:\Users\Ran\anaconda3\envs\ray\lib\site-packages\tensorboardX\record_writer.py", line 61, in open_file
    return open(path, 'wb')
FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\Ran\\ray_results\\train_breast_cancer_2022-10-23_11-54-47\\train_breast_cancer_5bb4f_00000_0_boosting_type=gbdt,learning_rate=0.0014,max_depth=22,min_child_samples=786,n_estimators=3142,num_2022-10-23_11-54-52\\events.out.tfevents.1666515292.DESKTOP-3P4PECB'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\Ran\anaconda3\envs\ray\lib\site-packages\ray\tune\tuner.py", line 234, in fit
    return self._local_tuner.fit()
  File "C:\Users\Ran\anaconda3\envs\ray\lib\site-packages\ray\tune\impl\tuner_internal.py", line 283, in fit
    analysis = self._fit_internal(trainable, param_space)
  File "C:\Users\Ran\anaconda3\envs\ray\lib\site-packages\ray\tune\impl\tuner_internal.py", line 380, in _fit_internal
    analysis = run(
  File "C:\Users\Ran\anaconda3\envs\ray\lib\site-packages\ray\tune\tune.py", line 722, in run
    runner.step()
  File "C:\Users\Ran\anaconda3\envs\ray\lib\site-packages\ray\tune\execution\trial_runner.py", line 872, in step
    self._wait_and_handle_event(next_trial)
  File "C:\Users\Ran\anaconda3\envs\ray\lib\site-packages\ray\tune\execution\trial_runner.py", line 851, in _wait_and_handle_event
    raise TuneError(traceback.format_exc())
ray.tune.error.TuneError: Traceback (most recent call last):
  File "C:\Users\Ran\anaconda3\envs\ray\lib\site-packages\tensorboardX\record_writer.py", line 58, in open_file
    factory = REGISTERED_FACTORIES[prefix]
KeyError: 'C'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\Ran\anaconda3\envs\ray\lib\site-packages\ray\tune\execution\trial_runner.py", line 819, in _wait_and_handle_event
    self._on_pg_ready(next_trial)
  File "C:\Users\Ran\anaconda3\envs\ray\lib\site-packages\ray\tune\execution\trial_runner.py", line 909, in _on_pg_ready
    if not _start_trial(next_trial) and next_trial.status != Trial.ERROR:
  File "C:\Users\Ran\anaconda3\envs\ray\lib\site-packages\ray\tune\execution\trial_runner.py", line 901, in _start_trial
    self._callbacks.on_trial_start(
  File "C:\Users\Ran\anaconda3\envs\ray\lib\site-packages\ray\tune\callback.py", line 317, in on_trial_start
    callback.on_trial_start(**info)
  File "C:\Users\Ran\anaconda3\envs\ray\lib\site-packages\ray\tune\logger\logger.py", line 135, in on_trial_start
    self.log_trial_start(trial)
  File "C:\Users\Ran\anaconda3\envs\ray\lib\site-packages\ray\tune\logger\tensorboardx.py", line 179, in log_trial_start
    self._trial_writer[trial] = self._summary_writer_cls(
  File "C:\Users\Ran\anaconda3\envs\ray\lib\site-packages\tensorboardX\writer.py", line 301, in __init__
    self._get_file_writer()
  File "C:\Users\Ran\anaconda3\envs\ray\lib\site-packages\tensorboardX\writer.py", line 349, in _get_file_writer
    self.file_writer = FileWriter(logdir=self.logdir,
  File "C:\Users\Ran\anaconda3\envs\ray\lib\site-packages\tensorboardX\writer.py", line 105, in __init__
    self.event_writer = EventFileWriter(
  File "C:\Users\Ran\anaconda3\envs\ray\lib\site-packages\tensorboardX\event_file_writer.py", line 106, in __init__
    self._ev_writer = EventsWriter(os.path.join(
  File "C:\Users\Ran\anaconda3\envs\ray\lib\site-packages\tensorboardX\event_file_writer.py", line 43, in __init__
    self._py_recordio_writer = RecordWriter(self._file_name)
  File "C:\Users\Ran\anaconda3\envs\ray\lib\site-packages\tensorboardX\record_writer.py", line 179, in __init__
    self._writer = open_file(path)
  File "C:\Users\Ran\anaconda3\envs\ray\lib\site-packages\tensorboardX\record_writer.py", line 61, in open_file
    return open(path, 'wb')
FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\Ran\\ray_results\\train_breast_cancer_2022-10-23_11-54-47\\train_breast_cancer_5bb4f_00000_0_boosting_type=gbdt,learning_rate=0.0014,max_depth=22,min_child_samples=786,n_estimators=3142,num_2022-10-23_11-54-52\\events.out.tfevents.1666515292.DESKTOP-3P4PECB'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "c:/work/ray/lightgbm_with_cv.py", line 139, in <module>
    results = tuner.fit()
  File "C:\Users\Ran\anaconda3\envs\ray\lib\site-packages\ray\tune\tuner.py", line 236, in fit
    raise TuneError(
ray.tune.error.TuneError: Tune run failed. Please use tuner = Tuner.restore("C:\Users\Ran\ray_results\train_breast_cancer_2022-10-23_11-54-47") to resume.

I assume this happens because the trial directory name, which embeds all of the hyperparameter values, is too long, and therefore the event file cannot be created.

Is there any "elegant" way to solve this? LightGBM cannot take other (shorter) parameter names.

Versions / Dependencies

These are my dependencies:

Package                  Version
------------------------ ---------
aiohttp                  3.8.3
aiohttp-cors             0.7.0
aiosignal                1.2.0
ansicon                  1.89.0
async-timeout            4.0.2
attrs                    22.1.0
autopep8                 1.7.0
blessed                  1.19.1
cachetools               5.2.0
cca-zoo                  1.13.2
certifi                  2022.9.24
charset-normalizer       2.1.1
click                    8.0.4
colorama                 0.4.5
colorful                 0.5.4
contourpy                1.0.5
cycler                   0.11.0
distlib                  0.3.6
filelock                 3.8.0
fonttools                4.37.2
frozenlist               1.3.1
google-api-core          2.10.2
google-auth              2.13.0
googleapis-common-protos 1.56.4
gpustat                  1.0.0
grpcio                   1.43.0
idna                     3.4
importlib-resources      5.10.0
iniconfig                1.1.1
jinxed                   1.2.0
joblib                   1.2.0
jsonschema               4.16.0
kiwisolver               1.4.4
lightgbm                 3.3.3
msgpack                  1.0.4
multidict                6.0.2
mvlearn                  0.5.0
nose                     1.3.7
numpy                    1.23.4
nvidia-ml-py             11.495.46
opencensus               0.11.0
opencensus-context       0.1.3
packaging                21.3
pandas                   1.5.1
Pillow                   9.2.0
pip                      22.2.2
pkgutil_resolve_name     1.3.10
platformdirs             2.5.2
pluggy                   1.0.0
prometheus-client        0.13.1
protobuf                 3.20.3
psutil                   5.9.3
py                       1.11.0
py-spy                   0.3.14
pyasn1                   0.4.8
pyasn1-modules           0.2.8
pycodestyle              2.9.1
pydantic                 1.10.2
pyparsing                3.0.9
pyrsistent               0.18.1
pytest                   7.1.3
python-dateutil          2.8.2
pytz                     2022.5
PyYAML                   6.0
ray                      2.0.0
requests                 2.28.1
rsa                      4.9
scikit-learn             1.1.2
scipy                    1.9.3
setuptools               63.4.1
six                      1.16.0
smart-open               6.2.0
tabulate                 0.9.0
tensorboardX             2.5.1
tensorly                 0.7.0
threadpoolctl            3.1.0
toml                     0.10.2
tomli                    2.0.1
typing_extensions        4.4.0
urllib3                  1.26.12
virtualenv               20.16.5
wcwidth                  0.2.5
wheel                    0.37.1
wincertstore             0.2
yarl                     1.8.1
zipp                     3.9.0

Reproduction script

This is the script; the only thing I've changed is the parameters for the optimization:

import lightgbm as lgb
import sklearn.datasets
import sklearn.metrics
from sklearn.model_selection import train_test_split

from ray import tune
from ray.tune.schedulers import ASHAScheduler
from ray.tune.integration.lightgbm import (
    TuneReportCheckpointCallback,
    TuneReportCallback,
)

def train_breast_cancer(config: dict):
    # This is a simple training function to be passed into Tune

    # Load dataset
    data, target = sklearn.datasets.load_breast_cancer(return_X_y=True)

    # Split into train and test set
    train_x, test_x, train_y, test_y = train_test_split(
        data, target, test_size=0.25)

    # Build input Datasets for LightGBM
    train_set = lgb.Dataset(train_x, label=train_y)
    test_set = lgb.Dataset(test_x, label=test_y)

    # Train the classifier, using the Tune callback
    lgb.train(
        config,
        train_set,
        valid_sets=[test_set],
        valid_names=["eval"],
        verbose_eval=False,
        callbacks=[
            TuneReportCheckpointCallback(
                {
                    "binary_error": "eval-binary_error",
                    "binary_logloss": "eval-binary_logloss",
                }
            )
        ],
    )

def train_breast_cancer_cv(config: dict):
    # This is a simple training function to be passed into Tune, using
    # lightgbm's cross validation functionality

    # Load dataset
    data, target = sklearn.datasets.load_breast_cancer(return_X_y=True)

    train_set = lgb.Dataset(data, label=target)

    # Run CV, using the Tune callback
    lgb.cv(
        config,
        train_set,
        verbose_eval=False,
        stratified=True,
        # Checkpointing is not supported for CV
        # LightGBM aggregates metrics over folds automatically
        # with the cv_agg key. Both mean and standard deviation
        # are provided.
        callbacks=[
            TuneReportCallback(
                {
                    "binary_error": "cv_agg-binary_error-mean",
                    "binary_logloss": "cv_agg-binary_logloss-mean",
                    "binary_error_stdv": "cv_agg-binary_error-stdv",
                    "binary_logloss_stdv": "cv_agg-binary_logloss-stdv",
                },
            )
        ],
    )

if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--server-address",
        type=str,
        default=None,
        required=False,
        help="The address of server to connect to if using Ray Client.",
    )
    parser.add_argument(
        "--use-cv", action="store_true", help="Use `lgb.cv` instead of `lgb.train`."
    )
    args, _ = parser.parse_known_args()

    if args.server_address:
        import ray

        ray.init(f"ray://{args.server_address}")

    # config = {
    #     "objective": "binary",
    #     "metric": ["binary_error", "binary_logloss"],
    #     "verbose": -1,
    #     "boosting_type": tune.grid_search(["gbdt", "dart"]),
    #     "num_leaves": tune.randint(10, 100),
    #     "learning_rate": tune.loguniform(0.01, 0.5),
    #     "max_depth": tune.randint(1, 30),
    #     "n_estimators": tune.randint(1, 10000),
    #     "min_child_samples": tune.randint(2, 1000),
    #     "min_child_weight": tune.loguniform(1e-5, 1e4),
    #     "colsample_bytree": tune.uniform(0.1, 1),
    #     "reg_alpha": tune.loguniform(1e-5, 100),
    # }

    config = {
        "objective": "binary",
        "metric": ["binary_error", "binary_logloss"],
        "verbose": -1,
        "boosting_type": tune.grid_search(["gbdt", "dart"]),
        "num_leaves": tune.randint(10, 1000),
        "learning_rate": tune.loguniform(1e-8, 1e-1),
        "max_depth": tune.randint(1, 30),
        "n_estimators": tune.randint(1, 10000),
        "min_child_samples": tune.randint(2, 1000),
        # "min_child_weight": tune.loguniform(1e-5, 1e4),
        # "colsample_bytree": tune.uniform(0.1, 1),
    }

    tuner = tune.Tuner(
        train_breast_cancer if not args.use_cv else train_breast_cancer_cv,
        tune_config=tune.TuneConfig(
            metric="binary_error",
            mode="min",
            num_samples=2,
            scheduler=ASHAScheduler(),
        ),
        param_space=config,
    )
    results = tuner.fit()

    print("Best hyperparameters found were: ",
          results.get_best_result().config)

Issue Severity

High: It blocks me from completing my task.

Rane90 commented 2 years ago

I've managed to find a "workaround" by setting the TUNE_DISABLE_AUTO_CALLBACK_LOGGERS environment variable to 1:

import os

# Set this before calling Tuner.fit() so Tune skips its default logger callbacks.
os.environ["TUNE_DISABLE_AUTO_CALLBACK_LOGGERS"] = "1"
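
If I understand correctly, this works because Tune then skips attaching its default logger callbacks, including the TensorBoardX logger that was trying to create the over-long event-file path, at the cost of losing those logs.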

Hope this helps others

jbedorf commented 2 years ago

You can also use a custom function that creates a shorter directory name, by setting the trial_dirname_creator argument of the tune.run call (see the sketch below). I'm not sure whether you can specify that via the Tuner class, though.
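
For example, a minimal sketch (the function name and the use of the trial ID are illustrative; any mapping from a trial to a short, unique directory name would do):

from ray import tune

def trial_dirname_creator(trial) -> str:
    # Use only the short, unique trial ID instead of the default
    # directory name, which embeds every hyperparameter value and
    # can exceed Windows' 260-character path limit.
    return f"trial_{trial.trial_id}"

# `train_breast_cancer_cv` and `config` are from the reproduction
# script above.
tune.run(
    train_breast_cancer_cv,
    config=config,
    trial_dirname_creator=trial_dirname_creator,
)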

bveeramani commented 2 years ago

Hey @Rane90, thanks for opening an issue!

As you mentioned, I suspect the issue has to do with Windows' 260-character path limit.

By setting the trial_dirname_creator argument of the tune.run call. I'm not sure whether you can specify that via the Tuner class, though.

You can set trial_dirname_creator with the _tuner_kwargs parameter. That said, this is an implementation detail -- it might be removed without warning.

tuner = tune.Tuner(
    objective,
    ...,
    _tuner_kwargs={"trial_dirname_creator": trial_dirname_creator},
)
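
Here, trial_dirname_creator would be a function like the sketch above that maps a Trial to a short directory name.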
bveeramani commented 2 years ago

Closing this issue because you've found a workaround. Feel free to re-open it to continue the discussion!