ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

Problem with YOLOv8 Hyperparameters tuning #42770

Open 2019331099-Rabbi opened 8 months ago

2019331099-Rabbi commented 8 months ago

What happened + What you expected to happen

I am running the notebook in Kaggle.

!pip install -U "ray[tune]"

from ray import train, tune
def objective(config):
    project = 'Test_KFold'

    model = YOLO('yolov8m.pt')
    dataset_yaml = yaml_paths[0]
    model.train(data=dataset_yaml,
                batch=16,
                project=project,
                epochs=1,
                verbose=False,
                workers=28,
    )
    result = model.metrics
    mAP50_95_value = result.results_dict.get('metrics/mAP50-95(B)')
    clear_output()
    shutil.rmtree('/kaggle/working/Test_KFold', ignore_errors=True)
    return {"score": mAP50_95_value} 

search_space = {
    'lr0': tune.uniform(1e-3, 1e-0),
    'lrf': tune.uniform(0.01, 1.0),
}

tuner = tune.Tuner(objective, param_space=search_space)
results = tuner.fit()
print(results.get_best_result(metric="score", mode="max").config)
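
A side note on the objective above: as posted, it never reads `config`, so the sampled `lr0` and `lrf` values would not reach `model.train`. This is separate from the startup error reported below. A minimal sketch of forwarding them, assuming the intent is to tune those two Ultralytics training arguments; the helper name `build_train_kwargs` is hypothetical and `dataset.yaml` is a placeholder path.

```python
def build_train_kwargs(config, data, project="Test_KFold"):
    """Merge Tune-sampled hyperparameters into the fixed training arguments."""
    fixed = {
        "data": data,
        "batch": 16,
        "project": project,
        "epochs": 1,
        "verbose": False,
        "workers": 28,
    }
    # Only forward the keys declared in the search space (lr0, lrf).
    tuned = {key: config[key] for key in ("lr0", "lrf") if key in config}
    return {**fixed, **tuned}


# Inside objective(config), the training call would then become:
#     model.train(**build_train_kwargs(config, yaml_paths[0]))
kwargs = build_train_kwargs({"lr0": 0.01, "lrf": 0.1}, data="dataset.yaml")
print(kwargs["lr0"], kwargs["lrf"])
```

Keeping the fixed arguments and the tuned ones in separate dictionaries makes it easy to see which values Tune controls.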

When I run the tuning code above, the following errors are generated:

2024-01-27 04:04:09,187 ERROR services.py:1207 -- Failed to start the dashboard, return code -11
2024-01-27 04:04:09,190 ERROR services.py:1232 -- Error should be written to 'dashboard.log' or 'dashboard.err'. We are printing the last 20 lines for you. See 'https://docs.ray.io/en/master/ray-observability/ray-logging.html#logging-directory-structure' to find where the log file is.
2024-01-27 04:04:09,191 ERROR services.py:1276 -- 
The last 20 lines of /tmp/ray/session_2024-01-27_04-04-07_056118_26/logs/dashboard.log (it contains the error message from the dashboard): 
2024-01-27 04:04:09,140 INFO head.py:254 -- Starting dashboard metrics server on port 44227

Unexpected exception formatting exception. Falling back to standard exception
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/ray/_private/node.py", line 293, in __init__
    gcs_server_port = os.getenv(ray_constants.GCS_PORT_ENVIRONMENT_VARIABLE)
  File "/opt/conda/lib/python3.10/site-packages/ray/_private/services.py", line 459, in wait_for_node
    if node_plasma_store_socket_name in object_store_socket_names:
TimeoutError: Timed out after 30 seconds while waiting for node to startup. Did not find socket name /tmp/ray/session_2024-01-27_04-04-07_056118_26/sockets/plasma_store in the list of object store socket names.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3508, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "/tmp/ipykernel_26/2480291194.py", line 26, in <module>
    results = tuner.fit()
  File "/opt/conda/lib/python3.10/site-packages/ray/tune/tuner.py", line 347, in fit
    """Executes hyperparameter tuning job as configured and returns result.
  File "/opt/conda/lib/python3.10/site-packages/ray/tune/impl/tuner_internal.py", line 588, in fit
    mode=self._tune_config.mode,
  File "/opt/conda/lib/python3.10/site-packages/ray/tune/impl/tuner_internal.py", line 703, in _fit_internal
  File "/opt/conda/lib/python3.10/site-packages/ray/tune/tune.py", line 573, in run
    all_start = time.time()
  File "/opt/conda/lib/python3.10/site-packages/ray/tune/tune.py", line 225, in _ray_auto_init
  File "/opt/conda/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/ray/_private/worker.py", line 1514, in init
  File "/opt/conda/lib/python3.10/site-packages/ray/_private/node.py", line 298, in __init__
Exception: The current node timed out during startup. This could happen because some of the Ray processes failed to startup.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 2105, in showtraceback
    stb = self.InteractiveTB.structured_traceback(
  File "/opt/conda/lib/python3.10/site-packages/IPython/core/ultratb.py", line 1428, in structured_traceback
    return FormattedTB.structured_traceback(
  File "/opt/conda/lib/python3.10/site-packages/IPython/core/ultratb.py", line 1319, in structured_traceback
    return VerboseTB.structured_traceback(
  File "/opt/conda/lib/python3.10/site-packages/IPython/core/ultratb.py", line 1172, in structured_traceback
    formatted_exception = self.format_exception_as_a_whole(etype, evalue, etb, number_of_lines_of_context,
  File "/opt/conda/lib/python3.10/site-packages/IPython/core/ultratb.py", line 1087, in format_exception_as_a_whole
    frames.append(self.format_record(record))
  File "/opt/conda/lib/python3.10/site-packages/IPython/core/ultratb.py", line 969, in format_record
    frame_info.lines, Colors, self.has_colors, lvals
  File "/opt/conda/lib/python3.10/site-packages/IPython/core/ultratb.py", line 792, in lines
    return self._sd.lines
  File "/opt/conda/lib/python3.10/site-packages/stack_data/utils.py", line 144, in cached_property_wrapper
    value = obj.__dict__[self.func.__name__] = self.func(obj)
  File "/opt/conda/lib/python3.10/site-packages/stack_data/core.py", line 734, in lines
    pieces = self.included_pieces
  File "/opt/conda/lib/python3.10/site-packages/stack_data/utils.py", line 144, in cached_property_wrapper
    value = obj.__dict__[self.func.__name__] = self.func(obj)
  File "/opt/conda/lib/python3.10/site-packages/stack_data/core.py", line 681, in included_pieces
    pos = scope_pieces.index(self.executing_piece)
  File "/opt/conda/lib/python3.10/site-packages/stack_data/utils.py", line 144, in cached_property_wrapper
    value = obj.__dict__[self.func.__name__] = self.func(obj)
  File "/opt/conda/lib/python3.10/site-packages/stack_data/core.py", line 660, in executing_piece
    return only(
  File "/opt/conda/lib/python3.10/site-packages/executing/executing.py", line 190, in only
    raise NotOneValueFound('Expected one value, found 0')
executing.executing.NotOneValueFound: Expected one value, found 0

Versions / Dependencies

Pillow==9.5.0 numpy==1.24.3 pandas==2.0.3 ray==2.6.3 scikit-learn==1.2.2 tqdm==4.66.1

Reproduction script

# Install and update Ultralytics and Ray Tune packages
!pip install -U ultralytics "ray[tune]"

# Optionally install W&B for logging
!pip install -U wandb
!pip install -U ipywidgets

from ultralytics import YOLO

# Load a YOLOv8n model
model = YOLO('yolov8n.pt')

# Start tuning hyperparameters for YOLOv8n training on the COCO8 dataset
result_grid = model.tune(data='coco8.yaml', use_ray=True)

Issue Severity

High: It blocks me from completing my task.

matthewdeng commented 8 months ago

Taking a look at this, it looks like an issue with setting up the Ray cluster.

Can you try just running a script that calls ray.init()?
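
A minimal sketch of such a check, assuming Ray is importable in the notebook environment. The helper name `check_ray_startup` is hypothetical; `ray.init`, `ray.is_initialized`, `ray.cluster_resources`, and `ray.shutdown` are standard Ray calls. It only reports whether a local Ray instance starts, independent of Tune or YOLO.

```python
def check_ray_startup(**init_kwargs):
    """Try to start a local Ray instance and report the outcome.

    Returns (ok, detail): ok is True only when ray.init() succeeded.
    """
    try:
        import ray
    except ImportError as exc:
        return False, f"ray is not installed: {exc}"

    try:
        # ignore_reinit_error avoids a failure if Ray is already running
        # in this notebook kernel.
        ray.init(ignore_reinit_error=True, **init_kwargs)
        detail = f"Ray started; resources: {ray.cluster_resources()}"
        return True, detail
    except Exception as exc:
        return False, f"ray.init() failed: {exc!r}"
    finally:
        # Leave the kernel clean so a later tuner.fit() starts from scratch.
        if ray.is_initialized():
            ray.shutdown()
```

Running `ok, detail = check_ray_startup()` in a fresh cell and printing `detail` shows whether startup alone reproduces the timeout. Calling `check_ray_startup(include_dashboard=False)` is one variation that may be worth trying, since the log above shows the dashboard process failing to start; whether that failure is related to the timeout is an assumption, not something this thread confirms.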

rynewang commented 8 months ago

Yeah, it looks like Ray did not start up. Can you try these:

  1. Add import ray and then ray.init() at the beginning of the notebook.
  2. Find the log file /tmp/ray/session_latest/logs/raylet.err and post its contents here.
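
For the second step, a small sketch that prints the tail of the relevant logs from inside the notebook, assuming the default log directory named above. The helper name `tail_ray_logs` is hypothetical; it uses only the standard library and reads `raylet.err` plus the `dashboard.log` file mentioned in the error output.

```python
from collections import deque
from pathlib import Path


def tail_ray_logs(log_dir="/tmp/ray/session_latest/logs",
                  names=("raylet.err", "dashboard.log"), lines=50):
    """Return {file name: last `lines` lines}, or None for a missing file."""
    tails = {}
    for name in names:
        path = Path(log_dir) / name
        if not path.is_file():
            tails[name] = None  # log was never written or the path differs
            continue
        with path.open("r", errors="replace") as handle:
            # deque with maxlen keeps only the final `lines` lines in memory.
            tails[name] = "".join(deque(handle, maxlen=lines))
    return tails


for name, text in tail_ray_logs().items():
    print(f"===== {name} =====")
    print(text if text is not None else "(not found)")
```

The printed output can be pasted directly into a reply on this issue.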

anyscalesam commented 8 months ago

@2019331099-Rabbi, did you get a chance to try out @rynewang's advice?