ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[tune] Resume fails when using Repeater if last trial before resume is incomplete #11918

Open ag-tcm opened 3 years ago

ag-tcm commented 3 years ago

What is the problem?

Resuming a search that uses tune.suggest.Repeater fails: upon resume, an error like the following is raised:

2020-11-10 11:36:11,185 ERROR repeater.py:159 -- Trial 20f0571e not in group; cannot report score. Seen trials: ['04662968']
2020-11-10 11:36:11,186 ERROR trial_runner.py:794 -- Trial easy_objective_20f0571e: Error processing event.
Traceback (most recent call last):
  File "C:\Users\Alex\AppData\Local\Programs\Python\Python37\lib\site-packages\ray\tune\trial_runner.py", line 747, in _process_trial
    trial.trial_id, result=flat_result)
  File "C:\Users\Alex\AppData\Local\Programs\Python\Python37\lib\site-packages\ray\tune\suggest\search_generator.py", line 156, in on_trial_complete
    trial_id=trial_id, result=result, error=error)
  File "C:\Users\Alex\AppData\Local\Programs\Python\Python37\lib\site-packages\ray\tune\suggest\repeater.py", line 160, in on_trial_complete
    trial_group = self._trial_id_to_group[trial_id]
KeyError: '20f0571e'
2020-11-10 11:36:11,220 ERROR repeater.py:159 -- Trial 20f0571e not in group; cannot report score. Seen trials: ['04662968']
Traceback (most recent call last):
  File "C:\Users\Alex\AppData\Local\Programs\Python\Python37\lib\site-packages\ray\tune\trial_runner.py", line 747, in _process_trial
    trial.trial_id, result=flat_result)
  File "C:\Users\Alex\AppData\Local\Programs\Python\Python37\lib\site-packages\ray\tune\suggest\search_generator.py", line 156, in on_trial_complete
    trial_id=trial_id, result=result, error=error)
  File "C:\Users\Alex\AppData\Local\Programs\Python\Python37\lib\site-packages\ray\tune\suggest\repeater.py", line 160, in on_trial_complete
    trial_group = self._trial_id_to_group[trial_id]
KeyError: '20f0571e'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "tune_test.py", line 77, in <module>
    **tune_kwargs)
  File "C:\Users\Alex\AppData\Local\Programs\Python\Python37\lib\site-packages\ray\tune\tune.py", line 424, in run
    runner.step()
  File "C:\Users\Alex\AppData\Local\Programs\Python\Python37\lib\site-packages\ray\tune\trial_runner.py", line 570, in step
    self._process_events()  # blocking
  File "C:\Users\Alex\AppData\Local\Programs\Python\Python37\lib\site-packages\ray\tune\trial_runner.py", line 711, in _process_events
    self._process_trial(trial)
  File "C:\Users\Alex\AppData\Local\Programs\Python\Python37\lib\site-packages\ray\tune\trial_runner.py", line 797, in _process_trial
    self._process_trial_failure(trial, traceback.format_exc())
  File "C:\Users\Alex\AppData\Local\Programs\Python\Python37\lib\site-packages\ray\tune\trial_runner.py", line 905, in _process_trial_failure
    self._search_alg.on_trial_complete(trial.trial_id, error=True)
  File "C:\Users\Alex\AppData\Local\Programs\Python\Python37\lib\site-packages\ray\tune\suggest\search_generator.py", line 156, in on_trial_complete
    trial_id=trial_id, result=result, error=error)
  File "C:\Users\Alex\AppData\Local\Programs\Python\Python37\lib\site-packages\ray\tune\suggest\repeater.py", line 160, in on_trial_complete
    trial_group = self._trial_id_to_group[trial_id]
KeyError: '20f0571e'
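For context, a toy illustration (not Ray code) of the failure mode: a freshly constructed Repeater starts with an empty trial-to-group mapping, so completions reported for trials created before the interrupt raise KeyError. The trial IDs below are taken from the log above; the dict name is a stand-in for Repeater's internal `_trial_id_to_group`.

```python
# Only trials created after the resume are known to the new Repeater.
trial_id_to_group = {"04662968": "group-0"}

def on_trial_complete(trial_id):
    # Mirrors repeater.py's lookup of self._trial_id_to_group[trial_id]
    return trial_id_to_group[trial_id]

try:
    on_trial_complete("20f0571e")  # a trial created before the interrupt
except KeyError as exc:
    print(f"Trial {exc.args[0]} not in group; cannot report score.")
```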

Ray version and other system information (Python version, TensorFlow version, OS): Ray Version: 1.1.0.dev0

Reproduction (REQUIRED)

Run the script below and let it complete a few trials, then interrupt it. Uncomment the resume=True argument, start the script again, and it will fail with an error like the one above.

"""This test checks that HyperOpt is functional.

It also checks that it is usable with a separate scheduler.
"""
import time

import ray
from ray import tune
from ray.tune.suggest import ConcurrencyLimiter
from ray.tune.schedulers import AsyncHyperBandScheduler
from ray.tune.suggest.hyperopt import HyperOptSearch

def evaluation_fn(step, width, height):
    return (0.1 + width * step / 100)**(-1) + height * 0.1

def easy_objective(config):
    # Hyperparameters
    width, height = config["width"], config["height"]

    for step in range(config["steps"]):
        # Iterative training function - can be any arbitrary training procedure
        intermediate_score = evaluation_fn(step, width, height)
        # Feed the score back to Tune.
        tune.report(iterations=step, mean_loss=intermediate_score)
        time.sleep(0.1)

if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--smoke-test", action="store_true", help="Finish quickly for testing")
    args, _ = parser.parse_known_args()
    ray.init(include_dashboard=False, local_mode=True, num_cpus=1)

    current_best_params = [
        {
            "width": 1,
            "height": 2,
            "activation": 0  # Activation will be relu
        },
        {
            "width": 4,
            "height": 2,
            "activation": 1  # Activation will be tanh
        }
    ]

    tune_kwargs = {
        "num_samples": 10 if args.smoke_test else 1000,
        "config": {
            "steps": 100,
            "width": tune.uniform(0, 20),
            "height": tune.uniform(-100, 100),
            # This is an ignored parameter.
            "activation": tune.choice(["relu", "tanh"])
        }
    }
    algo = HyperOptSearch(points_to_evaluate=current_best_params)

    from ray.tune.suggest import Repeater
    algo = Repeater(algo, repeat=10)

    tune.run(
        easy_objective,
        name='easy_objective',
        search_alg=algo,
        metric="mean_loss",
        mode="min",
        sync_to_driver=False,
        fail_fast=False,
        log_to_file=True,
        #resume=True,
        **tune_kwargs)
richardliaw commented 3 years ago

Hmm, I think the reason is that save/restore isn't yet implemented for the Repeater. This should be an easy thing to implement; would you be interested in making a contribution?
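The missing piece might look roughly like the sketch below: persist the Repeater's bookkeeping on save and reload it on restore, so resumed trials can be mapped back to their groups. The attribute name _trial_id_to_group comes from the traceback above; _groups is an assumption, and the class here is a standalone stand-in so the example runs without Ray installed.

```python
import os
import pickle
import tempfile

class RepeaterStateSketch:
    """Stand-in for Repeater's searcher state (not the real Ray class)."""

    def __init__(self):
        self._trial_id_to_group = {}  # name taken from the traceback
        self._groups = []             # assumed companion structure

    def save(self, checkpoint_path):
        # Persist the bookkeeping so a resumed run can report scores
        # for trials created before the interrupt.
        with open(checkpoint_path, "wb") as f:
            pickle.dump((self._trial_id_to_group, self._groups), f)

    def restore(self, checkpoint_path):
        with open(checkpoint_path, "rb") as f:
            self._trial_id_to_group, self._groups = pickle.load(f)

# Round-trip check: state written by one instance is visible to another.
src = RepeaterStateSketch()
src._trial_id_to_group["20f0571e"] = "group-0"
path = os.path.join(tempfile.mkdtemp(), "repeater.ckpt")
src.save(path)
dst = RepeaterStateSketch()
dst.restore(path)
```

In Ray Tune, save/restore hooks like these are what tune.run invokes when checkpointing the search algorithm, so wiring the real Repeater's state through them should make resume work.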

ag-tcm commented 3 years ago

Hey, sorry, I would like to but do not have the time to contribute right now.

apatel4746 commented 1 month ago

I found this issue on https://ovio.org/projects and would love to contribute! It seems like it's been a few years, so I was wondering whether this issue has already been resolved.