ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Tune] Duplicated directories Using train.RunConfig #43469

Open Songloading opened 8 months ago

Songloading commented 8 months ago

What happened + What you expected to happen

Hi there, I'm using ray.tune.Tuner to run tuning jobs on my model. As background: my trainable is just a skeleton, so I can substitute different models into it depending on the scenario. I don't know which model will be used until runtime, so I have to pass in the parameters for all potential models and use a variable to control which model, and which of its parameters, is used.

Say I have two different models and want to run them serially by instantiating two tuners and fitting them. Everything goes well until I try to store the results in different paths using ray.train.RunConfig. In the code snippet below, for example, I expect both directories to contain 4 sub-directories with different results, but the second directory also contains the sub-directories that are supposed to be only in the first one. I expect the results to be in separate directories. Is there anything I am missing here?

Versions / Dependencies

OS: Rocky Linux 9.3
Python: 3.11.8
ray: 2.9.3

Reproduction script

import ray
from ray import tune
import time

hparams_space_1 = {
    "x": tune.grid_search([0, 1]),
    "y": tune.grid_search([0, 1]),
}
hparams_space_2 = {
    "m": tune.grid_search([0, 1]),
    "n": tune.grid_search([0, 1]),
}


def trial_str_creator(trial):
    print(trial.trial_id)
    return trial.trial_id


@ray.remote(num_returns=2)
def get_score(config, ctl):
    if ctl:
        return config["x"] ** 2, config["y"] * -1
    else:
        return config["m"] ** 2, config["n"] * -1


def run_trial(config, ctl):
    score, score2 = ray.get(get_score.remote(config, ctl))
    ret = {"score": score, "score2": score2}
    ray.train.report(ret)


if __name__ == "__main__":
    Tuner1 = tune.Tuner(
        tune.with_parameters(
            tune.with_resources(
                trainable=run_trial,
                resources=tune.PlacementGroupFactory([{}, {"CPU": 2}]),
            ),
            ctl=1,
        ),
        param_space=hparams_space_1,
        run_config=ray.train.RunConfig(storage_path="~/tmp_result1"),
        # tune_config=ray.tune.TuneConfig(trial_name_creator=trial_str_creator, trial_dirname_creator=trial_str_creator)
    )
    Tuner2 = tune.Tuner(
        tune.with_parameters(
            tune.with_resources(
                trainable=run_trial,
                resources=tune.PlacementGroupFactory([{}, {"CPU": 2}]),
            ),
            ctl=0,
        ),
        param_space=hparams_space_2,
        run_config=ray.train.RunConfig(storage_path="~/tmp_result2"),
        # tune_config=ray.tune.TuneConfig(trial_name_creator=trial_str_creator, trial_dirname_creator=trial_str_creator)
    )
    Tuner1.fit()
    Tuner2.fit()

Directory 1 (results are as expected):

[user] ~/tmp_result1/run_trial_2024-02-27_14-41-21% ls
 basic-variant-state-2024-02-27_14-41-28.json  'run_trial_93cfd_00000_0_x=0,y=0_2024-02-27_14-41-28'  'run_trial_93cfd_00002_2_x=0,y=1_2024-02-27_14-41-28'   tuner.pkl
 experiment_state-2024-02-27_14-41-28.json     'run_trial_93cfd_00001_1_x=1,y=0_2024-02-27_14-41-28'  'run_trial_93cfd_00003_3_x=1,y=1_2024-02-27_14-41-28'

Directory 2 (results are NOT as expected, since it also contains the results from directory 1):

[user]~/tmp_result2/run_trial_2024-02-27_14-41-21% ls
 basic-variant-state-2024-02-27_14-41-28.json   experiment_state-2024-02-27_14-41-30.json             'run_trial_93cfd_00002_2_x=0,y=1_2024-02-27_14-41-28'  'run_trial_9668b_00001_1_m=1,n=0_2024-02-27_14-41-30'   tuner.pkl
 basic-variant-state-2024-02-27_14-41-30.json  'run_trial_93cfd_00000_0_x=0,y=0_2024-02-27_14-41-28'  'run_trial_93cfd_00003_3_x=1,y=1_2024-02-27_14-41-28'  'run_trial_9668b_00002_2_m=0,n=1_2024-02-27_14-41-30'
 experiment_state-2024-02-27_14-41-28.json     'run_trial_93cfd_00001_1_x=1,y=0_2024-02-27_14-41-28'  'run_trial_9668b_00000_0_m=0,n=0_2024-02-27_14-41-30'  'run_trial_9668b_00003_3_m=1,n=1_2024-02-27_14-41-30'

Issue Severity

High: It blocks me from completing my task.

Songloading commented 8 months ago

The problem seems somewhat related to #43092, but is not exactly the same.

justinvyu commented 7 months ago

@Songloading This is similar to https://github.com/ray-project/ray/issues/38522. The problem is that the local staging directory name is generated from a datestring of the current time, and the names collide when two tuners are created one right after the other.
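As a rough illustration of that collision (not Ray's actual implementation), the sketch below builds a directory name from the trainable name plus a timestamp with one-second resolution, matching the `run_trial_2024-02-27_14-41-21` names in the listings above. The helper `default_dir_name` is hypothetical and exists only for this example.

```python
from datetime import datetime


def default_dir_name(trainable_name: str, now: datetime) -> str:
    # Hypothetical helper: mimics a name of the form
    # "<trainable>_<YYYY-MM-DD_HH-MM-SS>", as seen in the directory listings.
    return f"{trainable_name}_{now.strftime('%Y-%m-%d_%H-%M-%S')}"


# Two tuners created within the same second see the same timestamp.
created_at = datetime(2024, 2, 27, 14, 41, 21)
name_1 = default_dir_name("run_trial", created_at)
name_2 = default_dir_name("run_trial", created_at)

print(name_1)            # run_trial_2024-02-27_14-41-21
print(name_1 == name_2)  # True: both tuners would share one directory name
```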

Can you change your code to create and fit each tuner in turn, so that the second tuner is only created after the first one has finished?

tuner1 = Tuner(...)
tuner1.fit()

tuner2 = Tuner(...)
tuner2.fit()

Or, an alternative is to set a unique RunConfig(name) for each tuner, since they seem to correspond to different experiments.
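A minimal sketch of the second option, assuming the `name` and `storage_path` arguments of `ray.train.RunConfig`. The experiment names `model_1` and `model_2` and the helpers `run_config_kwargs` and `build_tuner` are made up for this example; the storage paths are the ones from the reproduction script.

```python
def run_config_kwargs(experiment_name: str, storage_path: str) -> dict:
    """Keyword arguments for ray.train.RunConfig with an explicit experiment name."""
    return {"name": experiment_name, "storage_path": storage_path}


# Distinct, explicit names mean the experiment directories no longer
# depend on the creation timestamp.
kwargs_1 = run_config_kwargs("model_1", "~/tmp_result1")
kwargs_2 = run_config_kwargs("model_2", "~/tmp_result2")


def build_tuner(trainable, param_space, kwargs):
    # Ray is imported here so the helper above can be used without a Ray install.
    from ray import train, tune

    return tune.Tuner(
        trainable,
        param_space=param_space,
        run_config=train.RunConfig(**kwargs),
    )
```

With this, `build_tuner(trainable_1, hparams_space_1, kwargs_1)` and `build_tuner(trainable_2, hparams_space_2, kwargs_2)` could be created back to back, where `trainable_1` and `trainable_2` stand for the wrapped `run_trial` trainables from the reproduction script.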