ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[tune] Error in BOHB, perhaps caused by different trainable instances running in the same Trial? #8455

Open kakakflo22thy opened 4 years ago

kakakflo22thy commented 4 years ago

What is the problem?

Ray version: 0.8.4. The error is as follows:

Failure # 1 (occurred at 2020-05-15_15-13-07)
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/ray/tune/trial_runner.py", line 491, in _process_trial
    self, trial, flat_result)
  File "/usr/local/lib/python3.6/site-packages/ray/tune/schedulers/hb_bohb.py", line 75, in on_trial_result
    bracket.update_trial_stats(trial, result)
  File "/usr/local/lib/python3.6/site-packages/ray/tune/schedulers/hyperband.py", line 387, in update_trial_stats
    assert delta >= 0, (result, self._live_trials[trial])
AssertionError: ({'y1-y2': 4.1352455559939285, 'rl_actions_kl': 4.220735888312163, 'effi_proms': 0.08549033231823473, 'done': False, 'timesteps_total': None, 'episodes_total': None, 'training_iteration': 1, 'experiment_id': 'da43b2881ee941f5845472dc4f2c4e93', 'date': '2020-05-15_15-13-07', 'timestamp': 1589526787, 'time_this_iter_s': 704.6713151931763, 'time_total_s': 704.6713151931763, 'pid': 2363017, 'hostname': 'ray-head-7554f464d9-xztj9', 'node_ip': '10.220.177.182', 'time_since_restore': 704.6713151931763, 'timesteps_since_restore': 0, 'iterations_since_restore': 1, 'config/mmd_epsilon': 2.4350544433433587e-05, 'config/sim_data_ratio': 0.3643946485499347, 'config/cost_epsilon': 0.00209137209284687, 'hyperband_info': {}}, {'y1-y2': 2.754341517953902, 'rl_actions_kl': 2.826289866474546, 'effi_proms': 0.0719483485206438, 'done': False, 'timesteps_total': None, 'episodes_total': None, 'training_iteration': 2, 'experiment_id': '5114d0c5bf68418881d70f3dfb48c829', 'date': '2020-05-15_14-55-37', 'timestamp': 1589525737, 'time_this_iter_s': 571.9673192501068, 'time_total_s': 1298.906806230545, 'pid': 2363670, 'hostname': 'ray-head-7554f464d9-xztj9', 'node_ip': '10.220.177.182', 'time_since_restore': 1298.906806230545, 'timesteps_since_restore': 0, 'iterations_since_restore': 2, 'config/mmd_epsilon': 2.4350544433433587e-05, 'config/sim_data_ratio': 0.3643946485499347, 'config/cost_epsilon': 0.00209137209284687, 'hyperband_info': {'budget': 2}})

I've looked through the source code and found that delta is calculated as delta = self._get_result_time(result) - self._get_result_time(self._live_trials[trial]), where _get_result_time retrieves result[self._time_attr]. I set self._time_attr to training_iteration in my script, which should increase monotonically while a trial is running.
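Paraphrasing that check (a simplified sketch pieced together from the traceback and the line above, not the verbatim hyperband.py source):

    def update_trial_stats(self, trial, result):
        # self._time_attr is "training_iteration" in my run.
        observed = self._get_result_time(result)                    # 1 in the failing case
        recorded = self._get_result_time(self._live_trials[trial])  # 2 in the failing case
        delta = observed - recorded
        # The scheduler assumes the time attribute never moves backwards within
        # a trial, so a fresh trainable reporting iteration 1 against a recorded
        # iteration 2 trips this assertion.
        assert delta >= 0, (result, self._live_trials[trial])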

However, the AssertionError shows that the latest training_iteration is 1, which, as I understand it, means a new trainable instance has started, while the last result recorded in self._live_trials shows a higher training_iteration (2 in the traceback above), meaning another trainable instance has been running, or already ran, in the same trial. The experiment_id fields prove that the two records come from two different experiments: the first has 'experiment_id': 'da43b2881ee941f5845472dc4f2c4e93' and the second has 'experiment_id': '5114d0c5bf68418881d70f3dfb48c829'.

I'm confused about how two different experiments can end up running in the same trial. I changed the parameters several times and the issue reproduced every time.
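The overall wiring looks roughly like the sanitized skeleton below. The trainable body, search-space bounds, max_t, max_concurrent, and sample count are placeholders standing in for my real code; the metric and hyperparameter names match the result dict above, and the calls follow the Ray 0.8.x BOHB interface as far as I know:

    import os
    import pickle

    import ConfigSpace as CS
    from ray import tune
    from ray.tune.schedulers import HyperBandForBOHB
    from ray.tune.suggest.bohb import TuneBOHB


    class PlaceholderTrainable(tune.Trainable):
        # Stands in for my real trainable; it only reports the same metric name.
        def _setup(self, config):
            self.step_count = 0

        def _train(self):
            self.step_count += 1
            return {"y1-y2": 1.0 / self.step_count}

        def _save(self, checkpoint_dir):
            path = os.path.join(checkpoint_dir, "state.pkl")
            with open(path, "wb") as f:
                pickle.dump(self.step_count, f)
            return path

        def _restore(self, checkpoint_path):
            with open(checkpoint_path, "rb") as f:
                self.step_count = pickle.load(f)


    config_space = CS.ConfigurationSpace()
    config_space.add_hyperparameter(
        CS.UniformFloatHyperparameter("mmd_epsilon", lower=1e-6, upper=1e-3))
    config_space.add_hyperparameter(
        CS.UniformFloatHyperparameter("sim_data_ratio", lower=0.0, upper=1.0))
    config_space.add_hyperparameter(
        CS.UniformFloatHyperparameter("cost_epsilon", lower=1e-4, upper=1e-2))

    scheduler = HyperBandForBOHB(
        time_attr="training_iteration",   # the attribute the assertion checks
        metric="y1-y2",
        mode="min",
        max_t=100)                        # placeholder budget
    search_alg = TuneBOHB(config_space, max_concurrent=4, metric="y1-y2", mode="min")

    tune.run(
        PlaceholderTrainable,
        scheduler=scheduler,
        search_alg=search_alg,
        num_samples=20)                   # placeholder sample count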

richardliaw commented 4 years ago

It might be that a trial was not saved correctly. Can you post a reproducible script?

kakakflo22thy commented 4 years ago

Sorry, I can't post the core code here. But I found that the .tune_metadata file is missing from the checkpoint dir. Here are the _save and _restore functions; does this help?

    def _save(self, model_path):
        # Save the actor and the two critics into the checkpoint directory and
        # return the directory itself as the checkpoint.
        print("_save")
        actor_filepath = model_path + '/actor.h5'
        reward_critic_filepath = model_path + '/reward_critic.h5'
        cost_critic_filepath = model_path + '/cost_critic.h5'
        self.vaemmd_test.actor.save_model(actor_filepath)
        self.vaemmd_test.reward_critic.save_model(reward_critic_filepath)
        self.vaemmd_test.cost_critic.save_model(cost_critic_filepath)
        return model_path

    def _restore(self, model_path):
        # Reload the three models from the checkpoint directory.
        print("_restore")
        actor_filepath = model_path + '/actor.h5'
        reward_critic_filepath = model_path + '/reward_critic.h5'
        cost_critic_filepath = model_path + '/cost_critic.h5'
        self.vaemmd_test.actor.load_model(actor_filepath, custom_objects={'LOG_SIG_CAP_MIN': LOG_SIG_CAP_MIN,
                                                                          'LOG_SIG_CAP_MAX': LOG_SIG_CAP_MAX,
                                                                          'tf': tf})
        self.vaemmd_test.reward_critic.load_model(reward_critic_filepath)
        self.vaemmd_test.cost_critic.load_model(cost_critic_filepath)

It turns out that only the .h5 files written by _save exist in the checkpoint dir (screenshot attached); there is no .tune_metadata file.
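My guess is that Tune appends '.tune_metadata' to whatever string _save returns, so returning the bare checkpoint directory never leaves a metadata file inside it. Assuming I'm reading the Ray 0.8.x behavior correctly, a minimal sketch of _save/_restore following that convention (pickle and get_state/set_state are placeholders for my real models) would be:

    import os
    import pickle

    # (inside my Trainable subclass)
    def _save(self, checkpoint_dir):
        # Return a file path inside checkpoint_dir rather than the directory
        # itself; Tune should then write "<path>.tune_metadata" next to it.
        checkpoint_path = os.path.join(checkpoint_dir, "model")
        with open(checkpoint_path, "wb") as f:
            pickle.dump(self.get_state(), f)   # get_state() is a placeholder
        return checkpoint_path

    def _restore(self, checkpoint_path):
        # Tune passes back exactly the string that _save returned.
        with open(checkpoint_path, "rb") as f:
            self.set_state(pickle.load(f))     # set_state() is a placeholder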

kakakflo22thy commented 4 years ago

I changed my code as below, and this time it outputs a model.tune_metadata file, but I still got the same error...

    def _save(self, model_path):
        print("_save")
        actor_filepath = model_path + '/actor.h5'
        reward_critic_filepath = model_path + '/reward_critic.h5'
        cost_critic_filepath = model_path + '/cost_critic.h5'
        self.vaemmd_test.actor.save_model(actor_filepath)
        self.vaemmd_test.reward_critic.save_model(reward_critic_filepath)
        self.vaemmd_test.cost_critic.save_model(cost_critic_filepath)
        # Return a path prefix inside the checkpoint dir instead of the bare
        # directory, so that a model.tune_metadata file gets written.
        return model_path + '/model'

    def _restore(self, model_path):
        print("_restore")
        print(model_path)
        # NOTE: split('/') yields every path component, so this unpacking only
        # succeeds when model_path contains exactly one '/'.
        model_path_pre, _ = model_path.split('/')
        actor_filepath = model_path_pre + '/actor.h5'
        reward_critic_filepath = model_path_pre + '/reward_critic.h5'
        cost_critic_filepath = model_path_pre + '/cost_critic.h5'
        self.vaemmd_test.actor.load_model(actor_filepath, custom_objects={'LOG_SIG_CAP_MIN': LOG_SIG_CAP_MIN,
                                                                          'LOG_SIG_CAP_MAX': LOG_SIG_CAP_MAX,
                                                                          'tf': tf})
        self.vaemmd_test.reward_critic.load_model(reward_critic_filepath)
        self.vaemmd_test.cost_critic.load_model(cost_critic_filepath)
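One more thing I noticed: model_path.split('/') only unpacks cleanly when the path contains a single '/', but as far as I can tell Tune hands _restore the full checkpoint path returned by _save, so something like the sketch below (os.path.dirname instead of split) is probably what I need; the load calls themselves stay the same:

    import os

    def _restore(self, model_path):
        print("_restore")
        # model_path is the string returned by _save (e.g. "<checkpoint dir>/model"),
        # so take its directory instead of unpacking split('/'):
        checkpoint_dir = os.path.dirname(model_path)
        actor_filepath = os.path.join(checkpoint_dir, 'actor.h5')
        reward_critic_filepath = os.path.join(checkpoint_dir, 'reward_critic.h5')
        cost_critic_filepath = os.path.join(checkpoint_dir, 'cost_critic.h5')
        # ...then load the three models from these paths exactly as before.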