Closed by @GiovanniGatti 2 years ago
FYI, this issue is being redirected from https://discuss.ray.io/t/assert-agent-key-not-in-self-agent-collectors/1489/6.
The exception is also raised with Ray 1.1.0. Log: 70_driver_log.txt
I'm running into the same error when using PPO for multi-agent RL (shared policy) with an increasing and then decreasing number of agents over time. For me, the error only occurs with Ray 1.3, not with version 1.0. It also does not occur if the number of agents is constant (or only increases, never decreases).
Maybe this already helps a bit to narrow down the bug. If I figure out more details, I'll add them here.
It's a bit difficult to debug; even when running with a debugger and setting `local_mode=True` in `ray.init()`, the debugger only stops at the TuneError "Trials did not complete", not at the initial assertion error that caused the problem. So I can't see the values that trigger the assertion error.
Any way to debug this effectively?
I think the error somehow comes from my environment removing agents during an episode. What I currently do is to simply remove the corresponding observations, info, done, reward for the "removed" agent. Do I need to actively remove/de-register the agent from RLlib somehow? @sven1977 Any idea?
For now, I'll roll back to v1.0 to temporarily fix the problem for me.
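To illustrate what "removing" an agent looks like here, this is a hypothetical sketch (not the actual environment from this report) of a Ray 1.x `MultiAgentEnv` whose `step()` marks a leaving agent done in its final step and then simply stops including it in the returned dicts:

```python
# Hypothetical shrinking multi-agent env (illustration only, Ray 1.x API):
# an agent that leaves gets done=True in its final step and is then
# dropped from all subsequent obs/reward/done/info dicts.
from ray.rllib.env.multi_agent_env import MultiAgentEnv


class ShrinkingEnv(MultiAgentEnv):
    def __init__(self, config=None):
        self.agents = []
        self.t = 0

    def reset(self):
        self.t = 0
        self.agents = ["agent_0", "agent_1", "agent_2"]
        return {aid: 0.0 for aid in self.agents}

    def step(self, action_dict):
        self.t += 1
        leaving = {"agent_2"} if self.t == 5 else set()

        obs = {aid: float(self.t) for aid in self.agents}
        rew = {aid: 0.0 for aid in self.agents}
        # Leaving agents get their final done=True; "__all__" ends the episode.
        done = {aid: (aid in leaving) for aid in self.agents}
        done["__all__"] = self.t >= 10
        info = {aid: {} for aid in self.agents}

        # From the next step on, the leaving agents no longer appear at all.
        self.agents = [aid for aid in self.agents if aid not in leaving]
        return obs, rew, done, info
```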
I have also been having the same error. As far as I can see, the error seems to occur when the length of the environment's episode is smaller than the `rollout_fragment_length` parameter of algorithms such as PPO. It seems that whenever this occurs, it's possible that a new episode starts with the same episode ID as the previous one (here, I refer to the `episode_id` attribute of the `Episode` object which gets passed to the `add_init_obs` function). Then the `agent_key` variable, which is a tuple of the policy ID and episode ID, is the same as in the previous episode and causes the error.
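As a toy illustration of that failure mode (simplified, not RLlib's actual collector code): the collector keeps one entry per key that contains the episode ID, so a reused episode ID trips the assertion from this issue's title.

```python
# Simplified model of the collector bookkeeping: the exact key layout in
# RLlib doesn't matter here, only that it contains the episode ID.
agent_collectors = {}

def add_init_obs(episode_id, agent_id, obs):
    agent_key = (episode_id, agent_id)
    assert agent_key not in agent_collectors  # <- the assertion in question
    agent_collectors[agent_key] = [obs]

add_init_obs(1234, "agent_0", obs=0.0)  # first episode: fine
add_init_obs(1234, "agent_0", obs=0.0)  # reused episode_id: AssertionError
```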
As I wrote before, I suspect that the issue is caused by the birthday paradox. Using `self.episode_id: int = random.randrange(2e9)` for generating IDs gives way too small a range. The workaround I implemented (and it has been working fairly well for 2 months now) was to reduce the number of episodes generated at each training iteration.
If my theory is correct, it also explains the randomness of the error. The conflicting ids are a question of probability. It may never occur if the training is too short or the number of episodes per training iteration is too small. The only way to reliably reproduce it is to run a very long training with a lot of episodes in each training iteration.
If it turns out that this is a birthday-paradox issue, I would suggest using a reasonably long hash (e.g., generated with murmur) for episode IDs.
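A rough sketch of that suggestion (using the standard library's `uuid` here instead of murmur, purely as an illustration of the ID-space size):

```python
# The point is the size of the ID space, not the particular hash function:
# randrange(2e9) gives ~2e9 possible values, a 128-bit random ID gives 2**128.
import random
import uuid

narrow_id = random.randrange(int(2e9))  # current scheme, ~2e9 values
wide_id = uuid.uuid4().hex              # 128 bits of randomness

print(narrow_id, wide_id)
```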
@GiovanniGatti In my multi-agent scenario with a first increasing, then decreasing number of active agents, the error is reproducible, so it's not (just) due to random IDs from too small a range. In my case, there seems to be another issue.
@JSchapke Is it possible that this is somehow triggered by suspending previously active RL agents?
Has this been solved or worked on? I am facing the same problem now in a multi-agent setup, but in the `add_action_reward_next_obs()` function, line 524, with the assertion `assert agent_key in self.agent_collectors`, and it happens basically in the second training iteration.
Error stack:
File "\venv\lib\site-packages\ray\_private\function_manager.py", line 556, in actor_method_executor
return method(__ray_actor, *args, **kwargs)
File "\venv\lib\site-packages\ray\util\iter.py", line 1158, in par_iter_next_batch
batch.append(self.par_iter_next())
File "\venv\lib\site-packages\ray\util\iter.py", line 1152, in par_iter_next
return next(self.local_it)
File "\venv\lib\site-packages\ray\rllib\evaluation\rollout_worker.py", line 332, in gen_rollouts
yield self.sample()
File "\venv\lib\site-packages\ray\rllib\evaluation\rollout_worker.py", line 720, in sample
batch = self.input_reader.next()
File "\venv\lib\site-packages\ray\rllib\evaluation\sampler.py", line 96, in next
batches = [self.get_data()]
File "\venv\lib\site-packages\ray\rllib\evaluation\sampler.py", line 223, in get_data
item = next(self.rollout_provider)
File "\venv\lib\site-packages\ray\rllib\evaluation\sampler.py", line 596, in _env_runner
_process_observations(
File "\venv\lib\site-packages\ray\rllib\evaluation\sampler.py", line 852, in _process_observations
sample_collector.add_action_reward_next_obs(
File "\venv\lib\site-packages\ray\rllib\evaluation\collectors\simple_list_collector.py", line 524, in add_action_reward_next_obs
assert agent_key in self.agent_collectors
AssertionError
I stumbled over the problem as well when trying to run the SAC Halfcheetah config from the tuned examples in the repo. Happened seemingly randomly after 762 iters.
`rollout_fragment_length` is 1, speaking against your hypothesis, @JSchapke, and in favor of @GiovanniGatti's theory.
I'm running Ray 1.3.0, so this might or might not have been fixed by now.
@vonHartz, I've been experiencing this issue regularly in Ray 1.4.0. I don't think it has been fixed by now because I don't see anything addressing it in the release notes.
As a workaround, one can do the following: if the job fails and the reason is `assert agent_key not in self.agent_collectors`, relaunch the training job from the latest checkpoint. In Python:
```python
import ray
from ray import tune
from ray.tune.trial import Trial  # Trial.ERROR == "ERROR" (Ray 1.x)

ray.init()

# run_training() is your own wrapper around tune.run() that returns the
# ExperimentAnalysis object (and, on relaunch, restores from the latest
# checkpoint).
analysis = run_training()

while True:
    relaunch = False
    incomplete_trials = []
    for trial in analysis.trials:
        if trial.status == Trial.ERROR:
            # Only retry if the failure is the known collector assertion.
            if 'assert agent_key not in self.agent_collectors' in trial.error_msg:
                relaunch = True
            else:
                incomplete_trials.append(trial)
    if incomplete_trials:
        raise tune.TuneError("Trials did not complete", incomplete_trials)
    if relaunch:
        # Here you can automatically detect and load the latest checkpoint
        # with tune.
        analysis = run_training()
        continue
    break
```
I'm not proud of it, but it works. In the worst case scenario, if you checkpoint regularly, you just lose a couple of training iterations.
I am running into this traceback with IMPALA after exactly 500,000 episodes, independent of episode length. I am running 100 workers, and weirdly, this issue only popped up after I went from 250 to 500 envs per worker.
edit: I had previously been running >500k episodes with the 100 workers / 250 envs per worker config.
> As I wrote before, I suspect that the issue is caused by the birthday paradox. Using `self.episode_id: int = random.randrange(2e9)` for generating IDs gives way too small a range. The workaround I implemented (and it has been working fairly well for 2 months now) was to reduce the number of episodes generated at each training iteration. If my theory is correct, it also explains the randomness of the error. The conflicting IDs are a question of probability. It may never occur if the training is too short or the number of episodes per training iteration is too small. The only way to reliably reproduce it is to run a very long training with a lot of episodes in each training iteration. If it turns out that this is a birthday-paradox issue, I would suggest using a reasonably long hash (e.g., generated with murmur) for episode IDs.
I still have this issue every now and then in Ray 1.8. Since it only happens sporadically, I suspect @GiovanniGatti is right (quoted above). For now, I simply rerun the broken training runs, but it would be great if there were a fix. I'll try to upgrade to Ray 1.12 and see if the issue persists.
Hello,
I ran into the same issue in Ray 1.12.1, so it has not been fixed. For me, it happens even with just a single-agent environment.
What is the problem?
After a couple of training iterations, the training job crashes with the following error:
Python version: 3.7.9
OS: Ubuntu 18.04
TensorFlow version: 2.4.1
Docker: Docker version 20.10.3, build 48d30b5
Logs: error.txt 70_driver_log.txt
I suspect that the error is caused by the following code:
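Presumably the snippet meant here is the episode-ID generation already quoted elsewhere in the thread:

```python
# Episode IDs are drawn from a range of only ~2e9 values:
self.episode_id: int = random.randrange(2e9)
```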
I'm using a training batch size that generates ~2k episodes per iteration. Assuming that episode IDs are independent per training iteration, the job has (if I'm not wrong in my calculations) a ~20% probability of generating at least one conflicting ID in the first 100 training iterations (due to the birthday paradox).
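For what it's worth, here is a quick back-of-the-envelope check of that estimate using the standard birthday-paradox approximation; the exact figure depends on how many episode IDs are "live" in the collectors at once, so the output should only be read as an order of magnitude:

```python
# P(at least one collision among n IDs drawn from N values) ~ 1 - exp(-n^2 / (2N)),
# compounded over training iterations assumed independent of each other.
import math

N = 2_000_000_000          # ID space of random.randrange(2e9)
episodes_per_iter = 2_000  # ~2k episodes generated per training iteration
iterations = 100

p_iter = 1.0 - math.exp(-episodes_per_iter ** 2 / (2 * N))
p_total = 1.0 - (1.0 - p_iter) ** iterations
print(f"per-iteration collision probability: {p_iter:.4f}")
print(f"probability of at least one collision in {iterations} iterations: {p_total:.2f}")
```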
Reproduction
I was able to reproduce the error with the following script: