prosysscience / JSSEnv

An OpenAI Gym environment for the Job Shop Scheduling problem.
MIT License
185 stars · 55 forks

Env crashes with IndexError when using random actions #9

Open DominikRoB opened 2 years ago

DominikRoB commented 2 years ago

Occasionally my environment (and thus my Ray workers) crashes at the beginning of training.

I observed two cases so far: the IndexError from the title, and a second error when the Nope action is handled (details in my comment below). Obviously the random actions steer the environment into a bad place.

How did you handle this during your own training? Currently I can't train my agents because they crash when the env crashes. (I'm using Ray RLlib.)

Code to reproduce the error:


```python
import gym
import JSSEnv  # registers the JSSEnv-v1 environment with gym

env_config = {"instance_path": "\\instances\\ta20"}  # Windows-style path to a Taillard instance
env = gym.make('JSSEnv-v1', env_config=env_config)

obs = env.reset()
while True:
    action = env.action_space.sample()  # samples uniformly, including illegal actions
    obs, reward, done, _ = env.step(action)
    env.render()
    if done:
        print("Episode ended")
        break
env.close()
```
DominikRoB commented 2 years ago

Running `ray.tune.run(..)` with `max_failures=-1` helps, but it spends a lot of time on failed runs :(
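
For reference, a sketch of that workaround (the algorithm choice and the env registration are illustrative, not from this thread):

```python
import gym
import JSSEnv  # registers JSSEnv-v1 with gym
from ray import tune

# Register the env under a name RLlib workers can resolve.
tune.register_env("jss_env",
                  lambda cfg: gym.make("JSSEnv-v1", env_config=cfg))

tune.run(
    "PPO",
    config={
        "env": "jss_env",
        "env_config": {"instance_path": "\\instances\\ta20"},
    },
    max_failures=-1,  # -1 = restart a failed trial without limit
)
```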

ingambe commented 2 years ago

Hi,

The behavior you observed is normal. As the environment contains illegal actions depending on the state, you have to sample from the legal action vector. Please refer to this discussion: https://github.com/prosysscience/JSSEnv/issues/6#issuecomment-1083038618
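
For illustration, a minimal sketch of sampling from the legal action vector rather than the whole action space (it assumes the observation dict exposes the mask under the key "action_mask", as described in the README):

```python
import gym
import numpy as np
import JSSEnv  # registers JSSEnv-v1 with gym

env = gym.make('JSSEnv-v1', env_config={"instance_path": "\\instances\\ta20"})
obs = env.reset()
done = False
while not done:
    legal = np.flatnonzero(obs["action_mask"])  # indices of currently legal actions
    action = int(np.random.choice(legal))       # sample only among those
    obs, reward, done, _ = env.step(action)
env.close()
```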

Using a parametric action space with RLlib requires using a network that can mask such actions: https://docs.ray.io/en/latest/rllib/rllib-models.html#variable-length-parametric-action-spaces
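
For readers who land here, a condensed sketch of such a masking model, loosely following RLlib's parametric-actions example (the observation keys "action_mask" and "real_obs" match this env's observation dict; the exact model API depends on your RLlib version):

```python
import torch
import torch.nn as nn
from ray.rllib.models.torch.torch_modelv2 import TorchModelV2
from ray.rllib.models.torch.fcnet import FullyConnectedNetwork

class MaskedActionsModel(TorchModelV2, nn.Module):
    """Sketch: zero out illegal-action probabilities via the logits."""

    def __init__(self, obs_space, action_space, num_outputs, model_config, name):
        TorchModelV2.__init__(self, obs_space, action_space, num_outputs,
                              model_config, name)
        nn.Module.__init__(self)
        # The inner network only sees the "real" observation, not the mask.
        self.internal_model = FullyConnectedNetwork(
            obs_space.original_space["real_obs"], action_space,
            num_outputs, model_config, name + "_internal")

    def forward(self, input_dict, state, seq_lens):
        action_mask = input_dict["obs"]["action_mask"]
        logits, _ = self.internal_model({"obs": input_dict["obs"]["real_obs"]})
        # log(0) = -inf for illegal actions; clamp so the logits stay finite.
        inf_mask = torch.clamp(torch.log(action_mask),
                               min=torch.finfo(torch.float32).min)
        return logits + inf_mask, state

    def value_function(self):
        return self.internal_model.value_function()
```

Register it with `ModelCatalog.register_custom_model` and point `"custom_model"` in your model config at it.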

DominikRoB commented 2 years ago

Hey, thanks for your answer!

I hadn't seen the discussion; thanks for the hint - it looks like it can help me.

I tried to make the environment more internally stable so that it can handle any random action. (I'd like to use action masking as a speed-up, not as the only way to make it run.) I'm not sure how much I've succeeded yet, because I run into some very long iterations in Ray; I'm seeing increases in the reward, though.

Nevertheless: the IndexError is indeed a result of setting the action_space too large. It should be equal to self.jobs, not self.jobs + 1 ([0, self.jobs - 1] are the jobs and self.jobs is the Nope action, right?)

I avoided the other error by adding a len(self.next_time_step) > 0 check before self._increase_time_step() is called when handling the Nope action. Will this lead to bad consequences in the long run?
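
For clarity, the guard amounts to something like this (a sketch using the names from the comment above, not the exact upstream code):

```python
# inside step(), when the Nope action is taken:
if len(self.next_time_step) > 0:  # only advance if a machine event is pending
    self._increase_time_step()
```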

PS: Good work with the paper and the code! I'm very grateful that you went out of your way to publish your work here.

ingambe commented 2 years ago

Thanks a lot for the interest and the compliment ;)

> I hadn't seen the discussion; thanks for the hint - it looks like it can help me.

I've included this in the README; I hope it clarifies how to use it.

> I tried to make the environment more internally stable so that it can handle any random action. (I'd like to use action masking as a speed-up, not as the only way to make it run.) I'm not sure how much I've succeeded yet, because I run into some very long iterations in Ray; I'm seeing increases in the reward, though.

I don't recommend doing so, mainly because allowing the agent to take random actions makes the problem harder: not only does it have to learn to schedule, it now also has to learn to distinguish between legal and illegal actions. The traditional way to handle illegal actions is to mask their logits before the softmax, which sets their probabilities to near 0. I recommend this nice article exploring action masking and its impact: https://arxiv.org/abs/2006.14171
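
In code, the masking trick looks roughly like this (a standalone NumPy sketch, not code from this repo):

```python
import numpy as np

def masked_softmax(logits: np.ndarray, action_mask: np.ndarray) -> np.ndarray:
    # Illegal actions get a logit of -inf, so exp(-inf) = 0 and they
    # receive (numerically) zero probability mass after normalization.
    masked = np.where(action_mask.astype(bool), logits, -np.inf)
    shifted = masked - masked.max()  # max over legal actions keeps exp() stable
    probs = np.exp(shifted)
    return probs / probs.sum()
```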

> Nevertheless: the IndexError is indeed a result of setting the action_space too large. It should be equal to self.jobs, not self.jobs + 1 ([0, self.jobs - 1] are the jobs and self.jobs is the Nope action, right?)

Indeed, the environment allows for a Nope action. You're correct: [0, self.jobs - 1] are the jobs, and the last action is the Nope action.

> I avoided the other error by adding a len(self.next_time_step) > 0 check before self._increase_time_step() is called when handling the Nope action. Will this lead to bad consequences in the long run?

In theory, this shouldn't have any impact, as the environment checks it before allowing actions in the legal action vector. I tried to limit the number of `if` statements because I was targeting performance in my original work, but if you feel safer having them, it is perfectly okay ;)