hpi-sam / rl-4-self-repair

Reinforcement Learning Models for Online Learning of Self-Repair and Self-Optimization
MIT License

Environment setup needs to be changed. #16

Open 2start opened 4 years ago

2start commented 4 years ago

Currently, it seems like the rewards are set up incorrectly: the total reward does not rise while the algorithms are training.

See Notebook

Brainstorming:

christianadriano commented 4 years ago

@MrBanhBao Hao, could you please answer this? There is a cumulative reward formula we discussed that takes into account when each component was repaired: components that were repaired earlier have a higher weight in the cumulative reward.

Repairing a component twice seems a bit counter-intuitive. I agree that we can add a negative reward to discourage that.

MrBanhBao commented 4 years ago

@2start @christianadriano I already stated these problems as a comment in this ticket. I am closing the old one to prevent redundancy.

Regarding the negative reward: IMO it should be in a value range comparable to the reward itself. For instance, if we always get high rewards (9999999) and the punishment is just a static value of -1, I imagine the agent would probably still take those unnecessary actions, because the punishment did not "hurt" enough. So the punishment should be high, but relative to the value range of the possible rewards. There is still the uncertainty of the epsilon-greedy policy, which takes a random action with decreasing probability, and this random action is sampled from the env's action space.
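For illustration, a minimal sketch of such a scaled punishment (the constant names and magnitudes are assumptions, not values from the project):

```python
# Hypothetical reward shaping: tie the penalty to the positive reward scale
# instead of using a fixed -1, so a redundant repair "hurts" proportionally.

REPAIR_REWARD = 100.0    # assumed typical reward for a successful repair
PENALTY_FACTOR = 0.5     # assumed fraction of that reward used as punishment

def shaped_reward(action_was_redundant: bool, base_reward: float) -> float:
    """Return the reward, replacing redundant repairs with a scaled penalty."""
    if action_was_redundant:
        return -PENALTY_FACTOR * REPAIR_REWARD
    return base_reward
```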

Since we know that taking an action (component, failure) more than once (if the repair was successful) is useless and unnecessary, why don't we prevent those actions? The question is who should be responsible for this restriction: the environment itself or the agent? If it's the environment, the action space will change after each successful repair action. If it's the agent, the agent must ignore already taken actions to prevent useless ones. I'm fine with either way, but it is probably better to shift this responsibility to the environment.
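A minimal sketch of the environment-side option (class and method names are illustrative, not the project's actual API):

```python
import random

# The environment tracks which (component, failure) pairs were already repaired
# and only exposes the remaining ones, so the action space shrinks over time.

class RepairEnvSketch:
    def __init__(self, components, failures):
        self.all_actions = [(c, f) for c in components for f in failures]
        self.repaired = set()

    def valid_actions(self):
        # Already-repaired pairs are removed from the action space.
        return [a for a in self.all_actions if a not in self.repaired]

    def step(self, action):
        success = True  # placeholder for the real repair outcome
        if success:
            self.repaired.add(action)
        reward = 1.0 if success else 0.0
        done = len(self.repaired) == len(self.all_actions)
        return reward, done

# Epsilon-greedy exploration then also has to sample from the valid actions:
env = RepairEnvSketch(components=["C1", "C2"], failures=["F1"])
random_action = random.choice(env.valid_actions())
```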

Regarding the decreasing impact of later repairs, we could implement the following: `g^timestep * reward`, where g is a number between 0 and 1, so the factor g^timestep shrinks as time progresses. But this introduces an additional hyperparameter besides the learning rate. I am asking myself whether it is possible to solve this problem just by adjusting the learning rate of Q-learning. What are your opinions?
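As a sketch of this proposal (variable names and values are illustrative), the time-weighted reward would be plugged into an otherwise standard Q-learning update, next to the learning rate alpha:

```python
# Time-weighted reward as proposed above: g ** timestep * reward, with 0 < g < 1,
# so that later repairs contribute less. g is an extra hyperparameter besides
# the Q-learning learning rate alpha.

g = 0.9        # assumed time-decay factor, 0 < g < 1
alpha = 0.1    # Q-learning learning rate
gamma = 0.95   # ordinary Q-learning discount factor

def weighted_reward(reward: float, timestep: int) -> float:
    return (g ** timestep) * reward

def q_update(q, state, action, reward, next_state, timestep):
    # Standard Q-learning update with the time-weighted reward plugged in.
    target = weighted_reward(reward, timestep) + gamma * max(q[next_state].values())
    q[state][action] += alpha * (target - q[state][action])
```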

2start commented 4 years ago

@MrBanhBao Yes, in retrospect it seems like a good idea to me to prohibit or punish the repeated repairs.

The idea to use a dynamic action space in the environment to remove useless actions seems sensible. However, at the moment I would strongly favor a negative reward for the following pragmatic reasons:

Regarding the decreasing impact of later repairs: I think this is environment-specific and should be modeled in the environment, because it is the environment that defines the reward system, and the agent's responsibility is just to learn it. If we put the decreasing-reward logic into the agent, we would put environment-specific logic into the agent/algorithm.

christianadriano commented 4 years ago

Great discussion. It is logical to avoid actions that do not make sense. However, this 'do not make sense' is from the perspective of the environment. The agent might not know that a <component, failure> was successfully fixed, or that it broke again after a successful fix.

Hence, I believe that negative rewards are a more flexible and general way of modeling this uncertainty of fix/not fixed/failed to fix.

In the future the environment could be extended to consider three types of reward:

1. reward for fixing a broken component (high positive reward)
2. reward for trying to fix a component that was already fixed (high negative reward)
3. reward for failing to fix a component (zero reward)

This last type of reward could allow the agent to learn that some components are more difficult to fix.
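A minimal sketch of such a three-way reward function (the magnitudes are placeholders, not agreed values):

```python
# Sketch of the three reward types suggested above.

FIX_REWARD = 100.0           # (1) fixing a broken component: high positive
REDUNDANT_PENALTY = -100.0   # (2) repairing an already-fixed component: high negative
FAILED_FIX_REWARD = 0.0      # (3) attempting a fix that fails: zero

def reward_for(component_is_broken: bool, fix_succeeded: bool) -> float:
    if not component_is_broken:
        return REDUNDANT_PENALTY
    return FIX_REWARD if fix_succeeded else FAILED_FIX_REWARD
```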

This would even allow us to model repair dependencies between components, i.e., given two component-failure pairs, <C1,F1> and <C2,F1>, the first pair can only be fixed after the second pair has been successfully fixed; otherwise fixing <C1,F1> will always fail.
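For illustration, such a dependency could be checked inside the environment roughly as follows (names and structure are hypothetical):

```python
# <C1, F1> can only be fixed after <C2, F1> has been fixed successfully;
# otherwise the attempt always fails.

DEPENDENCIES = {("C1", "F1"): ("C2", "F1")}

def fix_attempt_succeeds(action, repaired: set) -> bool:
    prerequisite = DEPENDENCIES.get(action)
    if prerequisite is not None and prerequisite not in repaired:
        return False  # dependency not yet satisfied -> the fix always fails
    return True       # otherwise, outcome decided by the normal repair logic
```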

Btw, all of this discussion could be copy-pasted into the final report.