hpi-sam / Robust-Multi-Agent-Reinforcement-Learning-for-SAS

Research project on robust multi-agent reinforcement learning (MARL) for self-adaptive systems (SAS)
MIT License

What would be an appropriate negative reward for fixing a non-failing component? #21

Closed ulibath closed 2 years ago

ulibath commented 2 years ago

Simply inverting the component's utility might have a bad influence on the training.

christianadriano commented 2 years ago

Initial ideas: a zero reward, or a negative reward (the negative of the actually failing component's utility). Neither guarantees that we do not force a shop to be deprioritized because of an error in the action selection. If that is the case, the solution might not be possible at the agent level. The rank-learner/coordinator might be a better place to decide the magnitude of the punishment, because the rank-learner has a more global perspective of all the shops involved. The ideal punishment should be one that still keeps the <component failure, shop> pair within its correct ranking position.

Thoughts?
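To make the options concrete, here is a minimal Python sketch. The function names, the pair scores, and the clipping rule are illustrative assumptions, not part of the project's code.

```python
# Hypothetical sketch of the penalty options above; none of these names
# exist in the repository, they only illustrate the ideas discussed.

def zero_reward() -> float:
    """Option 1: fixing a non-failing component simply earns nothing."""
    return 0.0

def inverted_utility_reward(failing_component_utility: float) -> float:
    """Option 2: punish with the negative utility of the actually failing component."""
    return -failing_component_utility

def rank_preserving_reward(failing_component_utility: float,
                           own_pair_score: float,
                           next_lower_pair_score: float) -> float:
    """Option 3 (coordinator-level idea): punish, but clip the penalty so the
    <component failure, shop> pair does not drop below its correct ranking
    position. The rank-learner would supply the two scores, since it has the
    global view of all shops."""
    headroom = max(own_pair_score - next_lower_pair_score, 0.0)
    return -min(failing_component_utility, headroom)
```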

jocodeone commented 2 years ago

If we have a wrong fix, we should apply a negative reward so the agent learns faster that this was bad. A zero reward would imply that the fix had no negative effect. To be consistent with a software system, we should either reduce the utility of the still-failing component, reduce the utility of the wrongly fixed component, or both. The RidgeRegression does not use the current utility when predicting the utility that ranks a <component failure, shop> pair, so we can change this attribute without any impact on the ranking.
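A rough sketch of that utility update, assuming a simple dict-based state; the structure and names are placeholders, not the project's actual data model:

```python
# Hypothetical sketch: after a wrong fix, lower the utility of the wrongly
# fixed component and/or the still-failing one. Because the RidgeRegression
# ranking does not read the current utility, this does not disturb the ranking.

def apply_wrong_fix_penalty(components: dict, wrongly_fixed_id: str,
                            still_failing_id: str,
                            penalty_factor: float = 0.5) -> None:
    components[wrongly_fixed_id]["utility"] *= (1.0 - penalty_factor)
    components[still_failing_id]["utility"] *= (1.0 - penalty_factor)
```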

@christianadriano could you explain the deprioritization? Does this mean that we rank a shop lower if more false positives happened for it in the past?