Negative rewards and rewarding new successful attacks on already-owned nodes

Here, the reward at each step is calculated as reward = max(0., reward). I have seen the penalty code in actions.py and I understand why you cancel it out: when I removed this max() operation, the reward became highly negative and the agent learned nothing. However, clipping at zero leaves the reward so sparse that the agent overfits. I think a small time cost per step, such as -1 or -0.5, is necessary.
Here, the reward for NEW_SUCCESSFULL_ATTACK_REWARD is given without checking whether the attacked node is already owned. Attacking a node the attacker already owns is meaningless, and rewarding it encourages the agent to repeatedly launch attacks between owned nodes instead of discovering new ones.
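A minimal sketch of the time-cost idea, assuming the clipping is kept and a constant cost is added afterwards. The names `shape_reward` and `STEP_COST` are hypothetical and are not part of CyberBattleSim:

```python
# Hypothetical per-step time cost; the issue suggests -1 or -0.5.
STEP_COST = -0.5


def shape_reward(raw_reward: float, step_cost: float = STEP_COST) -> float:
    """Clip the raw reward at zero, then add a constant per-step cost.

    Large penalties are still clipped away, but every step now costs a
    little, so the agent is nudged to finish in fewer steps.
    """
    return max(0.0, raw_reward) + step_cost
```

With this shaping, a heavily penalized step yields -0.5 rather than a large negative value, and a rewarded step keeps its reward minus the time cost.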
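A rough sketch of the ownership check, under the assumption that the environment can tell which nodes are owned and which attacks have already been tried. The function `new_attack_reward`, its parameters, and the reward value are illustrative only and do not reflect the actual code in the repository:

```python
from typing import Set, Tuple

# Illustrative value only; not necessarily the constant used in the library.
NEW_SUCCESSFULL_ATTACK_REWARD = 7.0


def new_attack_reward(
    target_node: str,
    attack_id: str,
    owned_nodes: Set[str],
    seen_attacks: Set[Tuple[str, str]],
) -> float:
    """Return the bonus for a first-time successful attack.

    The bonus is withheld when the attack was already used on this node
    or when the target node is already owned by the attacker.
    """
    key = (target_node, attack_id)
    if key in seen_attacks:
        return 0.0  # not a new attack
    seen_attacks.add(key)
    if target_node in owned_nodes:
        return 0.0  # proposed change: no bonus for attacking an owned node
    return NEW_SUCCESSFULL_ATTACK_REWARD
```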

In my experiment, I trained an agent with the original reward design in the chain env. During training, the agent could take ownership of the whole network perfectly. When I saved the model and evaluated it with epsilon-greedy, the success rate was only about 90%. When I patched the two points above and trained an agent with the same parameters, the evaluation success rate was about 100%. I think the original reward design makes the agent overfit.

Could you please take a look at these two points and give some feedback? Thanks again for your code. It helps with my research, and I would even like to use it in my next research project on online learning :)

Originally posted by @sherdencooper in https://github.com/microsoft/CyberBattleSim/issues/46#issuecomment-1136458981

