microsoft / CyberBattleSim

An experimentation and research platform for investigating the interaction of automated agents in abstract simulated network environments.
MIT License

The repeated discovery of the same node via different methods leads to reward #18

Closed: jaromiru closed this issue 3 years ago

jaromiru commented 3 years ago

By generating a random CyberBattleRandom-v0 environment and running an agent with random actions, I noticed that if the same node is discovered with different methods (e.g., traceroute vs. shared files), the discovery reward is granted repeatedly.

This seems like incorrect behaviour, because it incentivizes the agent to repeatedly discover the same nodes with different actions.

blumu commented 3 years ago

@jaromiru That's indeed the current behaviour. We currently prevent a repeated reward only when the same attack action is used to discover the node: https://github.com/microsoft/CyberBattleSim/blob/56fdb676c452bf15b2bf709436c2eabed5a3830e/cyberbattle/simulation/actions.py#L317

We could address this by returning a reward of 0 when all the nodes revealed by the attack have already been discovered by the agent, though I am not sure what reward should be returned when only some of the revealed nodes were already known. One option is to scale the reward by the fraction of newly discovered nodes, i.e. the number of newly discovered nodes divided by the number of nodes revealed by the attack. Alternatively, we could use the intrinsic value assigned to each node to weigh the importance of each newly discovered node when calculating the reward.
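Here is a rough sketch of both options (the helper and its parameters are hypothetical, not the actual `actions.py` API):

```python
# Hypothetical sketch of the two fractional-reward options above;
# names are illustrative, not the actual actions.py API.
def scaled_discovery_reward(base_reward, revealed, known, value=None):
    """Scale base_reward by the fraction of revealed nodes that are new.

    base_reward -- reward the attack would normally yield
    revealed    -- ids of the nodes revealed by the attack
    known       -- ids of the nodes the agent has already discovered
    value       -- optional mapping node id -> intrinsic value; when
                   given, the fraction is weighted by node value
                   instead of node count
    """
    if not revealed:
        return 0.0
    new = [n for n in revealed if n not in known]
    if value is None:
        fraction = len(new) / len(revealed)
    else:
        fraction = sum(value[n] for n in new) / sum(value[n] for n in revealed)
    return base_reward * fraction
```

What do you think?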

jaromiru commented 3 years ago

Ok, let's take a step back. In my opinion, the reward should be based on the outcome of an action, not on whether the attack was successful. The network itself can change (you allow this explicitly, e.g. in the ExternalRandomEvents DefenderAgent), so the same attack can produce different results over time.

In the case of discovered nodes, the reward should also be additive, i.e. the total reward for discovering several nodes should be the same regardless of the exact steps taken to discover them. Neither of the proposed solutions seems right. The first does not cope with a dynamically changing network, nor with the chance that a node is rediscovered after such a change. The second leaks information about the nodes' values to the agent through the reward. A straightforward solution seems to be simply a fixed reward for every newly discovered node, along the lines of the sketch below.
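Just to make that concrete (a sketch only; `DISCOVERY_REWARD` is a made-up constant, not something from the codebase):

```python
# Sketch of a fixed reward per newly discovered node;
# DISCOVERY_REWARD is a made-up constant.
DISCOVERY_REWARD = 5.0

def reward_for_revealed_nodes(revealed, known):
    """Additive reward: a fixed amount for each node not seen before.

    revealed -- ids of the nodes revealed by the current action
    known    -- set of already-discovered node ids, updated in place
    """
    new_nodes = set(revealed) - known
    known |= new_nodes  # remember them so re-discovery pays nothing
    return DISCOVERY_REWARD * len(new_nodes)
```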

Since this is quite a substantial design decision, though, it would be better if more competent people gave it some thought.

blumu commented 3 years ago

@jaromiru Good points. Alternatively, we could have the environment assign a 'discovery value' to every node in addition to its intrinsic value, and use that instead of a fixed reward when the node gets discovered for the first time.
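Roughly like this (field names are illustrative, not the actual `model.NodeInfo` schema):

```python
from dataclasses import dataclass

# Illustrative only; not the actual model.NodeInfo schema.
@dataclass
class NodeInfo:
    intrinsic_value: float   # value of owning the node (existing concept)
    discovery_value: float   # one-time reward paid when first discovered

def on_node_discovered(node: NodeInfo, already_discovered: bool) -> float:
    """Pay the node's discovery value only on first discovery."""
    return 0.0 if already_discovered else node.discovery_value
```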

jaromiru commented 3 years ago

I thought more about giving a fraction of the intrinsic value as the reward for a discovery. It leaks some information, true, but that is not necessarily a bad thing: it will point the agent in the direction of high-value targets. And compared to the 'discovery value', it is one less parameter to worry about. So, for a game-like environment such as this one, it seems all right.

blumu commented 3 years ago

@jaromiru Fair point, though contrary to what I proposed earlier, the fraction should not be calculated as `value(nodes_newly_discovered) / value(nodes_revealed_by_attack)`, otherwise this could lead to different rewards depending on the execution order of the attacks.

For example, suppose you have two attacks A1 and A2 where:

- A1 reveals nodes {a, b}, with value(a) = 1 and value(b) = 2;
- A2 reveals nodes {b, c}, with value(b) = 2 and value(c) = 3.

Then A1 followed by A2 yields a total reward of 1 + 3/5, while A2 followed by A1 yields a total reward of 1 + 1/3.
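A quick script to check the arithmetic with the values above:

```python
# Check of the order dependence using the node values from the example.
value = {"a": 1, "b": 2, "c": 3}
A1, A2 = {"a", "b"}, {"b", "c"}

def fraction_reward(revealed, known, value):
    """value(newly discovered) / value(revealed), updating known in place."""
    new = revealed - known
    known |= new
    return sum(value[n] for n in new) / sum(value[n] for n in revealed)

known = set()
print(fraction_reward(A1, known, value) + fraction_reward(A2, known, value))
# 1.6          (= 1 + 3/5)

known = set()
print(fraction_reward(A2, known, value) + fraction_reward(A1, known, value))
# 1.3333...    (= 1 + 1/3)
```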

blumu commented 3 years ago

In PR #26, which I just submitted, the intrinsic value of the newly discovered nodes is returned as the reward. Would this be a reasonable solution?

jaromiru commented 3 years ago

I thought you were proposing simply `c * sum(intrinsic value of newly discovered nodes)`, with some c < 1. That's additive and reasonable. Is that what you implemented?

blumu commented 3 years ago

@jaromiru Yes, that's what I implemented, except that currently c = 1. I also added a fixed reward for newly discovered credentials and node properties.
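So the reward has roughly this shape (the constant names and bonus amounts below are illustrative, not the actual identifiers or values from the PR):

```python
# Rough shape of the implemented reward; illustrative names and values,
# not the actual identifiers from PR #26. Currently c = 1.
c = 1.0                            # scaling factor on intrinsic values
CREDENTIAL_DISCOVERY_REWARD = 3.0  # assumed fixed bonus per new credential
PROPERTY_DISCOVERY_REWARD = 2.0    # assumed fixed bonus per new node property

def attack_reward(new_node_values, n_new_credentials, n_new_properties):
    """Additive reward over newly discovered nodes, credentials, properties."""
    return (c * sum(new_node_values)
            + CREDENTIAL_DISCOVERY_REWARD * n_new_credentials
            + PROPERTY_DISCOVERY_REWARD * n_new_properties)
```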

jaromiru commented 3 years ago

Ok, I think we can close this for now.