microsoft / CyberBattleSim

An experimentation and research platform to investigate the interaction of automated agents in abstract simulated network environments.
MIT License

Play the environment with other RL algorithms #91

Closed: Screamer-Y closed this issue 2 years ago

Screamer-Y commented 2 years ago

Hi everyone, I'm interested in how the environment in this project performs under other reinforcement learning algorithms, but I'm new to reinforcement learning and not yet capable of implementing other algorithms on my own. I noticed that algorithms such as Q-Learning and DQN have been implemented in the baselines directory, and that the definition of the environment is quite different from basic gym environments (e.g. CartPole), which makes it difficult for me to understand. Is it possible to implement different RL algorithms in the environments of this project with the help of RL algorithm libraries (e.g. Stable-Baselines3, Keras-RL)? I would greatly appreciate the help. Thank you.

2twentytwo2 commented 2 years ago

Hello, your e-mail has been received; I will reply as soon as I see it. This is an automatic reply, confirming that your e-mail was received. Thank you.

Screamer-Y commented 2 years ago

Hi everyone, I noticed that PR #86 already adds support for stable-baselines3 and gives corresponding use cases in the notebooks folder, which is of great help. Sorry for missing that before...
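
The rough pattern from that notebook looks something like the sketch below. The wrapper class names, the registered environment id, and the policy choice are my assumptions; check notebooks/stable-baselines-agent.py for the exact code.

import gym
from stable_baselines3 import PPO

# Wrapper names assumed from cyberbattle/_env/flatten_wrapper.py;
# see notebooks/stable-baselines-agent.py for the actual usage in PR #86.
from cyberbattle._env.flatten_wrapper import (
    FlattenActionWrapper,
    FlattenObservationWrapper,
)

# Toy capture-the-flag scenario registered by the cyberbattle package.
env = gym.make('CyberBattleToyCtf-v0')
env = FlattenActionWrapper(env)        # MultiDiscrete action space
env = FlattenObservationWrapper(env)   # nested observation flattened for SB3

# Dict observations -> MultiInputPolicy; PPO/A2C accept MultiDiscrete actions.
model = PPO('MultiInputPolicy', env, verbose=1)
model.learn(total_timesteps=10_000)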

kvas7andy commented 2 years ago

Hi @Screamer-Y, that is actually an interesting topic for me to investigate as well. If you run into any problems or find any literature using this simulator, feel free to share, and I can do the same if you are interested.

Edited: I found the source of the error; it was purely because of my own additions to the simulator.

Screamer-Y commented 2 years ago

Hi @kvas7andy, I'm glad to know you're interested in this topic too. I'm now working on https://github.com/microsoft/CyberBattleSim/blob/4fd228bccfc2b088d911e27072a923251203cac8/cyberbattle/_env/flatten_wrapper.py. My goal is to change the `action_space` from a `spaces.MultiDiscrete` to a `spaces.Discrete`, because the value-based algorithms in stable-baselines3 (e.g. DQN) only accept a `Discrete` action space. So far I have simply mapped every possible action to a single discrete index, but when I try to train a DQN agent, it cannot learn from the environment properly. I'm still trying to figure out what went wrong. Here is my modification:

from gym import spaces
from gym.core import ActionWrapper
import numpy as np

# Imports needed when this class is used outside flatten_wrapper.py,
# which already imports the CyberBattleSim types at the top of the file.
from cyberbattle._env.cyberbattle_env import Action, CyberBattleEnv


class FlattenActionWrapper(ActionWrapper):
    """
    Flatten all nested dictionaries and tuples from the action space
    of a CyberBattleSim environment `CyberBattleEnv`.
    The resulting action space is a `Discrete`.
    """

    def __init__(self, env: CyberBattleEnv):
        ActionWrapper.__init__(self, env)
        self.env = env

        # Total number of flat action indices:
        #   local:   source_node x local_attacks_count
        #   remote:  source_node x target_node x remote_attacks_count
        #   connect: source_node x target_node x port_count x maximum_total_credentials
        self.action_space = spaces.Discrete(
            env.bounds.maximum_node_count * env.bounds.local_attacks_count
            + env.bounds.maximum_node_count * env.bounds.maximum_node_count * env.bounds.remote_attacks_count
            + env.bounds.maximum_node_count * env.bounds.maximum_node_count * env.bounds.port_count * env.bounds.maximum_total_credentials)

    def action(self, action: np.int64) -> Action:
        n_nodes = self.env.bounds.maximum_node_count
        n_local_attacks = self.env.bounds.local_attacks_count
        n_remote_attacks = self.env.bounds.remote_attacks_count
        n_port = self.env.bounds.port_count
        n_credentials = self.env.bounds.maximum_total_credentials

        # First block of indices: local vulnerabilities (source_node, vulnerability_id)
        if action < n_nodes * n_local_attacks:
            source_node = action // n_local_attacks
            local_vulnerability = action % n_local_attacks
            return {'local_vulnerability': np.array([source_node, local_vulnerability])}

        # Second block: remote vulnerabilities (source_node, target_node, vulnerability_id)
        action -= n_nodes * n_local_attacks
        if action < n_nodes * n_nodes * n_remote_attacks:
            source_node = action // (n_remote_attacks * n_nodes)
            target_node = (action // n_remote_attacks) % n_nodes
            remote_vulnerability = action % n_remote_attacks
            return {'remote_vulnerability': np.array([source_node, target_node, remote_vulnerability])}

        # Third block: connect actions (source_node, target_node, port, credential)
        action -= n_nodes * n_nodes * n_remote_attacks
        if action < n_nodes * n_nodes * n_port * n_credentials:
            source_node = action // (n_nodes * n_port * n_credentials)
            target_node = (action // (n_port * n_credentials)) % n_nodes
            port = (action // n_credentials) % n_port
            credential = action % n_credentials
            return {'connect': np.array([source_node, target_node, port, credential])}

        raise ValueError(f'Action index out of range: {action}')

    def reverse_action(self, action):
        raise NotImplementedError

I'm not a good programmer, so feel free to point out any problems; I would appreciate it.
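
For context, this is roughly how I expect the wrapper above to be used with SB3's DQN once it is dropped into flatten_wrapper.py. Treat it as a sketch: the observation wrapper name and the policy choice are assumptions on my part.

import gym
from stable_baselines3 import DQN

# Assumes the Discrete-action FlattenActionWrapper above has been added to
# cyberbattle/_env/flatten_wrapper.py next to the existing observation wrapper.
from cyberbattle._env.flatten_wrapper import (
    FlattenActionWrapper,
    FlattenObservationWrapper,
)

env = FlattenObservationWrapper(FlattenActionWrapper(gym.make('CyberBattleToyCtf-v0')))

# DQN only supports a Discrete action space, which is the whole reason for
# collapsing the MultiDiscrete space into a single index above.
model = DQN('MultiInputPolicy', env, buffer_size=10_000, verbose=1)
model.learn(total_timesteps=50_000)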

forrestmckee commented 2 years ago

@Screamer-Y, did you get the stable-baselines example script to work? For me it runs, but the agent never learns anything using A2C or PPO.

Screamer-Y commented 2 years ago

Hi @forrestmckee, yes, from my side it works properly. I just ran the code in https://github.com/microsoft/CyberBattleSim/blob/main/notebooks/stable-baselines-agent.py without any modification.

forrestmckee commented 2 years ago

@Screamer-Y are you using Linux, WSL, or Docker?

I can get the script you referenced to run, but the agent never makes it off of the foothold node regardless of the number of time steps I set. I'm also getting warnings that the agent is trying to access an invalid index.

Screamer-Y commented 2 years ago

Hi @forrestmckee, I'm using Ubuntu Server 20.04 LTS. I ran the script again just now and only got one successful connect action in 10000 time steps. I think the problem comes from the way the `action_space` is defined in `flatten_wrapper`: it contains every possible attack, even invalid ones, which is also why you keep getting warnings. I had the same problem when turning the `action_space` into a `spaces.Discrete`. One possible solution is to reduce the dimensions of the `action_space`, as is done in [agent_wrapper](https://github.com/microsoft/CyberBattleSim/blob/main/cyberbattle/agents/baseline/agent_wrapper.py).
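
To see how large the flattened space is, and why most indices are invalid at any given point in an episode, you can print the bounds-derived block sizes the wrapper uses. A small sketch, assuming `env.bounds` is reachable through `gym.make` as in the wrapper code above:

import gym

# Size of each block of the flattened Discrete space for the toy-ctf scenario.
# Every index is always sampleable, even though only actions involving
# owned/discovered nodes are actually valid at any point in an episode.
env = gym.make('CyberBattleToyCtf-v0')
b = env.bounds

n_local = b.maximum_node_count * b.local_attacks_count
n_remote = b.maximum_node_count ** 2 * b.remote_attacks_count
n_connect = (b.maximum_node_count ** 2
             * b.port_count * b.maximum_total_credentials)

print('local:', n_local, 'remote:', n_remote, 'connect:', n_connect,
      'total:', n_local + n_remote + n_connect)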

Gabriel0402 commented 1 year ago

@forrestmckee I came across the same issue as you did. But I noticed another interesting thing: during training, although we get warnings that the agent is trying to access an invalid index, the number of nodes discovered so far keeps increasing. I think this means that A2C or PPO is actually working: they do discover new nodes. What I don't understand is that when the trained model is applied to action prediction, it never discovers new nodes.

@Screamer-Y I don't quite understand why the warnings would mean something is wrong. I think you will also see them if you set the logging levels, because we have to discover new nodes and their number is less than the maximum node count; as long as some nodes are still undiscovered, we will always get warnings. I also don't understand why we have to reduce the dimensions of the action space.

forrestmckee commented 1 year ago

@blumu Is there a planned sample_valid_action equivalent for flattened environments / Stable-Baselines3? I believe what I and others have discovered is that the entire observation and action spaces are "fair game" for the agent to sample from at any given time. Doesn't this mean that an agent can attempt to take an action both to and from a node that it hasn't discovered yet? This seems to greatly increase the number of time steps required for an agent to learn.

@Screamer-Y were you able to reduce the dims of the action space like you mentioned?
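
In the meantime, one crude workaround might be an extra wrapper around the raw CyberBattleEnv that swaps an invalid action for one drawn from sample_valid_action before it is applied. A sketch, assuming the environment exposes a validity check (`is_action_valid` below is hypothetical; substitute whatever check cyberbattle_env.py actually provides):

from gym.core import ActionWrapper

class ResampleInvalidActionWrapper(ActionWrapper):
    """Sketch: replace actions that are invalid in the current state with one
    sampled by the underlying CyberBattleEnv. Note this silently changes the
    action the agent thinks it took, so it is a workaround, not proper masking."""

    def action(self, action):
        # `is_action_valid` is hypothetical; check cyberbattle_env.py for the
        # real validity API. `sample_valid_action` is the method the built-in
        # baseline agents rely on.
        if self.env.is_action_valid(action):
            return action
        return self.env.sample_valid_action()

    def reverse_action(self, action):
        raise NotImplementedError

# Apply it underneath the flattening wrappers so the translated (dict) action
# is what gets validated:
#   env = FlattenActionWrapper(ResampleInvalidActionWrapper(raw_cyberbattle_env))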

blumu commented 1 year ago

@forrestmckee That's a good suggestion; it's not currently planned, but it feels like something we ought to have. Work on stable_baselines3 support is currently postponed until they complete the upgrade to the latest version of gym. Once that is done, the plan is to upgrade CyberBattleSim to the latest versions of gym and stable_baselines3, which will then allow further improvements like the one you mentioned.

Screamer-Y commented 1 year ago

Thanks for all the suggestions! @Gabriel0402 I think you are right about the warnings; I didn't have a good understanding of the code at the time.

Regarding the second question, I once hoped to speed up the learning process by reducing the size of the action_space, but after trying it I found that it did not help significantly.

@forrestmckee @blumu So I still have a question: is there a significant performance difference between the A2C or PPO methods implemented in Stable-Baselines3 and the DQN method implemented in agent_dql?

With the same settings of iteration_count=1500 and episode_count=20, I observed on toy-ctf that A2C only gets an average return of no more than 40 per episode, which is far from the average return of about 450 in the benchmark. I would be very grateful for any suggestions on improving the Stable-Baselines3 performance.
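
For reference, this is roughly how I matched the experience budget and measured the average return on my side; the wrapper names and policy choice are the same assumptions as in the earlier sketches.

import gym
from stable_baselines3 import A2C
from stable_baselines3.common.evaluation import evaluate_policy

from cyberbattle._env.flatten_wrapper import (
    FlattenActionWrapper,
    FlattenObservationWrapper,
)

# Baseline benchmark budget: episode_count=20 episodes of up to
# iteration_count=1500 steps, i.e. at most 20 * 1500 = 30_000 environment steps.
total_timesteps = 20 * 1500

env = FlattenObservationWrapper(FlattenActionWrapper(gym.make('CyberBattleToyCtf-v0')))
model = A2C('MultiInputPolicy', env, verbose=1)
model.learn(total_timesteps=total_timesteps)

# Average undiscounted return over a few evaluation episodes, to compare with
# the ~450 reported for the baseline DQL agent on toy-ctf.
mean_return, std_return = evaluate_policy(model, env, n_eval_episodes=10)
print(f'mean return: {mean_return:.1f} +/- {std_return:.1f}')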

forrestmckee commented 1 year ago

@Screamer-Y the difference between the built-in algorithms and Stable-Baselines3 is that the built-in ones check that each action is valid before taking it. SB3 doesn't, so a large portion of the time you're performing an action that is impossible given the current state of the environment.

Screamer-Y commented 1 year ago

@forrestmckee Thank you so much for the speedy reply. I think I've understood what you've mentioned in this comment and the previous one :)