Implementations of IQL, QMIX, VDN, COMA, QTRAN, MAVEN, CommNet, DyMA-CL, and G2ANet on SMAC, the decentralised micromanagement scenario of StarCraft II
In line 63 of "rollout.py", the else branch passes `evaluate` as the `maven_z` parameter of `choose_action`, so the evaluation flag never reaches the function. The correct code passes `evaluate` to the `evaluate` parameter; see the sketch below.
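A minimal sketch of the bug and the fix, assuming `choose_action` takes `(obs, last_action, agent_num, avail_actions, epsilon, maven_z=None, evaluate=False)`; the stand-in function body and variable values here are illustrative, not the repo's exact code:

```python
def choose_action(obs, last_action, agent_num, avail_actions,
                  epsilon, maven_z=None, evaluate=False):
    # Stand-in for the real method; just report what it received.
    print(f"maven_z={maven_z!r}, evaluate={evaluate!r}")

obs, last_action, agent_id, avail_actions, epsilon, evaluate = (
    None, None, 0, None, 0.0, True)

# Buggy (else branch of rollout.py): `evaluate` is passed positionally,
# so it binds to `maven_z` -> prints "maven_z=True, evaluate=False".
choose_action(obs, last_action, agent_id, avail_actions, epsilon, evaluate)

# Corrected: bind the flag by keyword so it reaches `evaluate`
# -> prints "maven_z=None, evaluate=True".
choose_action(obs, last_action, agent_id, avail_actions, epsilon,
              maven_z=None, evaluate=evaluate)
```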
After correcting it, the evaluation of the policy-gradient methods tends to be extremely unstable. In line 109 of "agent.py":

```python
if epsilon == 0 and evaluate:
    # evaluation: greedily take the most probable action
    action = torch.argmax(prob)
else:
    # training: sample an action from the learned distribution
    action = Categorical(prob).sample().long()
```
I think it is a mistake to take the argmax of `prob` during evaluation, because policy gradient learns the action distribution of the policy $\pi$, so we should sample from it at evaluation time as well. Just use the code below:
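```python
# Sample from the learned policy during evaluation as well.
action = Categorical(prob).sample().long()
```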
I have tried it, and it truly works!