Implementations of IQL, QMIX, VDN, COMA, QTRAN, MAVEN, CommNet, DyMA-CL, and G2ANet on SMAC, the decentralised micromanagement scenario of StarCraft II
In line 63 of "rollout.py", the else branch passes `evaluate` as the `maven_z` parameter of `choose_action`, so the evaluation flag never reaches the function. The correct code passes `evaluate` to the `evaluate` parameter; see the sketch below.
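A minimal sketch of the bug and the fix, assuming `choose_action` takes `(obs, last_action, agent_num, avail_actions, epsilon, maven_z=None, evaluate=False)`; the stand-in function body and variable values here are illustrative, not the repo's exact code:

```python
def choose_action(obs, last_action, agent_num, avail_actions,
                  epsilon, maven_z=None, evaluate=False):
    # Stand-in for the real method; just report what it received.
    print(f"maven_z={maven_z!r}, evaluate={evaluate!r}")

obs, last_action, agent_id, avail_actions, epsilon, evaluate = (
    None, None, 0, None, 0.0, True)

# Buggy (else branch of rollout.py): `evaluate` is passed positionally,
# so it binds to `maven_z` -> prints "maven_z=True, evaluate=False".
choose_action(obs, last_action, agent_id, avail_actions, epsilon, evaluate)

# Corrected: bind the flag by keyword so it reaches `evaluate`
# -> prints "maven_z=None, evaluate=True".
choose_action(obs, last_action, agent_id, avail_actions, epsilon,
              maven_z=None, evaluate=evaluate)
```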
After correcting it, the evaluation of the policy-gradient methods tends to be extremely unstable. In line 109 of "agent.py":

```python
if epsilon == 0 and evaluate:
    # evaluation: greedily take the most probable action
    action = torch.argmax(prob)
else:
    # training: sample an action from the learned distribution
    action = Categorical(prob).sample().long()
```
I think it is a mistake to take the argmax of `prob` during evaluation, because policy gradient learns the action distribution of the policy $\pi$, so we should sample from it at evaluation time as well. Just use the code below:
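```python
# Sample from the learned policy during evaluation as well.
action = Categorical(prob).sample().long()
```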
I have tried it, and it truly works!