marlbenchmark / on-policy

This is the official implementation of Multi-Agent PPO (MAPPO).
https://sites.google.com/view/mappo
MIT License

Good work, but ... #2

Closed. hijkzzz closed this issue 3 years ago.

hijkzzz commented 3 years ago

If you also let QMIX use 8 rollout processes, increase the batch size and/or the number of training epochs per update, and finally use TD(lambda) <= 0.5, QMIX can beat all of these algorithms. See our brief tuning study, RIIT: https://arxiv.org/abs/2102.03479. I have really found that in MARL, hyperparameter-tuning problems have led to a pile of wrong conclusions and experiments, and sometimes even the motivation is wrong from the start, involving more than ten CCF-A papers; AAAI in particular has accepted papers whose proofs are wrong.
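For concreteness, the tuning suggested above could look roughly like the sketch below, written as a PyMARL-style configuration. The keys are hypothetical placeholders, not actual flags of this repository or of the RIIT code; the values simply mirror the suggestions in the comment.

```python
# Hypothetical QMIX tuning sketch following the suggestions above.
# None of these keys are guaranteed to match a real config file; they
# only illustrate which knobs are being discussed.
qmix_tuning = {
    "n_rollout_processes": 8,   # parallel environment workers, matching the MAPPO runs
    "batch_size": 128,          # episodes per training batch, larger than the usual default
    "epochs_per_update": 4,     # reuse each sampled batch for several gradient passes
    "td_lambda": 0.5,           # lambda-return target: 0 = one-step TD, 1 = Monte Carlo return
}
```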

jxwuyi commented 3 years ago

Thanks for the comments.

We are not surprised that, with better tuning, QMIX can do even better. However, this is parallel to our focus. What our paper shows is that policy-gradient algorithms, which are typically assumed to be less sample-efficient, can be surprisingly effective compared with the off-policy algorithms that are used the most in the existing MARL literature.

Regarding the wrong conclusions and motivations: as one of the initial contributors to the deep MARL literature, I partially agree that some of the conclusions may be worth re-examining, but I disagree that they are all wrong. The execution of research ideas can certainly be improved, but strong evidence should be expected before making such claims. I would encourage the authors to carefully study the existing literature and analyze what specifically is problematic, so as to benefit the community.

Here is a really nice example of how researchers from Google did so in a scientific manner previously: https://arxiv.org/abs/1802.10031

TonghanWang commented 3 years ago


Hi, thanks for your study of these implementation tricks. However, I think more experimental results and theoretical analysis are needed to support such a strong conclusion.

Most of the papers studied in your work are not motivated by tasks in the SMAC benchmark. Some SMAC tasks do not require a high degree of coordination among agents and typically do not suffer from problems like relative overgeneralization. For example, on the map corridor, independent PPO already performs well. On more complex tasks like predator-prey, algorithms such as WQMIX and QPLEX would, both theoretically and empirically, perform better.

Therefore, results on several SMAC maps without an analysis of the task structure may not benefit the MARL community. More experiments on tasks that feature the unique challenges of multi-agent learning are needed for a more productive discussion.
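As background for the relative overgeneralization point above (and the matrix game debated later in this thread), here is an illustrative payoff matrix of the kind used in the WQMIX/QPLEX line of work; the exact numbers are only representative, not taken from this repository or any specific paper.

```python
# Illustrative 2-agent cooperative matrix game. Row = agent 1's action,
# column = agent 2's action; both agents receive the same payoff.
payoff = [
    [ 8.0, -12.0, -12.0],
    [-12.0,  0.0,   0.0],
    [-12.0,  0.0,   0.0],
]
# The joint optimum (0, 0) pays 8, but any unilateral deviation from it pays -12,
# while the lower-right block safely pays 0. Roughly speaking, each agent's learned
# utility reflects the average payoff over the partner's exploratory behaviour, so
# the "safe" suboptimal actions look better than the optimal one. This is the
# relative overgeneralization problem that monotonic factorizations struggle with.
```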

hijkzzz commented 3 years ago


We are all Chinese, so let me answer in Chinese directly. Even granting that Predator-Prey is more complex than SMAC, that task is clearly better suited to continuous control (see MADDPG). Let me use the experimental results in Deep Multi-Agent Reinforcement Learning for Decentralised Continuous Cooperative Control to illustrate. The Predator-Prey variant used there is fully cooperative, as shown in the figure below, so does removing the monotonicity constraint really improve performance? Together with the three ablations in my paper, there are six tasks in total showing that monotonicity improves performance, which matters especially on SMAC. Your QPLEX merely moves the monotonicity constraint from Q to the advantage instead of removing it completely; the purpose of RIIT is to explore the effect of removing it entirely.

image

In other words, among the WQMIX and QPLEX experiments only the matrix game carries any real credibility, and yet that matrix game is solved instantly by MADDPG, so why go so far out of the way to build a pile of so-called theory of uncertain value? As for DOP, I genuinely did my best to get the code to run, but it kept crashing. In the end I had to add an entropy term to make it run at all, and the supposed performance improvement was close to nothing.

In addition, I will add the further theoretical support and experimental evidence you asked for later (within this month); I am busy with other things at the moment, so the paper will not be updated for now. I have no interest in criticizing other people's papers; I just hope that future research runs its experiments fairly, otherwise this field will rot. Honestly, whether MARL develops well or badly has little to do with me, and I have no interest in an algorithm showdown either.

Finally, that throwaway account on Reddit is most likely yours as well; you are probably in the DRL WeChat group. No need to be nervous, I bear no hostility; after all, there are more than one or two low-quality papers out there, and none of that has anything to do with me.
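For readers following the monotonicity argument above: QMIX enforces that Q_tot is monotonically non-decreasing in every per-agent utility by making all mixing weights non-negative, and the disagreement here is about what happens when that constraint is lifted (as RIIT explores) or moved onto the advantage (as in QPLEX). Below is a minimal, simplified PyTorch sketch of such a monotonic mixer; it is illustrative only and is not the implementation used in this repository or in PyMARL.

```python
import torch
import torch.nn as nn

class MonotonicMixer(nn.Module):
    """Minimal sketch of a QMIX-style mixing network (simplified, hypothetical).
    Monotonicity dQ_tot/dQ_i >= 0 is enforced by taking absolute values of the
    hypernetwork-generated mixing weights."""

    def __init__(self, n_agents: int, state_dim: int, embed_dim: int = 32):
        super().__init__()
        self.n_agents = n_agents
        self.embed_dim = embed_dim
        # Hypernetworks map the global state to the mixing weights and biases.
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim),
                                      nn.ReLU(),
                                      nn.Linear(embed_dim, 1))

    def forward(self, agent_qs: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        # agent_qs: (batch, n_agents), state: (batch, state_dim)
        bs = agent_qs.size(0)
        # abs() keeps the first-layer mixing weights non-negative.
        w1 = torch.abs(self.hyper_w1(state)).view(bs, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(bs, 1, self.embed_dim)
        hidden = torch.relu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)
        # Same trick for the second layer; the state-dependent bias is unconstrained.
        w2 = torch.abs(self.hyper_w2(state)).view(bs, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(bs, 1, 1)
        q_tot = torch.bmm(hidden, w2) + b2   # (batch, 1, 1)
        return q_tot.view(bs, 1)
```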

hijkzzz commented 3 years ago


Hello, Professor Wu. I think this work by your team is very good; I was only raising a few existing issues. Thanks.