Hi @zhonghai1995, this is very cool. It looks like you are trying to do some offline-to-online training. Let me take some time to look into why it doesn't seem to be working. I'll get back to you as soon as possible; I am working on it now.
I think I know what is going on. The QMIX system (qmix.py) has an argument called eps_decay_timesteps=50_000. This means that the qmix_cql.py system will use epsilon-greedy action selection for the first 50,000 timesteps, so your system is choosing mostly random actions when it goes online. Try setting that value to zero.
I see that in qmix_cql.py I did not expose the eps_decay_timesteps argument, so you may want to modify the code a bit so that you can change it.
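For context, here is a minimal sketch of how a linear epsilon-decay schedule like this typically behaves. The function name and exact schedule are illustrative assumptions, not og-marl's actual implementation:

```python
import random

def select_action(q_values, timestep, eps_decay_timesteps=50_000, eps_min=0.05):
    # Linearly anneal epsilon from 1.0 down to eps_min over eps_decay_timesteps.
    # With eps_decay_timesteps=0 the decay is skipped and epsilon stays at eps_min.
    if eps_decay_timesteps > 0:
        epsilon = max(eps_min, 1.0 - timestep / eps_decay_timesteps)
    else:
        epsilon = eps_min
    if random.random() < epsilon:
        # Early on (e.g. right after going online with a fresh step counter),
        # epsilon is close to 1.0, so the chosen action is essentially random.
        return random.randrange(len(q_values))
    # Otherwise act greedily with respect to the learned Q-values.
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

With eps_decay_timesteps=50_000 and a counter that starts at zero when going online, the first online steps are dominated by random actions; setting it to zero (or a small value) makes the system act greedily almost immediately.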
By the way, I have found IDRQN+CQL works better than QMIX+CQL.
I see, and it solves the problem. Thanks again!
I am glad it's working. We actually also did a research project on offline-to-online MARL which you might find interesting. You can find it here:
By the way, did you try to use OMAR for discrete actions? I tried the Gumbel-max trick in the SMAC environments, and the performance is bad.
I have also not successfully implemented OMAR for discrete actions. I have seen other people also have challenges with this. See here: https://github.com/thu-rllab/CFCQL/issues/1
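For reference, a common way to make discrete actions differentiable for this kind of policy-constraint method is the straight-through Gumbel-Softmax, a relaxation of the Gumbel-max trick mentioned above. This is a generic sketch of that trick, not OMAR's or CFCQL's actual code:

```python
import tensorflow as tf

def gumbel_softmax_sample(logits, temperature=1.0, hard=True):
    # Add Gumbel(0, 1) noise to the logits, then take a tempered softmax.
    uniform = tf.random.uniform(tf.shape(logits), minval=1e-10, maxval=1.0)
    gumbel = -tf.math.log(-tf.math.log(uniform))
    y_soft = tf.nn.softmax((logits + gumbel) / temperature, axis=-1)
    if hard:
        # Straight-through estimator: forward pass uses the one-hot argmax,
        # backward pass uses the gradients of the soft sample.
        y_hard = tf.one_hot(tf.argmax(y_soft, axis=-1), tf.shape(logits)[-1])
        y_soft = tf.stop_gradient(y_hard - y_soft) + y_soft
    return y_soft
```

Tuning the temperature (and whether the straight-through hard sample is used) tends to matter a lot in discrete-action settings like SMAC, which may partly explain the poor results people report.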
Hi @jcformanek, I tried OMAR on 2-Ant in MAMuJoCo with the Good dataset. I used Adam instead of RMSProp and increased the hidden sizes to 256, and the performance seems better; it roughly matches the performance of BC and ITD3+BC in Table D.5 of your paper. Please have a look.
I ran more seeds, so the result is now across 20 seeds. It achieves an average of roughly 1700 mean reward on the Good dataset in the 2-Ant scenario, worse than the BC-based methods but still better than reported in the table.
Oh that is great, thank you for sharing. We will work on updating all of the benchmark results.
One more question: why do you increase the env step counter here? Thanks!
That's used to control the epsilon-greedy exploration. It only has an effect if you train online.
But you also increase the environment step counter here. Since the default value of the explore argument of the select-action function is True, the environment step counter would be incremented twice for a single environment step. Is this expected?
Oh I see, I think you are right! That would result in exploration decreasing 2x faster than I expected. You are welcome to open a PR to fix it if you like; alternatively, I can attend to it.
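To make the effect concrete, here is a tiny illustration (hypothetical counter names, not og-marl's actual code) of why a double increment halves the effective decay horizon:

```python
eps_decay_timesteps = 50_000

env_step_ctr = 0
for _ in range(25_000):      # 25k real environment steps
    env_step_ctr += 1        # incremented inside action selection (explore=True)
    env_step_ctr += 1        # incremented again in the online training loop
assert env_step_ctr == eps_decay_timesteps
# The schedule sees 50k "steps" after only 25k real ones,
# so epsilon is fully decayed in half the intended time.
```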
https://github.com/instadeepai/og-marl/blob/68db0c007c73a06197f7b66d453ee4dd7429434e/og_marl/tf2/systems/qmix_cql.py#L191-L192 I also find that here the CQL loss is not multiplied by its weight. Is this expected? If not, please fix it.
I have just merged a fix (#28) for this and for the env_step_ctr bug. Thank you so much for finding and reporting these bugs. I really appreciate your contributions. Let me know if you find any more.
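For reference, the intended combination is simply to scale the conservative term before adding it to the TD loss. The names and the default weight below are illustrative assumptions, not the exact code in qmix_cql.py:

```python
import tensorflow as tf

def combine_losses(td_loss: tf.Tensor, cql_loss: tf.Tensor, cql_weight: float = 2.0) -> tf.Tensor:
    # Scale the conservative (CQL) penalty by its weight before adding it to the
    # temporal-difference loss. Dropping the multiplication silently turns the
    # cql_weight hyperparameter into a no-op, which is the bug reported above.
    return td_loss + cql_weight * cql_loss
```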
Hi @jcformanek, I see you added more benchmark results for datasets from previous works; thanks for this, it is really helpful. If I want to convert OMAR's MPE datasets (other than simple spread), how can I do it? Also, do I need to calculate the normalized score myself? If so, where can I find the expert and random scores for the dataset itself? Thanks so much.
I am glad you find it helpful. I'll upload the datasets for the other scenarios; we already converted them. The challenge we faced on those scenarios is that the MPE environment code they used depended on loading a pre-trained model (PyTorch) for the adversaries. If you can properly instantiate the environment for evaluation, then everything should work fine.
With regards to normalisation, the CFCQL paper describes it one way, but if you inspect their code you can see they simply normalise by dividing by the mean episode return of the dataset. And yes, you need to do the normalisation yourself.
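A minimal sketch of that normalisation, assuming you have per-episode returns for both your evaluation runs and the dataset (the example numbers below are made up):

```python
import numpy as np

def normalised_score(eval_episode_returns, dataset_episode_returns):
    # CFCQL-style normalisation: divide by the mean episode return of the dataset,
    # rather than using separate expert/random reference scores.
    dataset_mean = np.mean(dataset_episode_returns)
    return np.mean(eval_episode_returns) / dataset_mean

# Hypothetical example values.
print(normalised_score([550.0, 600.0], [400.0, 500.0, 600.0]))  # -> 1.15
```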
Thanks! I am trying to run the simple spread environment offline-to-online. During online training I need the state, but the infos obtained from step are just info_n [{}, {}, {}]. What is the state for MPE simple spread? I could extract it myself, but I do not know how it is composed. Please help me. Thanks so much!
I think I figured it out: the state is just the concatenation of the three agents' observations.
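For reference, a sketch of building the global state that way; the 18-dimensional per-agent observations are an assumption for three-agent simple spread:

```python
import numpy as np

def build_state(observations):
    # Global state as the concatenation of the per-agent observations,
    # in a fixed agent order (here: the order returned by the environment).
    return np.concatenate([np.asarray(obs, dtype=np.float32) for obs in observations])

# Hypothetical per-agent observations for 3-agent simple spread (18-dim each).
obs_n = [np.zeros(18), np.ones(18), np.full(18, 2.0)]
state = build_state(obs_n)
print(state.shape)  # (54,)
```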
Yes, I think you are correct! Also, @callumtilbury is uploading the other MPE vaults now. We will add the download link to the file og_marl/offline_dataset.py.
Hi @zhonghai1995 👋🏻 Here are the MPE vaults from OMAR:
"mpe_omar": {
"simple_spread": {"url": "https://s3.kao.instadeep.io/offline-marl-dataset/omar/simple_spread.zip"},
"simple_tag": {"url": "https://s3.kao.instadeep.io/offline-marl-dataset/omar/simple_tag.zip"},
"simple_world": {"url": "https://s3.kao.instadeep.io/offline-marl-dataset/omar/simple_world.zip"},
}
Note that for the simple_world and simple_tag scenarios, the observation dimensions are not homogeneous, so we pad them with -inf when appropriate.
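A sketch of that kind of padding (filling shorter observations with -inf up to the longest observation's length); the helper name is ours, not og-marl's:

```python
import numpy as np

def pad_observations(observations):
    # Pad heterogeneous per-agent observations with -inf so that they all
    # share the length of the longest observation.
    max_dim = max(len(obs) for obs in observations)
    padded = []
    for obs in observations:
        pad_width = max_dim - len(obs)
        padded.append(np.pad(np.asarray(obs, dtype=np.float32),
                             (0, pad_width), constant_values=-np.inf))
    return np.stack(padded)
```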
The vault conversion code can be found here: https://bit.ly/vault-conversion-notebook. For OMAR's MPE datasets, see Example 4.
Please let us know if you have any further questions or problems! 🚀
@callumtilbury This is super helpful for me! Thanks so much!
I am going to convert this "issue" into a "discussion" and then we can continue discussing using OG-MARL for offline-to-online MARL. :rocket:
Thanks so much for your work, I find it very helpful.
I am confused by a problem: I trained QMIX+CQL on StarCraft (SMAC v1), 3m scenario, first offline and then online. I commented out the training part in the train_online function, so it is actually just evaluating the performance. But I see a big performance discrepancy: offline training achieves around 20 episode return, but when I just evaluate it in train_online, the return is much lower, roughly in the range [2, 4]. I am very confused and hope you could provide some help or insight into it. Thanks so much!
Below are my episode return curves for offline and online. I also paste my online training code and part of my main() function.
def main(_):
    config = {
        "env": FLAGS.env,
        "scenario": FLAGS.scenario,
        "dataset": FLAGS.dataset,
        "system": FLAGS.system,
        "backend": "tf2",
    }