instadeepai / og-marl

Datasets with baselines for offline multi-agent reinforcement learning.
https://instadeepai.github.io/og-marl/
Apache License 2.0

A question about the buffer size #10

Closed zyh1999 closed 4 months ago

zyh1999 commented 10 months ago

I would like to know the size of the buffer used in the SMAC scenario in the paper, as I reproduced the experiment with a buffer size of 100,000 on the 3m_good dataset and found that the performance of the BC and BCQ methods was significantly better than the results shown in your paper. At the same time, I noticed that the original size of this dataset is 120,569. I suspect that the reason for the above phenomenon is that the 20,569 trajectories I did not obtain are relatively random or inferior trajectories.

jcformanek commented 10 months ago

Hi @zyh1999, I suspect the difference in performance is due to the missing trajectories. The results in the paper used all of the trajectories. Could you try re-running your experiments, making sure to set the buffer size to be greater than the number of trajectories in the respective dataset? You can get the number of trajectories from the table in the appendix of the paper. I'll work on a feature to set the buffer size automatically.
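
In the meantime, a rough sketch of the check (assuming the object returned by `env.get_dataset(...)` is a plain iterable of samples; the helper below is illustrative, not part of the og-marl API):

```python
def count_samples(dataset):
    """Count how many items the dataset yields, so the replay buffer can be
    sized to hold all of them instead of silently truncating the data."""
    return sum(1 for _ in dataset)

# Hypothetical usage: size the buffer to at least the full dataset.
# num_samples = count_samples(env.get_dataset("Good", datasets_base_dir))
# buffer_size = max(buffer_size, num_samples)
```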

zyh1999 commented 10 months ago

Thanks, I have tested the 3m_good buffer with all 120,569 trajectories, but the performance of the BC method is still much better than the results shown in your paper. Another thing that confuses me is that when I check the average return of the 3m_good buffer (mean episode return), I get about 10.69, which is far from the 16.0 in your paper. Here is the code I used:

```python
def my_get_mean_return(args, environment_label="3m", dataset_type="Good"):
    env = smac.SMAC(environment_label)  # change SMAC scenario here
    dataset = env.get_dataset(dataset_type, args.datasets_base_dir)  # change dataset type here
    sample_cnt = 0
    tot_reward = 0
    for index, sample in enumerate(iter(dataset)):
        sample_cnt += 1
        agent = next(iter(env.agents))
        tot_reward += sample.rewards[agent].numpy().sum()
    print("the mean reward is: ", tot_reward / sample_cnt)
```

I would like to ask if my way of calculating the average return is incorrect.

jcformanek commented 10 months ago

Hi there, I will be back at my PC on Monday and will be able to investigate the discrepancy in the reported performance for BC on 3m then. But in the meantime I wanted to respond to your question about the episode return. There is a problem with how you are computing the average return. Each sample contains 20 sequential timesteps, which are not necessarily an entire episode (episodes in 3m are usually around 50 timesteps long), so an episode may be split across two samples. When the remainder of an episode does not fill out the entire 20 timesteps in a sample, we zero-pad the end of the sample. To work around this, I recorded the episode return associated with each sample in the dataset, which can be accessed like `sample["episode_return"]`.

I have added an example of how to compute the mean episode return to the script examples/download_dataset.py. Please refer to it and let me know if anything is unclear. Make sure you use the latest commit on the main branch.
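
For reference, the idea is roughly the following (a rough sketch only, not the contents of examples/download_dataset.py; it assumes each sample can be indexed as `sample["episode_return"]` as described above):

```python
import numpy as np

def mean_episode_return(dataset):
    """Average the per-sample episode_return over all samples in the dataset.

    Caveat: an episode that is split across several 20-step samples is
    counted once per sample, so longer episodes are weighted slightly more
    heavily; treat the result as an approximation.
    """
    total, count = 0.0, 0
    for sample in dataset:
        total += float(np.mean(sample["episode_return"]))
        count += 1
    return total / max(count, 1)
```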

If it would be helpful to you, I can work on stitching the samples together into entire episodes, rather than 20 timestep snippets.

zyh1999 commented 10 months ago

Thanks. So do you mean that in the buffer a complete trajectory may be divided into two samples, which is why the average return I calculated is smaller than the true one? May I ask why you do this splitting? Would storing complete trajectories not be better?

zyh1999 commented 10 months ago

Another thing I would like to ask about is whether these baseline performances eventually converge, or whether they only reach a higher point earlier in training. For example, when I ran QMIX on the 3m_good scenario, its performance only reached around 13.8 in the early stages of training and then gradually decreased to a much lower value.

jcformanek commented 9 months ago

The reason the samples are only portions of an entire trajectory is simply a relic of how my replay buffer was implemented. It was convenient to unroll the recurrent neural networks over shorter sequences (e.g. 20 timesteps) rather than the full episode, because longer sequences pose several challenges. For example, in environments with many agents, it can be challenging to do the RNN unrolling across all agents and the entire episode without running out of VRAM on the GPU, and using shorter sequences was an easy workaround. Furthermore, training RNN policies on long sequences can sometimes cause instability during training, which shorter sequences also mitigate.
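
As a rough illustration of the trade-off (an illustrative TF2 sketch with made-up shapes, not the og-marl training code), unrolling a recurrent Q-network over fixed 20-step chunks keeps the activation memory proportional to the chunk length rather than the full episode length:

```python
import tensorflow as tf

batch_size, seq_len, num_agents, obs_dim, num_actions = 32, 20, 3, 48, 9

# Observations for a batch of 20-step chunks: (B, T, N, obs_dim).
observations = tf.random.normal((batch_size, seq_len, num_agents, obs_dim))

# A simple recurrent Q-network shared across agents.
rnn = tf.keras.layers.GRU(64, return_sequences=True)
q_head = tf.keras.layers.Dense(num_actions)

# Merge batch and agent dims so the GRU unrolls over the time axis only.
x = tf.reshape(tf.transpose(observations, (0, 2, 1, 3)),
               (batch_size * num_agents, seq_len, obs_dim))
q_values = q_head(rnn(x))  # (B * N, T, num_actions)

# Activation memory grows with seq_len, so 20-step chunks are much cheaper
# than unrolling over a full episode, especially with many agents.
print(q_values.shape)
```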

Having said that, I do want to add support to OG-MARL for loading entire trajectories from the dataset. Now that I am back at work, I'll try to implement it as soon as possible.

jcformanek commented 9 months ago

With regards to your second question, when training offline, the main challenge is compounding extrapolation error on out-of-distribution actions. Basically, when training offline, the neural network is likely to erroneously assign a high value to out-of-distribution actions (i.e. actions not seen in the dataset). This is called extrapolation error. In online RL such erroneous extrapolations are quickly corrected through interaction and feedback from the environment, but in offline RL such feedback is not available and the errors are not corrected. These errors then compound during training because of the Bellman update, where the value of a given state-action pair is updated to be the reward plus the max value over next actions. That max operation means that erroneously high values are preferred and ultimately cause the performance of the policy to collapse. Thus, in offline RL, the longer one trains, the more likely it is that performance has begun to degrade due to compounding extrapolation error on out-of-distribution actions. I hope this clarifies things for you.
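
To make the mechanism concrete, here is a toy numerical sketch (illustrative only, with made-up numbers) of how the max in the Bellman target latches onto an overestimated out-of-distribution action:

```python
import numpy as np

gamma = 0.99
reward = 1.0

# Q-values for the next state's actions. Actions 0 and 1 appear in the
# dataset; action 2 is out-of-distribution and its value is an unchecked
# extrapolation by the network.
q_next = np.array([2.0, 1.5, 7.3])  # 7.3 is an erroneous overestimate

# Standard Q-learning / QMIX-style target: r + gamma * max_a' Q(s', a').
target = reward + gamma * np.max(q_next)

# The max selects the overestimated OOD value, so the error is bootstrapped
# into Q(s, a) and, without environment feedback to correct it, compounds
# over the course of offline training.
print(target)  # ~8.23, driven entirely by the erroneous 7.3
```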

zyh1999 commented 9 months ago

So do you mean that you selected the highest performance during the training period for the table in your paper, rather than the final performance?

jcformanek commented 9 months ago

I agree that reporting the best performance during training would not be fair in the offline setting. Therefore, we did not do that. We set an offline training budget (50 000 offline training steps in SMAC) and then reported the final performance at the end of that, for all algorithms. Furthermore, we tuned hyper-parameters on 3m only and kept them fixed on all other scenarios.

For your reference, here is a WANDB report I made with all of the runs. You can see that I reported the final performance of the run in the table. https://api.wandb.ai/links/off-the-grid-marl-team/0iopyeen

zyh1999 commented 9 months ago

Thank you for your detailed explanation! But it seems that my QMIX (based on https://github.com/oxwhirl/pymarl/blob/master/src/modules/mixers/qmix.py) always starts to decline rapidly in performance after fewer than 10,000 training steps. I may need to take a closer look at your code and settings.

zyh1999 commented 9 months ago

Hello, I noticed the hyperparameters in the appendix say that you use a soft target update rate for QMIX, but I found it has been removed in https://github.com/instadeepai/og-marl/blob/e752264b66fbdb115eb617d537773ac54f106d81/og_marl/tf2/systems/idrqn.py#L267. So, which version was used for the original results in the paper?

zyh1999 commented 9 months ago

By the way, I tested the BC method on terran_5_5, and it is also much higher than the performance in your paper. My BC performance is similar to the "sample mean return". I speculate that this may mean that even for SMACv2 it is difficult to form multiple completely different optimal solutions, because if there were very different optimal solutions in the buffer, BC would score much lower than the "sample mean return".

zyh1999 commented 8 months ago

Hi, have you investigated the discrepancy in the BC results yet?

jcformanek commented 8 months ago

Sorry for the late reply. We have been very busy recently. We are releasing a fairly big update to OG-MARL soon, which should make it easier to use.

But to try to answer your questions:

> Hello, I noticed the hyperparameters in the appendix say that you use a soft target update rate for QMIX, but I found it has been removed in https://github.com/instadeepai/og-marl/blob/e752264b66fbdb115eb617d537773ac54f106d81/og_marl/tf2/systems/idrqn.py#L267. So, which version was used for the original results in the paper?

In the paper we used soft updates, but since refactoring the code we have switched to hard target updates to be more similar to the MAICQ implementation.
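
For reference, the two update schemes differ roughly as follows (an illustrative TF2 sketch, not the exact og-marl code; the `tau` and `period` values are placeholders):

```python
import tensorflow as tf

def soft_update(online_vars, target_vars, tau=0.005):
    # Polyak averaging: the target slowly tracks the online network.
    # This is the style of update used for the paper's runs.
    for w, w_target in zip(online_vars, target_vars):
        w_target.assign((1.0 - tau) * w_target + tau * w)

def hard_update(online_vars, target_vars, step, period=200):
    # Periodic copy: the refactored code uses this style, matching MAICQ.
    if step % period == 0:
        for w, w_target in zip(online_vars, target_vars):
            w_target.assign(w)
```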

> By the way, I tested the BC method on terran_5_5, and it is also much higher than the performance in your paper. My BC performance is similar to the "sample mean return". I speculate that this may mean that even for SMACv2 it is difficult to form multiple completely different optimal solutions, because if there were very different optimal solutions in the buffer, BC would score much lower than the "sample mean return".

I am re-running the refactored baselines on terran_5_vs_5 to see if there is a discrepancy in the BC results. I will share the results with you as soon as possible.

jcformanek commented 8 months ago

Hi @zyh1999

Here are the results from my latest benchmarking sweep: https://api.wandb.ai/links/off-the-grid-marl-team/fmnpbz44

The mean performance of BC on terran_5_vs_5 seems in line with what we reported in the paper. Maybe slightly better (8.4 rather than 7.3), but I only ran 5 seeds this time. I also ran these experiments using the code on the branch feat/replace_cpprb_withflashbax.

We will be merging that branch into main today.

zyh1999 commented 8 months ago

Thanks for your latest report. It also seems that QMIX completely collapses at the end of training, which is again different from the original results and similar to my earlier comment:

> Another thing I would like to ask about is whether these baseline performances eventually converge, or whether they only reach a higher point earlier in training. For example, when I ran QMIX on the 3m_good scenario, its performance only reached around 13.8 in the early stages of training and then gradually decreased to a much lower value.

zyh1999 commented 8 months ago

And in terran_5_5, it seems the performances of most of the methods, except DBC, are lower than the original results. I will also reproduce them with my code (based on https://github.com/oxwhirl/pymarl/) soon.

zyh1999 commented 8 months ago

Hello. Since you have updated the code and fixed some bugs, and the experimental results you subsequently presented (as mentioned above) also differ a little from those in the original paper, I sincerely hope that you could update the experimental results in the paper, as this would facilitate comparison in our follow-up work. Thank you very much!

jcformanek commented 8 months ago

Hi @zyh1999, thanks for all the discussions! I'll update the results in the paper this week! Good luck with your research! I look forward to reading it when you have published it.

zyh1999 commented 7 months ago

Hi, have you updated the experimental results? I ask because I just found that when running some environments (e.g. 5m_vs_6m_good) using the original MAICQ code (https://github.com/yiqinyang/icq), the loss quickly explodes to inf. I would like to know if you have encountered this problem with your replicated MAICQ (TensorFlow version).

zyh1999 commented 7 months ago

Also, I would like to know how many different well-trained models are typically used to collect the data in a dataset such as "Good", in order to ensure diversity among strategies of similar performance, excluding any additional perturbation data.

zyh1999 commented 6 months ago

Hello, have the results in the paper been updated?

jcformanek commented 6 months ago

Hi there, sorry about the late reply.

We used at least 4 independently trained policies for each dataset and we included epsilon=0.05 exploration to ensure diversity.
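
For illustration, the epsilon exploration during data collection amounts to something like the following (a minimal sketch; only the 0.05 value comes from the sentence above, the rest is schematic):

```python
import numpy as np

def epsilon_greedy_action(q_values, legal_actions, epsilon=0.05):
    """Pick a random legal action with probability epsilon, otherwise the
    greedy one. Shown here only to illustrate how a small epsilon adds
    diversity to the behaviour policies that generate a dataset."""
    legal = np.flatnonzero(legal_actions)
    if np.random.rand() < epsilon:
        return int(np.random.choice(legal))
    masked_q = np.where(legal_actions, q_values, -np.inf)
    return int(np.argmax(masked_q))
```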

Concerning the results, we have not finished all the benchmarks we were hoping to do before updating the paper. But if you let me know which results are particularly urgent for you, I can update those. Are they the SMAC results? And do you want the results in the paper to reflect the results here: https://instadeepai.github.io/og-marl/baselines/smac_v1/?

zyh1999 commented 6 months ago

The most urgent thing for me is that I found the loss of the official MAICQ code (https://github.com/yiqinyang/icq) sometimes explodes to inf very quickly (e.g. on 8m and 5m_vs_6m). This does not seem to happen with your TensorFlow MAICQ implementation, so I want to know whether there are any differences between them. In other words, the MAICQ results I reproduce with the official code are not always as stable as yours.

jcformanek commented 6 months ago

Yeah, I also found their official implementation can be unstable. Even my version explodes on some datasets, for example our Flatland datasets. But as you found, our version is at least always stable on SMAC.

Unfortunately, I am not 100% sure why that is. I am not aware of any major differences between our implementations. But I did implement it quite a while ago, so I may have forgotten about something. But this kind of instability can also come from a very minor implementation detail which may be hard to track down.

zyh1999 commented 6 months ago

Sorry for bothering you again. I roughly looked at your code for the MAICQ in the TensorFlow version, and it seems there are two minor differences.

First, it appears that you did not use "td_lambda_targets" at line 249 of og_marl/tf2/systems/maicq.py.

Second, you did not apply "clip_grad_norm" during gradient descent. I'm not sure if I missed anything?

Additionally, I would like to ask if you made any recent modifications to the network aspect of the og-marl version of MAICQ. I noticed that you seemed to add a CNN embedding. Will this have any effect on the performance in the SMAC environment, considering that the input for the SMAC environment is not an image?

jcformanek commented 6 months ago

Hi there, if I were you, I would try removing gradient clipping from their implementation. I have sometimes found that it affects the stability of algorithms that use mixing networks. With regards to TD lambda, I had implemented it previously and found it did not help much, so I removed it. So I do not think that is the reason the original MAICQ is failing.
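
For concreteness, removing global-norm gradient clipping means dropping a step like the one marked below from the training update (an illustrative TF2 sketch, not the actual ICQ or og-marl code; `max_norm` is a placeholder):

```python
import tensorflow as tf

def train_step(optimizer, loss_fn, variables, clip=False, max_norm=10.0):
    """One gradient update; `clip` toggles global-norm gradient clipping."""
    with tf.GradientTape() as tape:
        loss = loss_fn()
    grads = tape.gradient(loss, variables)
    if clip:
        # The step under discussion: rescale gradients to a maximum global norm.
        grads, _ = tf.clip_by_global_norm(grads, max_norm)
    optimizer.apply_gradients(zip(grads, variables))
    return loss
```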

The CNN embedding is not used in SMAC. I only add a CNN embedding for environments with pixel observations (e.g. the PettingZoo datasets).

zyh1999 commented 6 months ago

Ok, thanks, I will try it.

zyh1999 commented 6 months ago

Hello, may I ask if it is possible to obtain the original dataset without trajectory splitting, or if you could provide guidance on how to recover the complete-trajectory dataset from the existing code? I have observed that ICQ's softmax-based approximation over mini-batches incurs larger errors because of trajectory splitting. In other words, instead of performing the softmax over a batch of states s_t, it now performs the softmax over s_t and s_{t+20} together.
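
To illustrate the concern: as I understand it, ICQ re-weights samples with a softmax taken over the mini-batch, so the weight a state receives depends on which other states share its batch. A toy sketch (illustrative only; `beta` is a placeholder temperature, and this is not the actual ICQ implementation):

```python
import numpy as np

def batch_softmax_weights(q_values, beta=1.0):
    """Softmax over the batch dimension, roughly the kind of re-weighting
    ICQ computes; a sample's weight depends on which other samples happen
    to be normalised together with it."""
    z = (q_values - q_values.max()) / beta  # shift for numerical stability
    w = np.exp(z)
    return w / w.sum()

# The same state receives a different weight depending on whether it is
# normalised with other timestep-t states only, or with a mix of t and
# t+20 states coming from split trajectories.
q_t = np.array([1.0, 1.1, 0.9])
q_mixed = np.array([1.0, 1.1, 0.9, 2.5, 2.6])  # later-timestep states appended
print(batch_softmax_weights(q_t)[0])      # ~0.33
print(batch_softmax_weights(q_mixed)[0])  # ~0.08 for the same state
```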

jcformanek commented 6 months ago

Could you look at this tutorial and let me know if it does what you want.

https://colab.research.google.com/github/instadeepai/og-marl/blob/main/examples/dataset_api_demo.ipynb