Questions about reward function setting

syf980302 commented 1 year ago

I'm sorry to bother you again. I still have some questions to ask you. I hope you can help me. In the formation environment you created, the three parts are about reward, outside and collosion; I want to ask if your design for this reward is different for each agent? I think you calculate the reward of each agent separately, but I still don't understand this part. As shown in the figure below, I found that there is no numbering for the agent, so I'm not sure how you designed it.

islambarakat99 commented 1 year ago

Hi, thank you for asking!

In this part I divided the agents into two main components: Leaders and Followers , this adversary variable was to distinguish them. I know this wasn't the most intuitive way to do it, but I was in a hurry back then (during my college study).

Anyway, you can divide your agents into leaders and followers by this variable, so the leader has this adversary variable with true value, while the followers have it with false.

1 - Leader role is very simple: Follow the goal, so it has a specific negative reward corresponding to the distance between it and the goal, and it was calculated according to the Artificial Potential Field Algorithm, you can read about it here

2- Followers role is: Follow the Leader, so the distance here was calculated from the Leader with the same Algorithm.

Note: Many Changes are planned to be executed soon for better performance and clear implementation

Anyway, I am happy to help you any time!

syf980302 commented 1 year ago

First of all, thank you for replying to me in your busy schedule. I have learned a lot from your explanation. I think your reward settings for follower are separate, and each follower is calculated separately. I don't know. Am I right? I'm sorry to take up your time, but I'm really eager to know this aspect. Thank you for your reply. I also have a small question. Are you Chinese? This is my personal question. I just think you are a very nice person. I'm sorry if your privacy is involved.

islambarakat99 commented 1 year ago

Don't worry, of course I am not busy to answer your questions.

I don't think that every follower would have a different reward equation than the other followers. They are all have the same reward function, here you can notice that any agent with agent.adversary equals false has a reward function equals: [Line 71] reward -= np.sqrt(np.sum(np.square(agent.state.p_pos - world.agents[0].state.p_pos))) world.agents[0] refers to the leader, which had been asserted at this position clearly

If you mean that it will differ numerically, of course it will. When I tried giving them the same numerical values the training curve wasn't converging quite well. As every agent faces different conditions of obstacles than the other

Thanks for your nice words. I am Egyptian

syf980302 commented 1 year ago

Thank you for your reply. I understand your meaning and my confusion has been eliminated. To be honest, I feel that the multi-agent formation control based on deep reinforcement learning is somewhat difficult to understand. Thanks to your care and patience, I have a certain understanding.

After completing 50000 times of training, I found that the effect was not so good in the process of testing. There would still be collisions and the agent would not stop after reaching the target, but only after driving out of the interface.

I think it can be improved. I don't know if my idea is right? Do you think there is still room for improvement? Because I am faced with the opening of the thesis, but I am still confused. I can only learn from you and consult with you constantly. I'm sorry to have bothered you, but I really need your help.

I really appreciate your reply to me in spite of your busy schedule. I sincerely say thank you.

islambarakat99 commented 1 year ago

Again, don't worry! I am really happy to help you, you can ask about anything of course.

I agree with you, the performance is not that good, As I said there are many changes that I am planning for this project as it was implemented like 18 months ago and never been touched since, There are many principles that I have learned in Reinforcement Learning and also with Python programming that can be added to this project to enhance the performance, however this will not be very soon.

In the meanwhile there are some tricks you can do to enhance the performance:

First: in the train.py in Lines [27-29]:

parser.add_argument("--save-dir", type=str, default="/home/islam/training/policy/", help="directory in which training state and model should be saved")
parser.add_argument("--save-rate", type=int, default=100, help="save model once every time this many episodes are completed")
parser.add_argument("--load-dir", type=str, default="", help="directory in which training state and model are loaded")

you can see that this piece of code is used to store (the training state and model) in some folder of your choice in the next line, you give it the number of episodes that should save the data after they pass in the last line you load the training data from previous runs, in which default = "" which means that there is no loaded data

This is important as you can load the data from the pervious run, such that every time it runs, it doesn't have to start from the very beginning. Instead it will start from where it stopped last time.

islambarakat99 commented 1 year ago

Second: you can also play with these parameters here, to enhance the performance, in train.py, [Lines 13-24]


# Environment
parser.add_argument("--scenario", type=str, default="formation", help="name of the scenario script")
parser.add_argument("--max-episode-len", type=int, default=120, help="maximum episode length")
parser.add_argument("--num-episodes", type=int, default=50000, help="number of episodes")
parser.add_argument("--num-adversaries", type=int, default=1, help="number of adversaries")
parser.add_argument("--good-policy", type=str, default="maddpg", help="policy for good agents")
parser.add_argument("--adv-policy", type=str, default="maddpg", help="policy of adversaries")
# Core training parameters
parser.add_argument("--lr", type=float, default=1e-2, help="learning rate for Adam optimizer")
parser.add_argument("--gamma", type=float, default=0.95, help="discount factor")
parser.add_argument("--batch-size", type=int, default=1024, help="number of episodes to optimize at the same time")
parser.add_argument("--num-units", type=int, default=64, help="number of units in the mlp")


**Take care about what the help argument is saying!**

for the Environment section, you can customize your environment as much as you want, this is the space where your agents can learn and act
for the Core Training Data section, you must be aware of the Reinforcement Learning principles, to be able to tune your parameters effectively 

In this trick, I advise you not to combine it with the previous one such that no run can load the data from the previous runs which have different parameters set

So for every parameters set you have many runs loading from previous runs as much as you want, but not when any of the parameters is changed

islambarakat99 commented 1 year ago

Third, in formation.py from Lines [49-53]:
```
    for agent in world.agents:  
        agent.state.p_pos = np.random.uniform(0.1, 0.9, world.dim_p)
        agent.state.p_vel = np.zeros(world.dim_p)
        agent.state.c = np.zeros(world.dim_c)  
```
you have random assigning for the initial position of every agent, which make it more difficult to learn and predict. you may don't need this part in your design. So you may assign to them fixed initial starting points, as you like

this is also the case with the Landmarks, that the agents shouldn't collide with in formation.py Lines [34-37]

        for landmark in world.landmarks:
            landmark.color = np.array([0.15, 0.15, 0.15])
            landmark.state.p_pos = np.random.uniform(-1, +1, world.dim_p)
            landmark.state.p_vel = np.zeros(world.dim_p)

you can make their positions random or fixed based on your design

As Stochasticity increases the learning process becomes much harder

That's all, Hope that will increase the performance with you, and help you customize your work as you need.

syf980302 commented 1 year ago

OK, thank you for your valuable suggestions. I will realize and innovate one by one. If I have any substantial progress, I will share it with you as soon as possible.

Thank you for your reply. I wish you success in your studies and a happy life.

islambarakat99 commented 1 year ago

I wish you all the luck in your thesis!

syf980302 commented 1 year ago

Hello, I'm sorry to bother you again. I still have some questions that I need to consult with you and I need your help.

Question 1: You said you set the maximum speed of the intelligent agent to 0.2, but I can't find the specific code for this part.

Question 2: I don't know where the action space is defined, and I don't know where the settings for this part are.

Finally, I'm sorry to bother you again, but I really need your help. I wish you all the best.

浅握双手 @.***

------------------ 原始邮件 ------------------ 发件人: "islambarakat99/Multi-Robot-Formation-Control-using-Deep-Reinforcement-Learning" @.>; 发送时间: 2022年11月16日(星期三) 凌晨4:23 @.>; @.**@.>; 主题: Re: [islambarakat99/Multi-Robot-Formation-Control-using-Deep-Reinforcement-Learning] Questions about reward function setting (Issue #4)

I wish you all the luck in your thesis!

— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you authored the thread.Message ID: @.***>

islambarakat99 / Multi-Robot-Formation-Control-using-Deep-Reinforcement-Learning

Questions about reward function setting #4