utiasDSL / gym-pybullet-drones

PyBullet Gymnasium environments for single and multi-agent reinforcement learning of quadcopter control
https://utiasDSL.github.io/gym-pybullet-drones/
MIT License

record the learning process #53

Closed nbenave closed 2 years ago

nbenave commented 3 years ago

Hi there, very impressive work!

When I run learn.py I can see the quadcopter's attempts to fly during the learning process; however, not all attempts are shown, only a few. Is there any way to see all of the attempts? I'd like to preview the learning process visually.

Also, is this available in singleagent.py as well?

Thanks,

JacopoPan commented 3 years ago

Hi @nbenave,

when you run

python  gym-pybullet-drones/examples/learn.py

what you see at the end is the trained model applied to the quadrotor, i.e. line 88: https://github.com/utiasDSL/gym-pybullet-drones/blob/c62e67ab2dca8580e907ec45f95b1e24eba0bd0e/examples/learn.py#L88

The resulting performance is not great because learn.py is an example script that learns over "only" 10000 steps: https://github.com/utiasDSL/gym-pybullet-drones/blob/c62e67ab2dca8580e907ec45f95b1e24eba0bd0e/examples/learn.py#L56

If you want to watch those 10000 steps, you only need to change this line https://github.com/utiasDSL/gym-pybullet-drones/blob/c62e67ab2dca8580e907ec45f95b1e24eba0bd0e/examples/learn.py#L42 to

env = gym.make("takeoff-aviary-v0", gui=True) 

However, I think you'll realize that adding the frontend and rendering can make the learning prohibitively time-consuming.

In singleagent.py I used stable-baselines3's EvalCallback to save a model every time it improves performance: https://github.com/utiasDSL/gym-pybullet-drones/blob/c62e67ab2dca8580e907ec45f95b1e24eba0bd0e/experiments/learning/singleagent.py#L235

You might want to do something similar to visualize how the agent changes during learning "offline".
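Roughly, that pattern looks like this (a sketch only, not the exact code in singleagent.py; the PPO choice, paths, and evaluation frequency are just placeholders):

    # Sketch: evaluate on a separate env every eval_freq steps and
    # checkpoint the best-performing model seen so far.
    import gym
    import gym_pybullet_drones  # registers the *-aviary-v0 environments with gym
    from stable_baselines3 import PPO
    from stable_baselines3.common.callbacks import EvalCallback

    train_env = gym.make("takeoff-aviary-v0")   # training env, no GUI
    eval_env = gym.make("takeoff-aviary-v0")    # separate env used only for evaluation

    eval_callback = EvalCallback(eval_env,
                                 best_model_save_path="./results/",  # placeholder path
                                 log_path="./results/",
                                 eval_freq=2000,                     # placeholder frequency
                                 deterministic=True)

    model = PPO("MlpPolicy", train_env, verbose=1)
    model.learn(total_timesteps=10000, callback=eval_callback)

Each checkpoint saved this way can later be loaded with PPO.load() and replayed in an environment created with gui=True to see how the policy evolved during training.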

nbenave commented 3 years ago

Thank you for your quick and detailed answer!

In the "Show performance" code section (lines 72-101), will the environment display the model's top performance?

I have a few short questions, if you could clarify a few things:

  1. Are 10,000 timesteps equivalent to 10 seconds of training?
  2. The reward changes from about -200 at the initial steps and can reach about -20; what is the optimal reward, and what does this numeric value represent in this environment?
  3. In line 81, the range of the for loop is range(3*env.SIM_FREQ); can you explain why it iterates over SIM_FREQ, and why it is multiplied by 3?

Thanks again.

JacopoPan commented 3 years ago

Briefly

nbenave commented 3 years ago

Thank you again, mate. Now it's clearer to me :)

Another question about the multi-agent learning: is the training for both of the quadcopters? Is each quadcopter trained separately? Does each of them observe simultaneously, or is there a joint observation?

Is the reward at each step related to the follower, the leader, or both of them?

Thanks!

JacopoPan commented 3 years ago

The MARL example in multiagent.py is based on the centralized critic examples of RLlib, so yes, both agents learn: there is some postprocessing that goes into creating the observations of each agent, and each agent has its own reward signal. The multi-agent script, in my intention, was meant as a demonstration of how a multi-agent environment can be used. The best way to do MARL is still a bit up for debate, imho.
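To make that concrete, stepping a 2-drone aviary returns per-agent dictionaries in RLlib's MultiAgentEnv style; here is a rough sketch (the LeaderFollowerAviary import path and constructor arguments may differ slightly depending on the version you have installed):

    # Sketch: per-agent dicts from a 2-drone multi-agent aviary
    # (LeaderFollowerAviary is assumed here; check your installed version).
    from gym_pybullet_drones.envs.multi_agent_rl.LeaderFollowerAviary import LeaderFollowerAviary

    env = LeaderFollowerAviary(num_drones=2)
    obs = env.reset()                                             # {0: obs_0, 1: obs_1}
    action = {i: env.action_space[i].sample() for i in range(2)}  # one action per agent
    obs, reward, done, info = env.step(action)

    print(reward)   # one reward signal per agent, e.g. {0: r_0, 1: r_1}
    print(done)     # per-agent flags plus the "__all__" episode-end key used by RLlib
    env.close()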

nbenave commented 3 years ago

Thanks again. How is the reward calculated in multi-agent learning?

There's a reward for drone 0 and a reward for drone 1, but I don't understand how you calculate the overall reward.

Is there an equation for combining these two rewards into one overall reward?

I'm using TensorBoard and the mean-reward graph displays only one value, not one for each of the two drones.

JacopoPan commented 3 years ago

The multi-agent aviary returns a dictionary of rewards because each agent can receive its own signal. How to use these to learn multiple critics/value functions depends on the MARL approach you are implementing (see parameter sharing vs. fully independent learning vs. centralized critic, etc.). Off the top of my head, I don't remember what value you'd see on TB.
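If you want a single learning curve, you have to pick the aggregation yourself; here is a rough sketch (mean or sum are the two common choices, neither is imposed by the environment):

    # Sketch: collapsing the per-agent reward dict into one scalar for logging.
    # The environment itself does not combine the rewards; this choice is up to you.
    def aggregate_rewards(reward_dict, mode="mean"):
        """Turn {agent_id: reward} into a single scalar."""
        values = list(reward_dict.values())
        if mode == "sum":
            return sum(values)                # joint/team reward
        return sum(values) / len(values)      # mean over agents

    step_reward = {0: -1.8, 1: -2.4}          # illustrative values only
    print(aggregate_rewards(step_reward))     # mean: -2.1 (sum would be -4.2)

RLlib's trainers log their own aggregate (which may explain the single mean-reward curve you see), but I'd check the trainer's logging to confirm exactly which aggregation it uses.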