utiasDSL / gym-pybullet-drones

PyBullet Gymnasium environments for single and multi-agent reinforcement learning of quadcopter control
https://utiasDSL.github.io/gym-pybullet-drones/
MIT License

Have you ever confirmed controlling one drone with "rpm" using learn.py? #180

Closed paehal closed 6 months ago

paehal commented 7 months ago

Hello, it's been a long while.

I haven't touched this repository much lately, but I'm glad to see that there has been a lot of progress.

I have one question, and it is something I had trouble with in a previous version. Have you ever seen a configuration where "rpm" is sufficient for the drone to learn its own policy, instead of "one_d_rpm"?

I think "rpm" is still more difficult to control. However, I believe that "rpm" control is a necessary setting for a drone flying around freely in 3D space.

Best regards,

JacopoPan commented 7 months ago

Hi @paehal , I trained stable-baselines3 PPO to hover with just RPMs (in the plus/minus 5% range of the hover value) back in 2020, without yaw control (as it wasn't penalized in the reward). I agree it's a more difficult RL problem, and that's why the base RL aviary class includes simplified action spaces for the 1D and velocity control cases.

https://github.com/utiasDSL/gym-pybullet-drones/assets/19269261/89ee249e-9738-49c0-921c-b00cc74718be

This was a 4-layer architecture [256, 256, 256, 128] (2 shared layers, 2 separate layers each for the value and policy heads), with a 12-dimensional input [position, orientation, velocity, angular velocity] mapped to 4 motor velocities (±5% around the hover RPMs), after 8 hours and ~5M time steps (48 Hz control).
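For reference, a rough sketch of how an architecture along those lines could be expressed in the SB3 versions of that era (pre-1.8 net_arch notation, shared layers followed by separate heads); this is not the original 2020 training script, and train_env is a placeholder for an RPM-action hover environment.

# A sketch in the pre-1.8 SB3 net_arch notation (2 shared layers, then
# separate policy/value heads), mirroring the description above; not the
# original 2020 training script. train_env is a placeholder environment.
from stable_baselines3 import PPO

model = PPO("MlpPolicy",
            train_env,
            policy_kwargs=dict(net_arch=[256, 256,                # shared layers
                                         dict(pi=[256, 128],      # policy head
                                              vf=[256, 128])]),   # value head
            verbose=1)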

paehal commented 7 months ago

@JacopoPan

Thanks for the reply and for sharing the video. I'm glad to hear that RPM control has worked stably in the past.

I would like to run a study under the same conditions as yours in the latest repository; is that possible?

Here is what I am wondering.

Do I just run "python learn.py" with action type as rpm? Do I need to set up a new action that does not control yaw? Do I also need to change the reward settings?

JacopoPan commented 7 months ago

Do I just run "python learn.py" with action type as rpm?

yes

Do I need to set up a new action that does not control yaw?

no, the action will be a vector of size 4 with the desired RPMs of each motor (in fact, a plus/minus 5% range centered on the hover RPMs)
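For intuition, the mapping from the normalized 4-vector action to motor RPMs works roughly as sketched below (based on the ±5% description above; the exact conversion lives in the RL aviary's action preprocessing, and the hover RPM value here is only illustrative).

import numpy as np

HOVER_RPM = 14468.0  # illustrative value; the environment computes its own hover RPM

def action_to_rpm(action):
    """Map a normalized action in [-1, 1]^4 to motor RPMs within ±5% of hover."""
    action = np.clip(np.asarray(action, dtype=float), -1.0, 1.0)
    return HOVER_RPM * (1.0 + 0.05 * action)

print(action_to_rpm([0.0, 0.0, 0.0, 0.0]))   # all motors exactly at the hover RPM
print(action_to_rpm([1.0, 1.0, 1.0, 1.0]))   # all motors at +5% of the hover RPM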

Do I also need to change the reward settings?

What is mainly different in the current HoverAviary is that the reward is always positive (instead of including negative penalties), it is only based on position (the result above also included a reward component based on the velocity), and the environment does not terminate early if the quadrotor flips or flies out of bounds. It might be necessary to reintroduce some of those details.
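For example, a penalty or a velocity-based component could be reintroduced by overriding HoverAviary's reward method, roughly as sketched below (the weights are illustrative, and the state indices follow the 20-element state vector convention used in the repo; double-check them against the current code).

import numpy as np
from gym_pybullet_drones.envs.HoverAviary import HoverAviary

class PenalizedHoverAviary(HoverAviary):
    def _computeReward(self):
        state = self._getDroneStateVector(0)
        pos, vel = state[0:3], state[10:13]
        # Positive shaping term for being close to the hover target...
        reward = max(0.0, 2.0 - np.linalg.norm(self.TARGET_POS - pos)**4)
        # ...minus an (illustrative) velocity penalty, so the drone is not
        # rewarded for rushing through the target point.
        reward -= 0.1 * np.linalg.norm(vel)
        return reward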

paehal commented 7 months ago

the environment does not terminate early if the quadrotor flips or flies out of bounds

Let me confirm: in the latest repository, does the environment terminate if the quadrotor flips or flies out of bounds? If not, how can I change the simulation settings so that it does?

JacopoPan commented 7 months ago

No, you can add that to the

https://github.com/utiasDSL/gym-pybullet-drones/blob/9a9ca8ac1c44b9131dd524bec63843e236a734d2/gym_pybullet_drones/envs/HoverAviary.py#L100

method

(FYI, the reward achieved by a "successful" one-dimensional hover is ~470 (in ~3 minutes of training on my machine); I just tried training the 3D hover, as is, for ~30 minutes and it stopped at a reward of ~250.)
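A minimal sketch of reintroducing such a truncation condition by overriding the linked method in a HoverAviary subclass (thresholds are illustrative, and the attribute names should be checked against the current code).

import numpy as np
from gym_pybullet_drones.envs.HoverAviary import HoverAviary

class TruncatingHoverAviary(HoverAviary):
    def _computeTruncated(self):
        state = self._getDroneStateVector(0)
        pos, rpy = state[0:3], state[7:10]
        # End the episode if the drone drifts too far from the workspace...
        if abs(pos[0]) > 1.5 or abs(pos[1]) > 1.5 or pos[2] > 2.0:
            return True
        # ...or tilts past roughly 45 degrees (a crude "flipping" check).
        if abs(rpy[0]) > 0.8 or abs(rpy[1]) > 0.8:
            return True
        # Otherwise, truncate only when the episode length is exceeded.
        return self.step_counter / self.PYB_FREQ > self.EPISODE_LEN_SEC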

JacopoPan commented 7 months ago

Hi @paehal

I added back the truncation condition and trained this in ~10 minutes (this is the current code in main):

https://github.com/utiasDSL/gym-pybullet-drones/assets/19269261/9e1edda2-380c-4c19-bbe0-c3dab01e5b58

paehal commented 6 months ago

@JacopoPan Thank you for your response, it was very informative. I tried training in a similar way and obtained the following results. (Although the training time was different, I believe the results are quite close to yours.)

[image: training results]

Related to this, I have a question: how can I load a trained model in a different job and save a video of its performance? Even when I set --record_video to True, the video is not saved. Also, when I tried to load a different trained model with the following settings, targeting a model in a specified folder, an error occurred. Since I'm not familiar with stable_baselines3, I would appreciate it if you could help me identify the cause.

if resume and os.path.isfile(filename+'/best_model.zip'):
    path = filename+'/best_model.zip'
    model = PPO.load(path)
    print("Resume Model Complete")

[Error content]

python3.10/site-packages/stable_baselines3/common/base_class.py", line 422, in _setup_learn
    assert self.env is not None
AssertionError

In a previous version, there was something like test_learning.py, which, when executed, allowed me to verify the behavior in a video.

JacopoPan commented 6 months ago

The current version of the script gym_pybullet_drones/examples/learn.py does include re-loading the model and rendering its performance; you should be able to do what you want by modifying it. (I would guess your error arises from not having initialized the PPO model with the target environment before loading the trained model, but I haven't encountered it myself.)
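A minimal sketch of that workflow, assuming the AssertionError comes from calling .learn() (or another training utility) on a model loaded without an environment; the path and HoverAviary settings below are placeholders.

from stable_baselines3 import PPO
from gym_pybullet_drones.envs.HoverAviary import HoverAviary

eval_env = HoverAviary(gui=True, record=True)              # record=True asks PyBullet to save a video
model = PPO.load('results/best_model.zip', env=eval_env)   # attach the env at load time
# (equivalently: model = PPO.load(path) followed by model.set_env(eval_env))

obs, info = eval_env.reset(seed=42)
for _ in range(int(eval_env.EPISODE_LEN_SEC * eval_env.CTRL_FREQ)):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = eval_env.step(action)
    if terminated or truncated:
        break
eval_env.close()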

paehal commented 6 months ago

@JacopoPan

Thank you for the quick response. I was able to understand what you meant by carefully reading the code, and I confirmed that the evaluation works when run right after training. Since I wanted to run a pretrained model without retraining it, I achieved that by making some changes to the code.

Also, this is a different question (please let me know if it would be better to create a separate issue): I believe that increasing ctrl_freq generally improves control (e.g., hovering). So here are my questions:

  1. Is ctrl_freq the same as the frequency at which observations are obtained?
  2. Are there any key points in the learning setup that need to be changed when increasing ctrl_freq? I think I probably need to increase gamma, but I'd like to know if there are any other adjustments I should make.

JacopoPan commented 6 months ago

Ctrl freq is both the frequency at which observations are produced and actions are taken by the environment. (Sim freq is the frequency at which the PyBullet step is called, normally greater than ctrl freq).

The main thing to note is that the observation contains the actions of the last 0.5 seconds, so increasing the ctrl freq will increase the size of the observation space.
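As a back-of-the-envelope check (assuming the buffer stores one 4-dimensional RPM action per control step for half a second, as suggested by the ACTION_BUFFER_SIZE line quoted later in the thread):

def extra_obs_dims(ctrl_freq, action_dim=4):
    action_buffer_size = ctrl_freq // 2      # actions from the last 0.5 s
    return action_buffer_size * action_dim

print(extra_obs_dims(48))    # 96 extra observation dimensions at 48 Hz
print(extra_obs_dims(240))   # 480 extra observation dimensions at 240 Hz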

paehal commented 6 months ago

Thank you for your reply.

Ctrl freq is both the frequency at which observations are produced and actions are taken by the environment.

My understanding aligns with this, which is great. Is it also correct to say that this PyBullet step is responsible for the actual physics simulation?

The main thing to note is that the observation contains the actions of the last 0.5 seconds, so increasing the ctrl freq will increase the size of the observation space.

This corresponds to the following part of the code, right?

self.ACTION_BUFFER_SIZE = int(ctrl_freq//2)

I'm asking out of curiosity, but where did the idea of using actions from the last 0.5 seconds as observations come from? Was it from a paper or some other source?

Additionally, if I want to change the MLP network model when increasing ctrl_freq (because the action buffer becomes too large), would the following setup be appropriate? Have you had any experience changing the MLP network structure in a similar situation?

# Define a custom policy network (imports added; note that DummyVecEnv
# expects a list of callables returning environments, not env instances)
from stable_baselines3 import PPO
from stable_baselines3.common.policies import ActorCriticPolicy
from stable_baselines3.common.vec_env import DummyVecEnv

class CustomPolicy(ActorCriticPolicy):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs,
                         net_arch=[256, 256])

# Make a PPO model using the custom policy network
model = PPO(CustomPolicy,
            DummyVecEnv([lambda: train_env]),
            verbose=1)

JacopoPan commented 6 months ago

The sim/pybullet frequency is the actual physics integration frequency, yes.

The idea of the action buffer is that the policy might be better guided by knowing what the controller did just before; making the buffer proportional to the control frequency means it depends only on wall-clock time, and not on the type of controller (but it might be appropriate to change that, depending on the application).

For custom SB3 policies, I can only refer you to the relevant documentation: https://stable-baselines3.readthedocs.io/en/master/guide/custom_policy.html

I used different critic/actor network sizes in past SB3 versions but the current focus of this repo is having very few dependencies and compatibility with the simplest/most stock versions of them.
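For completeness, a minimal sketch of the lighter-weight route in current SB3 (2.x) versions, where separate actor/critic sizes are passed via policy_kwargs instead of subclassing the policy; train_env is assumed to be an already-constructed environment, and the layer sizes are illustrative.

from stable_baselines3 import PPO

model = PPO("MlpPolicy",
            train_env,
            policy_kwargs=dict(net_arch=dict(pi=[512, 256],    # actor layers
                                             vf=[512, 256])),  # critic layers
            verbose=1)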

paehal commented 6 months ago

@JacopoPan Thank you for your comment. I have tried several experiments since last week, and my conclusion so far is that feeding in the actions taken in the previous steps leads to unstable learning. Although I haven't fully learned control at 240 Hz yet, I plan to try out various conditions in the future. If I have any further questions, I will ask.

zcase commented 3 months ago

@JacopoPan how did you arrive at ~470 as the reward value for a "successful" training run, i.e., a successful hover?