utiasDSL / gym-pybullet-drones

PyBullet Gymnasium environments for single and multi-agent reinforcement learning of quadcopter control
https://utiasDSL.github.io/gym-pybullet-drones/
MIT License

[QUESTION] Training complex scenarios #17

Closed chris-aeviator closed 3 years ago

chris-aeviator commented 3 years ago

When using HoverAviary with singleagent.py I can perfectly simulate training toward mission points by setting parameters like MAX_Z (the maximum height to reach) or MAX_LIN_VEL_Z (how fast to move up).

I can see the RPM curves being perfectly smooth for my mission goal with less than 10 min of training, awesome!

Now I am trying to understand how to proceed from here to more complex scenarios. I introduced the concept of wind by applying a constant Y-directed force of 0.04 at every step in BaseAviary with p.applyExternalForce.

This seems to do the job for the wind simulation when running fly.py, but when training on this I cannot see any RPM variations being tried during training.


The UAV consequently "moves with the wind"


My question is: how can I add variation to the motor RPMs in the training steps so that I end up with a wind-compensating controller like the one described in this example?

When running my training with PPO I do get variation in the RPMs, but it is synced across all rotors, so no corrective force or moment is created to counteract the wind.


P.S. This toolset gives one of the best experiences I've had working with UAV simulation: I can manipulate my UAV and see how it flies with fly.py, I can train a controller and watch its progress while it's training with test_singleagent.py, and all your code is well structured and documented. Big thanks!

EDIT:

I might be messing up the wind idea. I've added

        # wind
        p.applyExternalForce(self.DRONE_IDS[nth_drone],
                             -1,
                             forceObj=[0, 0.10, 0],
                             posObj=[0, 0, 0],
                             flags=p.LINK_FRAME,
                             physicsClientId=self.CLIENT
                             )

inside BaseAviary, just before the propeller forces are added.
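
One detail worth double-checking in that snippet: with flags=p.LINK_FRAME both forceObj and posObj are expressed in the drone's local frame, so the "wind" rotates with the vehicle. If the intent is a wind that stays fixed along the global Y axis, a world-frame variant could look like the sketch below (this is an assumption about what you want, not code from the repo; p.getBasePositionAndOrientation is standard PyBullet, and with p.WORLD_FRAME posObj must be a world-frame point of application):

    # Possible world-frame variant: the force direction stays fixed in world
    # coordinates even as the drone rotates.
    drone_pos, _ = p.getBasePositionAndOrientation(self.DRONE_IDS[nth_drone],
                                                   physicsClientId=self.CLIENT)
    p.applyExternalForce(self.DRONE_IDS[nth_drone],
                         -1,                      # link index -1 = the base
                         forceObj=[0, 0.10, 0],   # constant wind along world +Y
                         posObj=drone_pos,        # apply at the drone's current position
                         flags=p.WORLD_FRAME,
                         physicsClientId=self.CLIENT
                         )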

EDIT2: I made sure to constrain the mission by setting

    MAX_XY = 0.5
    MAX_LIN_VEL_XY = 0.4
    MAX_Z = 1
JacopoPan commented 3 years ago

Thank you again @chris-aeviator, the whole repo is still partially work-in-progress and all feedback is appreciated!

Just a few remarks to make sure we are on the same page:

I guess that your problem arises from the fact that the default ActionType used in singleagent.py is one_d_rpm (https://github.com/utiasDSL/gym-pybullet-drones/blob/bf173d0e87f26ed197fdf7e277730fd189d58f26/experiments/learning/singleagent.py#L60), i.e., the RPMs of all propellers are the same in the learned controller. You would want to change that to just rpm to apply 4 different actions to the propellers. I have to warn you that that's where the learning problem gets a lot more complicated (even a simple hover might take hours, rather than minutes, to learn).
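
For reference, a minimal sketch of what the two action types look like when the environment is built directly in Python (the module paths, the HoverAviary class, and the ActionType/ObservationType enum members are assumed from the repo's single-agent RL code of that era; double-check them against your checkout):

    from gym_pybullet_drones.envs.single_agent_rl.HoverAviary import HoverAviary
    from gym_pybullet_drones.envs.single_agent_rl.BaseSingleAgentAviary import ActionType, ObservationType

    # one_d_rpm: a single scalar action, applied identically to all four motors
    env_1d = HoverAviary(obs=ObservationType.KIN, act=ActionType.ONE_D_RPM)
    print(env_1d.action_space.shape)   # expected: (1,)

    # rpm: one action per propeller, so the policy can create differential thrust
    # (and therefore the roll/pitch/yaw moments needed to counteract the wind)
    env_4d = HoverAviary(obs=ObservationType.KIN, act=ActionType.RPM)
    print(env_4d.action_space.shape)   # expected: (4,)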

chris-aeviator commented 3 years ago

@JacopoPan thanks for your explanation, yes, we are on the same page. Since I'm evaluating an airframe design, I've been using fly.py to test and compare certain parameters and also to evaluate the wind.

I'm OK with training for hours and this is expected. With the `one_d_rpm` setting (thanks for making that clear!) my loss converged to around -5576 and I stopped the training after 3 hrs. I will now try to understand and implement the separate actions needed for each of the propellers.

EDIT: are you saying I just need to set --act rpm to have the 4 rotors controlled independently?

EDIT 2: it seems like so :+1: :rocket: :partying_face:


EDIT 3: for anybody interested in my use case: after 1 hr of training with A2C I still got really poor results (not even flying), so I switched to PPO, which shows much higher GPU utilization (up to 75% compared to a max of 20% on A2C) and seems to perform much better: the time before the craft crashes is 1.5 s after 1 hr of training with A2C versus about 4.5 s after only about 15 min of training with --alg ppo. Even though I've set --cpu 12, I can only see one core being utilized. I'll keep posting results here.

GPU utilization with --alg ppo: up to ~75% (screenshot omitted).

GPU utilization with --alg a2c: steady 20%.

EDIT 4:

Training 1 hr 20 min in:

The vehicle manages to counteract the constant Y-directed wind force, though the desired Z position (1) is not reached yet.

https://user-images.githubusercontent.com/11522213/104123195-ef73e500-5349-11eb-90de-a566912f10ac.mp4

JacopoPan commented 3 years ago

Yes, not all algorithms might be equally successful. You might want to look at changing the number of neurons and layers in the networks in singleagent.py. Reward shaping and trying to limit the range of RPMs for each propeller might also be options (e.g., if your wind is along the global y axis and the quad is in the x configuration facing +x, you might simplify the problem by commanding the same RPMs to props 0-1 and 2-3; see the sketch below).
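
One way to realize that prop-pairing simplification (a hypothetical sketch, not something implemented in the repo; the propeller indices are an assumption and should be checked against the airframe's motor layout):

    import numpy as np

    def paired_action_to_rpm(action_2d):
        """Expand a 2-element action [a_01, a_23] into 4 per-propeller commands,
        tying props 0-1 together and props 2-3 together."""
        a_01, a_23 = action_2d
        return np.array([a_01, a_01, a_23, a_23])

    # The policy only has to learn 2 values instead of 4, while the
    # environment still receives one command per propeller:
    print(paired_action_to_rpm(np.array([0.2, -0.2])))   # -> [ 0.2  0.2 -0.2 -0.2]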

The goal of this repo is to give you the tools to try all these things; I don't think I've solved the entire problem of generic control with RL yet :D

This is an example of hover that was learned by PPO over ~8hrs.

https://user-images.githubusercontent.com/19269261/104129163-85534400-5339-11eb-854f-7ae142c268a0.mp4

chris-aeviator commented 3 years ago

In your video the vehicle has a crazy spin around the Z axis when hovering, and my training vehicle tends to flip over (no new best reward for 2 hrs). Would

Reward shaping

mean "punishing it" for this behaviour? Would I, for example, apply a reward-decreasing factor when experiencing high angular velocities, or can I even punish no-go scenarios like flipping over with a -1?

JacopoPan commented 3 years ago

Yes, if the reward function does not account for yaw or the z-axis turn rate (as is the case in that example), the learning agent cannot distinguish between a spinning and a non-spinning hover.

You can try to speed up/"guide" your learning agent by customizing _computeReward() (the reward of a given state or state-action pair) and _computeDone() (the conditions for terminating an episode); a sketch of both is below.
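
As an illustration, a hypothetical sketch of that kind of shaping. The _computeReward()/_computeDone() names are the ones mentioned above; the state indices assume the 20-element vector returned by _getDroneStateVector() (position in [0:3], roll/pitch/yaw in [7:10], angular velocities in [13:16]), SIM_FREQ and EPISODE_LEN_SEC are attributes assumed from the base aviary, and the weights are arbitrary:

    import numpy as np
    from gym_pybullet_drones.envs.single_agent_rl.HoverAviary import HoverAviary

    class ShapedHoverAviary(HoverAviary):

        def _computeReward(self):
            state = self._getDroneStateVector(0)
            pos_error = np.linalg.norm(np.array([0, 0, 1]) - state[0:3])  # distance to a z=1 hover target
            spin_penalty = 0.01 * np.linalg.norm(state[13:16])            # discourage high angular velocities
            return -pos_error**2 - spin_penalty

        def _computeDone(self):
            state = self._getDroneStateVector(0)
            flipped = abs(state[7]) > np.pi / 2 or abs(state[8]) > np.pi / 2  # roll/pitch beyond 90 deg
            timed_out = self.step_counter / self.SIM_FREQ > self.EPISODE_LEN_SEC
            return flipped or timed_out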

chris-aeviator commented 3 years ago

Thanks Jacopo for all your help and responsiveness! I'll plan these next steps and leave more findings in this GH issue within the next few days; please feel free to close it (or not) :0

JacopoPan commented 3 years ago

👌 Note that I don't think there's a silver bullet for those implementations, and every use case is very much of general interest (I'll keep the issue open).

rogerscristo commented 3 years ago

Hi @JacopoPan and @chris-aeviator. I've been following this issue and it has helped me a lot to understand the simulator better. Regarding training the hover task with PPO: @JacopoPan, can you please provide the hyperparameters and reward used to achieve such good results? I've tried training with the default parameters over 50 million timesteps but did not reach anything like your video. I've also tried some reward shaping, as well as tuning the PPO parameters, but again without success.

Thank you in advance!


JacopoPan commented 3 years ago

@rogerscristo I didn't tag the commit that result came out of, but I remember it was one of those I obtained when I was testing sa_script.bash and sa_script.slrm on the computing cluster (for 8+ hrs). I don't think the reward has changed much (even if I tried a few variations of it, it has always been either a stepwise function, a distance, or a quadratic distance along the z axis). Originally I had 256 instead of 512 units in the first layer of the networks. I did not touch any other hyperparameter.

I don't think it should be too surprising if some of the training runs do not succeed: during that set of experiments only PPO and SAC produced "decent" policies. My general suggestion is to start a few experiments in parallel and to make sure that the network capacities are appropriate for the task at hand by checking that the learning curves are somewhat stable (a sketch of how to adjust the network size is below).
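
For what it's worth, a sketch of that kind of network-capacity tweak using stable-baselines3's policy_kwargs (the layer sizes are illustrative, not the exact values in singleagent.py; the HoverAviary/ActionType imports follow the earlier sketch and are assumptions about your setup):

    from stable_baselines3 import PPO
    from gym_pybullet_drones.envs.single_agent_rl.HoverAviary import HoverAviary
    from gym_pybullet_drones.envs.single_agent_rl.BaseSingleAgentAviary import ActionType, ObservationType

    env = HoverAviary(obs=ObservationType.KIN, act=ActionType.RPM)

    # Wider/deeper MLP for both the policy and value networks; compare e.g. a
    # 256-unit first layer against a 512-unit one, as discussed above.
    model = PPO("MlpPolicy",
                env,
                policy_kwargs=dict(net_arch=[512, 256, 128]),
                verbose=1)
    model.learn(total_timesteps=1_000_000)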

rogerscristo commented 3 years ago

Thank you @JacopoPan for the directions. I will try to generate new learning examples to complement the documentation.

Thanks a lot!

4ku commented 1 year ago

@rogerscristo @chris-aeviator Did you solve this problem? I also have some problems with training. I am trying to train PPO from stable_baselines3 but I don't get any good results.