Closed: chris-aeviator closed this issue 3 years ago
Thank you again @chris-aeviator, the whole repo is still partially work-in-progress and all feedback is appreciated!
Just a few remarks to make sure we are on the same page:
fly.py per se does not involve learning; it is purely a flight simulation using PID control.
I guess that your problem arises from the fact that the default ActionType used in singleagent.py is one_d_rpm
https://github.com/utiasDSL/gym-pybullet-drones/blob/bf173d0e87f26ed197fdf7e277730fd189d58f26/experiments/learning/singleagent.py#L60
I.e., the RPMs of all propellers are the same in the learned controller: you would want to change that to just rpm to apply 4 different actions to the propellers. I have to warn you that that's where the learning problem gets a lot more complicated (even a simple hover might take hours, rather than minutes, to learn).
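To make the distinction concrete, here is a rough sketch (not the repo's actual code: the function names, the hover-RPM value, and the 5% scaling are all illustrative assumptions) of what the two action types let the policy control:

```python
HOVER_RPM = 14000.0  # placeholder value, not the model's real hover RPM

def one_d_rpm_to_motors(action):
    """one_d_rpm: a single scalar action, applied identically to all four
    propellers, so the policy can only climb or descend, never tilt or yaw."""
    rpm = HOVER_RPM * (1.0 + 0.05 * action)
    return [rpm, rpm, rpm, rpm]

def rpm_to_motors(action):
    """rpm: a 4-dimensional action, one entry per propeller, so roll, pitch
    and yaw torques become possible (and the learning problem gets harder)."""
    return [HOVER_RPM * (1.0 + 0.05 * a) for a in action]
```

With `one_d_rpm` every motor always spins at the same speed; with `rpm` the policy must also learn not to destabilize itself, which is why training takes much longer.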
@JacopoPan thanks for your explanation, yes we are on the same line. Since I'm evaluating an airframe design, I mentioned fly.py to test and compare certain parameters, and I also evaluated the wind.
I'm ok training for hrs and this is expected. With the `one_d_rpm` setting (thanks for making that clear!) my loss converged to around -5576 and I stopped the training after 3 hrs. I will now try to understand and implement the necessary separate actions for each of the propellers.
EDIT: you are saying I will just need to set --act rpm to achieve the 4 rotors being processed independently?
EDIT 2: it seems like so :+1: :rocket: :partying_face:
EDIT 3: for anybody interested in my use case: after 1 hr of training with a2c I still got really poor results (not even flying), so I switched to PPO. It shows much higher GPU utilization (up to 75%, compared to a steady ~20% max with A2C) and seems to perform way better: the time before the craft crashes is 1.5 s after 1 hr of training with --alg a2c, and about 4.5 s after only ~15 min of training with --alg ppo. Even though I've set --cpu 12, I can only see one core being utilized. I'll keep posting results here.
EDIT 4:
Vehicle manages to counteract the Y-directed constant wind force, though the desired Z-position (1) is not reached yet.
Yes, not all algorithms might be equally successful. You might want to look at changing the number of neurons and layers in the networks in singleagent.py. Reward shaping and trying to limit the range of RPMs for each propeller might also be options (e.g., if your wind is along the global y axis and the quad in the x configuration facing +x, you might simplify the problem commanding the same RPMs to prop 0-1 and 2-3).
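A minimal sketch of that pairing simplification (the motor indexing, hover-RPM value, and scaling here are assumptions, not the repo's exact convention): a 2-dimensional action is mapped onto the two propeller pairs, halving the action space the policy has to explore.

```python
HOVER_RPM = 14000.0  # placeholder, not the model's real hover RPM

def paired_action_to_rpms(action_2d, scale=0.05):
    """Command the same RPM to props 0-1 and to props 2-3, so the policy
    only learns a pitch-like response against a wind along one axis."""
    a_pair_a, a_pair_b = action_2d
    rpm_a = HOVER_RPM * (1.0 + scale * a_pair_a)  # props 0 and 1
    rpm_b = HOVER_RPM * (1.0 + scale * a_pair_b)  # props 2 and 3
    return [rpm_a, rpm_a, rpm_b, rpm_b]
```

The design trade-off: a 2-D action cannot produce yaw or roll, which is exactly what makes learning faster when the disturbance only needs a single torque direction to counter.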
The goal of this repo is to give you the tools to try all these things, I don't think I've solved the entire problem of addressing generic control with RL yet :D
This is an example of hover that was learned by PPO over ~8hrs.
In your video the vehicle has a crazy spin around the Z axis when hovering; my training vehicle is tending to flip over (no new best reward for 2 hrs). Would "reward shaping" mean "punishing it" for this behaviour? Would I, for example, apply a reward-decreasing factor when experiencing high angular velocities, or can I even punish no-go scenarios like flipping over with a -1?
Yes, if the reward function does not account for yaw or the z-axis turn rate (as is the case in that example), the learning agent cannot distinguish between a spinning and a non-spinning hover.
You can try to speed up/"guide" your learning agent by customizing _computeReward() (the reward of a given state or state-action pair) and _computeDone() (the conditions for terminating an episode).
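As a hedged sketch of that customization (the target height, penalty weight, and episode length below are guesses to tune, and the argument layout is simplified from the aviary's state vector), the two overrides could look roughly like:

```python
import math

TARGET_Z = 1.0  # desired hover height, as in the discussion above

def compute_reward(pos, rpy, ang_vel):
    """Quadratic distance to the hover height, plus a penalty on the
    z-axis turn rate so a spinning hover scores worse than a steady one."""
    dist_penalty = (TARGET_Z - pos[2]) ** 2
    spin_penalty = 0.1 * ang_vel[2] ** 2  # weight 0.1 is a guess to tune
    return -dist_penalty - spin_penalty

def compute_done(rpy, step, max_steps=240 * 8):
    """End the episode early on a flip-over (roll or pitch beyond 90
    degrees), so the agent cannot keep collecting reward upside down."""
    flipped = abs(rpy[0]) > math.pi / 2 or abs(rpy[1]) > math.pi / 2
    return flipped or step >= max_steps
```

In the repo, logic like this would live inside `_computeReward()` and `_computeDone()` of a custom aviary subclass; early termination on a flip is the "-1 for no-go scenarios" idea in a slightly different form.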
Thanks Jacopo for all your help and responsiveness! I'll plan these next steps and leave more findings in this GH issue within the next days, please feel free to close it (or not) :0 .
👌 Note that I don't think there's a silver bullet for those implementations, and every use case is very much of general interest (I'll keep the issue open).
Hi @JacopoPan and @chris-aeviator . I've been following this issue and it helped me a lot to understand the simulator better. Regarding training the hover task with PPO: @JacopoPan, can you please provide the hyperparameters and reward to achieve such good results? I've tried to train using default parameters over 50 million timesteps but did not reach anything like your video. Also, I've tried some reward shaping, as well as tuning PPO parameters but again without success.
Thank you in advance!
(video attachment: video-10.28.2020_09.45.37.mp4)
@rogerscristo I didn't tag the commit that result came from, but I remember it was one of those I obtained when I was testing sa_script.bash and sa_script.slrm on the computing cluster (for 8+ hrs). I don't think the reward has changed much (even if I tried a few variations of it, it has always been either a stepwise function, a distance, or a quadratic distance along the z axis). Originally I had 256 instead of 512 units in the first layer of the networks. I did not touch any other hyperparameter.
I don't think it should be too surprising if some of the training runs do not succeed: during that set of experiments only PPO and SAC produced "decent" policies. My general suggestion is to start a few experiments in parallel and make sure that the network capacities are appropriate for the task at hand by checking that the learning curves are somewhat stable.
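For reference, in stable-baselines3 the network capacity is set through the `policy_kwargs` constructor argument; the sizes below mirror the 512-unit first layer mentioned above, but they are only a starting point, not the exact architecture in singleagent.py:

```python
# Sketch: widen the MLP policy/value networks. This dict would be passed
# to the model constructor, e.g.:
#   model = PPO("MlpPolicy", env, policy_kwargs=policy_kwargs)
policy_kwargs = dict(
    net_arch=[512, 256],  # 512 units in the first layer, 256 in the second
)
```

Checking that the learning curves are stable before scaling up training time is usually cheaper than re-running an 8-hour experiment with the wrong capacity.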
Thank you @JacopoPan for the directions. I will try to generate new learning examples to complement the documentation.
Thanks a lot!
@rogerscristo @chris-aeviator Did you solve this problem? I also have some problems with training. I am trying to train PPO from stable_baselines3 but I don't have any good results.
When using hoverAviary with singleagent I can perfectly simulate the training of mission points via parameters like MAX_Z (max height to reach) or MAX_LIN_VEL_Z (how fast to move up).
I can see the RPM curves being perfectly smooth for my mission goal with less than 10 min of training, awesome!
Now I try to understand how to proceed from here to more complex scenarios. I introduced the concept of wind: I applied a constant Y-directed force of 0.04 at every step in BaseAviary with p.applyExternalForce. While this seems to do the job for the wind simulation when running fly.py, when training on this I cannot see any RPM variations being tried by the training; the UAV consequently "moves with the wind".
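For anyone reproducing this setup, a sketch of the constant-wind term (the helper function is mine; only `p.applyExternalForce` and its frame flags come from pybullet):

```python
WIND_FORCE_N = 0.04  # constant +y force magnitude, as described above

def wind_force(step):
    """Constant wind along the world y axis; extend with gusts if needed."""
    return [0.0, WIND_FORCE_N, 0.0]

# Inside BaseAviary's physics step it would be applied roughly as:
#   p.applyExternalForce(drone_id, -1,
#                        forceObj=wind_force(i),
#                        posObj=drone_pos,   # world coords with WORLD_FRAME
#                        flags=p.WORLD_FRAME,
#                        physicsClientId=client)
# Note the frame flag: p.LINK_FRAME applies the force in the drone's body
# frame, so a world-fixed wind should use p.WORLD_FRAME instead.
```

A body-frame wind would rotate with the drone, which can make the disturbance look like it "disappears" in training, so the frame choice is worth double-checking.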
My question is: how can I add variation to the motor RPMs in the training steps, so I will reach a wind-compensating controller as described in this example?
When running my training with ppo I get a variation in RPM, but all synced on all rotors, so no force or moment is created to counteract the wind. P.S. this toolset gives one of the best experiences of working with a UAV sim: I can manipulate my UAV, see how it flies with fly.py, train a controller and see its progress while it's training with test_singleagent.py, and all your code is well structured and documented, big thanks!
EDIT: I might have messed up the wind idea; I've added it inside BaseAviary just before the propeller forces are added.
EDIT 2: I made sure to constrain the mission by setting