stanfordnmbl / osim-rl

Reinforcement learning environments with musculoskeletal models
http://osim-rl.stanford.edu/
MIT License

Actions not being clipped by OpenSim or osim-rl (redux) #178

Closed: rwightman closed this issue 5 years ago

rwightman commented 5 years ago

So, this seems kinda silly. I spent a decent amount of time investigating this and only then found #64. However, the clipping applied in #64 is not on the branch used in the latest competition. Why didn't those changes make it into the latest code? The problem still exists and the impact is significant.

OpenSim (still) accepts values outside the range [0, 1.0]. Most continuous RL setups output values outside that range, since the general assumption for commonly used environments is that the environment handles whatever clipping is needed.

I came across this issue in the first place while experimenting with a different implementation of a similar algorithm that did actually clip the actions to the specified action space. The clipping is done completely outside of my RL model, so it isn't impacting any gradients or other aspects of the algorithm. However, clipping makes the learning progress significantly different. I confirmed this by clipping in my original setup, which then matched my new experiment. Poking around the sim further, the behaviour is definitely different: with actuations outside the specified [0, 1] range, the distribution of the returned activations and forces is noticeably different, even in the early steps of a sim.
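
To be clear, the clipping I'm describing is just a bound applied right before the step call, something like this sketch (assuming a gym-style `action_space`; the helper name is mine, not osim-rl API):

```python
import numpy as np

def clipped_step(env, raw_action):
    # Clip outside the RL model: gradients and log-probs are still
    # computed on raw_action; only the value handed to the simulator
    # is bounded to the declared action space.
    action = np.clip(raw_action, env.action_space.low, env.action_space.high)
    return env.step(action)
```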

AdamStelmaszczyk commented 5 years ago

@rwightman is right:

https://github.com/stanfordnmbl/osim-rl/blob/master/osim/env/osim.py#L96
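
That line applies the raw action with no bounds check. A minimal guard, as a sketch (assuming the clip belongs at the top of `actuate`, before the values are written into the OpenSim controller functions):

```python
import numpy as np

def clip_excitations(action):
    # Hypothetical guard for the top of actuate() in osim/env/osim.py:
    # bound the excitations before they reach the OpenSim controller.
    return np.clip(np.asarray(action, dtype=float), 0.0, 1.0)
```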

kidzik commented 5 years ago

@carmichaelong do you know if this has changed in OpenSim 4.0? Or what's the current expected behavior?

carmichaelong commented 5 years ago

Depending on the timing of the OpenSim distribution used for last year's competition, there's a chance something might have changed; if anything, the behavior should be STRICTER now. The most relevant PR I can think of is from over a year ago: https://github.com/opensim-org/opensim-core/pull/1548

The expected behavior after this PR: if a muscle receives an excitation outside of its min/max control range (which should be set to 0/1 in the model file), an exception is thrown. So the user (in this case, osim-rl) should clip or check the values beforehand. There still seems to be a problem on the OpenSim side if, indeed, no exception is thrown for an out-of-bounds excitation value.
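
Until that's sorted out upstream, a strict user-side check in the spirit of what the PR intended could look like this sketch (the function name and signature are illustrative, not OpenSim or osim-rl API):

```python
import numpy as np

def check_excitations(action, low=0.0, high=1.0):
    # Fail loudly, mirroring the exception OpenSim is supposed to throw
    # for excitations outside a muscle's min/max control range.
    action = np.asarray(action, dtype=float)
    if np.any(action < low) or np.any(action > high):
        raise ValueError(
            "excitation(s) outside [%g, %g]: %r" % (low, high, action))
    return action
```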

rwightman commented 5 years ago

Unless exceptions are being caught and ignored somewhere, I don't see this behaviour. Passing unclipped actuations to OpenSim does not raise an exception, but it does result in different behaviour: I see a very different distribution of activations depending on whether my actuations are clipped or unclipped.

With unclipped actuations I see a full range of activations in the early steps of the simulation, with values from 0.01 to 1.0. With clipped actuations, the resulting activations fall roughly in the 0.15-0.2 to 0.75-0.8 range. The resulting motions and learning progress are also very different. I'm clipping right before the environment step, outside of any use in the learning algorithm, so if the simulator were handling the clipping internally, the clipping should not impact the learning itself whatsoever.
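
The comparison amounts to something like the following sketch (the `obs['muscles']` layout is my assumption about the dict observation; adjust the key for the env version in use):

```python
import numpy as np

def activation_range(env, policy, clip, steps=50):
    # Roll out a short episode and summarize the muscle activations the
    # simulator returns, with or without clipping the actions first.
    obs = env.reset()
    activations = []
    for _ in range(steps):
        action = np.asarray(policy(obs))
        if clip:
            action = np.clip(action, 0.0, 1.0)
        obs, reward, done, info = env.step(action)
        activations.append([m['activation'] for m in obs['muscles'].values()])
        if done:
            break
    activations = np.asarray(activations)
    return activations.min(), activations.mean(), activations.max()
```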

rwightman commented 5 years ago

Also, I looked at opensim-org/opensim-core#1548 for a bit and didn't notice any actual check that throws when the input violates min/maxControl.

carmichaelong commented 5 years ago

@rwightman Looks like you're right. I searched the current master branch for anything like this too and didn't find anything. It did jog my memory, though: apparently I opened an issue about this a while ago: https://github.com/opensim-org/opensim-core/issues/2035

I'll try to bump the issue and possibly add some minimal reproducible examples to highlight the issue.

rwightman commented 5 years ago

I fiddled with this a bit more in some free time.

An RL policy trained without clipping behaves differently at eval time if clipping is later applied. Starting from a decent policy, where the agent walks straight ahead and doesn't fall over within the time limit, the clipped version still manages to remain upright, but progress is impeded, and the speed and direction of the forward progress differ.

This confirms that the simulator produces different behaviour depending on whether the inputs are clipped, and it will impact the challenge if the clipping behaviour in training and evaluation is not consistent.
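
One way to guarantee consistency is to bake the clip into a wrapper used for both training and evaluation, something like this sketch (a plain wrapper of my own, not part of osim-rl):

```python
import numpy as np

class ClipActionWrapper:
    # Apply identical clipping at train and eval time so the simulator
    # never sees two different input distributions.
    def __init__(self, env, low=0.0, high=1.0):
        self.env = env
        self.low, self.high = low, high

    def step(self, action):
        return self.env.step(np.clip(action, self.low, self.high))

    def __getattr__(self, name):
        # Delegate everything else (reset, action_space, ...) to the env.
        return getattr(self.env, name)
```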

Poddiachyi commented 5 years ago

@rwightman @carmichaelong Have you managed to fix it?

I have the same issue. I tried to fix it using #65, but no luck.

Training gets faster (episodes somehow run faster), but the reward is much worse than without clipping. When I tried it without clipping, the agent did learn something and kept getting better, but with clipping it's kind of stuck (or even getting worse).

rwightman commented 5 years ago

@Poddiachyi I've observed the same thing. The reward curve rises much faster and more consistently (with PPO, at least) when the actions aren't clipped. It'd be interesting to know how the unclipped actions affect the simulator to make this the case.

I have some experiments running that compare a naively clipped Gaussian against a beta distribution. The beta appears to be winning, but both are gaining reward much more slowly, so it's too early to draw a conclusion.
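
Roughly, the two action heads differ like this (a toy numpy sketch; in the real experiments the distribution parameters come from the policy network, and n_act is just a placeholder for the env's action dimension):

```python
import numpy as np

rng = np.random.default_rng(0)
n_act = 22  # placeholder action dimension; depends on the env version

# Clipped Gaussian: samples live on all of R and then get squashed onto
# the bounds, so probability mass piles up at exactly 0.0 and 1.0.
mu, sigma = 0.5, 0.4
gauss_action = np.clip(rng.normal(mu, sigma, size=n_act), 0.0, 1.0)

# Beta: the support is already [0, 1], so no clipping is needed and the
# log-prob of the executed action stays exact.
alpha, beta = 2.0, 3.0
beta_action = rng.beta(alpha, beta, size=n_act)
```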

mattiasljungstrom commented 5 years ago

For what it's worth, I've trained my agent with clipped actions. Maybe your unclipped version leads to better exploration, and therefore faster learning?

rwightman commented 5 years ago

@mattiasljungstrom yeah, I'm not saying it's impossible to train an agent with clipped actions, it's just taking much longer :) I have a few clipping experiments running, but none of them has made fast enough progress to match my best unclipped results yet. It looks like it'll take a few more days at this rate.

RchalYang commented 5 years ago

I think adding clipping to the environment is a huge change for this challenge, and it's kind of weird that the environment changes just as the challenge is closing. I spent hours figuring out why my previous model couldn't reach the same performance it did yesterday. I hope others spend less time on this problem than I did.