stanfordnmbl / osim-rl

Reinforcement learning environments with musculoskeletal models
http://osim-rl.stanford.edu/
MIT License

Simulator performance degrades over time / drastically uneven step times #176

Open rwightman opened 5 years ago

rwightman commented 5 years ago

I've been noticing some strange behaviour running OpenSim in a synchronous vectored environment.

Initially, the 8-16 environments step evenly, with roughly equal CPU utilization and roughly equal step times across the environments.

After running a training session like this for some time, usually after a significant fraction of a day once trajectories start getting longer, I notice a few of the processes starting to lag significantly on a per-step basis. There will usually be one or two 'problem' simulator processes that consistently take longer than all the others for a step. They completely obliterate performance and drag the average step FPS down significantly.

I'm in the process of instrumenting the OpenSim child process side of this, but curious if there are trajectories/interactions in the simulator where this slowdown is expected? I've also noticed that the simulator processes that start exhibiting this behaviour have higher memory consumption than the ones that are still stepping 'normally', and they remain problematic as training continues.
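(For illustration, a minimal sketch of this kind of per-step timing instrumentation, assuming a gym-style wrapper applied to each worker environment before it enters the vectored setup; the 5-second threshold is arbitrary.)

```python
import time
import gym


class StepTimer(gym.Wrapper):
    """Illustrative sketch: log the wall-clock time of each env.step() call."""

    def __init__(self, env, warn_threshold=5.0):
        super().__init__(env)
        self.warn_threshold = warn_threshold  # seconds before a step is flagged as slow
        self.step_times = []

    def step(self, action):
        start = time.time()
        result = self.env.step(action)
        elapsed = time.time() - start
        self.step_times.append(elapsed)
        if elapsed > self.warn_threshold:
            print("slow step: %.1fs (step %d in this process)"
                  % (elapsed, len(self.step_times)))
        return result
```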

rwightman commented 5 years ago

Note the vectored setup is based on the OpenAI vectorized environments and has been used unchanged with other simulators, where no such issues were observed.

kidzik commented 5 years ago

There were some memory leaks in OpenSim and I'm not sure whether they have been fixed yet. It might be related. @carmichaelong @chrisdembia do you know anything that can cause these problems? @rwightman do you have some sample code that reproduces the error?

chrisdembia commented 5 years ago

I can't think of anything off the top of my head. We would have to see the code.

AdamStelmaszczyk commented 5 years ago

@rwightman

When I have a poor model that falls right at the starting point, OpenSim completes each step in less than a second.

However, when the model is better, some steps can take much longer, more than several seconds. These seem to be the steps in which more objects are colliding, e.g. one leg going inside an obstacle or the ground.

On the memory, some of us experienced memory leaks in this setup: https://github.com/stanfordnmbl/osim-rl/issues/10

ThGravo commented 5 years ago

I have the same issue. There still is a leak causing the env to take up over a GB in each subprocess after running for a day or so. I installed the environment on Scientific Linux 7.3, Ubuntu 16.04 and Arch following the docs and they all exhibit the same issue.

Trying to work around it similarly to #58, i.e. destroying the subprocesses after a while, causes random segfaults. So I ended up wrapping it in a bash script that restarts the Python script every couple of hours.
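In case it's useful, here is a rough Python equivalent of that bash wrapper. The script name train.py and the three-hour interval are placeholders, and the training script has to resume from its own checkpoints for this to make sense:

```python
import subprocess
import time

RESTART_INTERVAL = 3 * 60 * 60  # seconds between forced restarts (placeholder)

while True:
    # "train.py" is a placeholder for the actual training script.
    proc = subprocess.Popen(["python", "train.py"])
    try:
        proc.wait(timeout=RESTART_INTERVAL)
        break  # the script finished on its own
    except subprocess.TimeoutExpired:
        proc.terminate()   # kill it so the leaked memory is reclaimed by the OS
        proc.wait()
        time.sleep(5)      # give the OS a moment to reap children before restarting
```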

Running the memleak.py from #10, I can see memory usage steadily growing. Here is just the start of the output; usage keeps climbing beyond this:

Updating Model file from 30000 to latest format...
Loaded model gait14dof22musc_pros from file /home/fg/Projects/NIPS2018/osim-rl/osim/env/../models/gait14dof22musc_pros_20180507.osim
Model 'gait14dof22musc_pros' has subcomponents with duplicate name 'back'.
The duplicate is being renamed to 'back_0'.
Model 'gait14dof22musc_pros' has subcomponents with duplicate name 'pros_foot_r'.
The duplicate is being renamed to 'pros_foot_r_0'.
WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
episode:0  step:98   memory_uasge:129691648
episode:1  step:198   memory_uasge:130232320
episode:2  step:297   memory_uasge:131043328
episode:3  step:397   memory_uasge:131313664
episode:4  step:495   memory_uasge:131854336
episode:5  step:598   memory_uasge:133476352
episode:6  step:698   memory_uasge:133476352
episode:7  step:797   memory_uasge:133746688
episode:8  step:898   memory_uasge:134017024
episode:9  step:998   memory_uasge:141045760
episode:10  step:1098   memory_uasge:141045760
episode:11  step:1197   memory_uasge:141045760
episode:12  step:1296   memory_uasge:141045760
episode:13  step:1396   memory_uasge:141045760
episode:14  step:1496   memory_uasge:141045760
episode:15  step:1596   memory_uasge:141045760
episode:16  step:1696   memory_uasge:141045760
episode:17  step:1795   memory_uasge:141045760
episode:18  step:1896   memory_uasge:141045760
episode:19  step:1995   memory_uasge:153481216
episode:20  step:2095   memory_uasge:153481216
episode:21  step:2193   memory_uasge:153481216
episode:22  step:2295   memory_uasge:153481216
episode:23  step:2397   memory_uasge:153481216
episode:24  step:2496   memory_uasge:153481216
episode:25  step:2596   memory_uasge:153481216
episode:26  step:2697   memory_uasge:153481216
episode:27  step:2795   memory_uasge:153481216
episode:28  step:2894   memory_uasge:153481216
episode:29  step:2992   memory_uasge:153481216
episode:30  step:3092   memory_uasge:153481216
episode:31  step:3191   memory_uasge:153481216
episode:32  step:3291   memory_uasge:153481216
episode:33  step:3392   memory_uasge:153481216
episode:34  step:3493   memory_uasge:153481216
episode:35  step:3593   memory_uasge:153481216
episode:36  step:3693   memory_uasge:153481216
episode:37  step:3793   memory_uasge:153481216
episode:38  step:3894   memory_uasge:153481216
episode:39  step:3993   memory_uasge:174661632
episode:40  step:4093   memory_uasge:174661632
episode:41  step:4193   memory_uasge:174661632
episode:42  step:4292   memory_uasge:174661632
episode:43  step:4392   memory_uasge:174661632
episode:44  step:4492   memory_uasge:174661632
episode:45  step:4594   memory_uasge:174661632
episode:46  step:4692   memory_uasge:174661632
episode:47  step:4794   memory_uasge:174661632
episode:48  step:4894   memory_uasge:174661632
episode:49  step:4994   memory_uasge:174661632
episode:50  step:5093   memory_uasge:174661632
episode:51  step:5192   memory_uasge:174661632
episode:52  step:5292   memory_uasge:174661632
episode:53  step:5390   memory_uasge:174661632
episode:54  step:5489   memory_uasge:174661632
episode:55  step:5588   memory_uasge:174661632
episode:56  step:5688   memory_uasge:174661632
episode:57  step:5789   memory_uasge:174661632
episode:58  step:5889   memory_uasge:174661632
episode:59  step:5989   memory_uasge:174661632
episode:60  step:6087   memory_uasge:174661632
episode:61  step:6185   memory_uasge:174661632
episode:62  step:6286   memory_uasge:174661632
episode:63  step:6385   memory_uasge:174661632
episode:64  step:6484   memory_uasge:174661632
episode:65  step:6583   memory_uasge:174661632
episode:66  step:6683   memory_uasge:174661632
episode:67  step:6782   memory_uasge:174661632
episode:68  step:6881   memory_uasge:174661632
episode:69  step:6981   memory_uasge:174661632
episode:70  step:7081   memory_uasge:174661632
episode:71  step:7181   memory_uasge:174661632
episode:72  step:7280   memory_uasge:174661632
episode:73  step:7381   memory_uasge:174661632
episode:74  step:7481   memory_uasge:174661632
episode:75  step:7582   memory_uasge:174661632
episode:76  step:7681   memory_uasge:174661632
episode:77  step:7781   memory_uasge:174661632
episode:78  step:7881   memory_uasge:174927872
episode:79  step:7979   memory_uasge:220528640
episode:80  step:8079   memory_uasge:220528640
episode:81  step:8179   memory_uasge:220528640
episode:82  step:8277   memory_uasge:220528640
episode:83  step:8378   memory_uasge:220528640
episode:84  step:8476   memory_uasge:220528640
episode:85  step:8577   memory_uasge:220528640
episode:86  step:8676   memory_uasge:220528640
episode:87  step:8775   memory_uasge:220528640
episode:88  step:8874   memory_uasge:220528640
episode:89  step:8974   memory_uasge:220528640
episode:90  step:9074   memory_uasge:220528640
episode:91  step:9173   memory_uasge:220528640
episode:92  step:9271   memory_uasge:220528640
episode:93  step:9370   memory_uasge:220528640
episode:94  step:9469   memory_uasge:220528640
episode:95  step:9569   memory_uasge:220528640
episode:96  step:9670   memory_uasge:220528640
mattiasljungstrom commented 5 years ago

About the time per step - yes, it will vary a lot depending on the position/velocity of the skeleton. Especially if the model manages to push a foot through the ground, the physics solver seems to go haywire. I have had a single step take over 20 minutes; that's my timeout value, and after that the episode is aborted.
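For anyone implementing a similar timeout, a hedged sketch of one way to do it: run the simulator in a child process and give up when a single step runs too long. make_env is a placeholder for whatever constructs the osim-rl environment, and the 60-second limit is arbitrary. A same-process Python timeout generally cannot interrupt a step that is blocked inside the C++ solver, which is why the environment lives in its own process here.

```python
import multiprocessing as mp

STEP_TIMEOUT = 60.0  # seconds; arbitrary, pick what your training can tolerate


def _worker(conn, make_env):
    # The simulator runs in its own process so a stuck step can be killed
    # from the outside.
    env = make_env()
    conn.send(env.reset())
    while True:
        action = conn.recv()
        conn.send(env.step(action))


def step_with_timeout(parent_conn, worker_proc, action):
    parent_conn.send(action)
    if parent_conn.poll(STEP_TIMEOUT):
        return parent_conn.recv()  # (obs, reward, done, info)
    worker_proc.terminate()        # abort the episode; caller must rebuild the env
    raise TimeoutError("simulator step exceeded %.0f s" % STEP_TIMEOUT)


# Usage sketch (make_env must be a picklable, module-level factory):
#   parent_conn, child_conn = mp.Pipe()
#   proc = mp.Process(target=_worker, args=(child_conn, make_env))
#   proc.start()
#   obs = parent_conn.recv()
```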

AdamStelmaszczyk commented 5 years ago

Free idea.

Penalize longer steps via reward shaping, so that the model learns to avoid getting stuck with its legs deep in the ground.
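A minimal sketch of that idea, assuming a gym-style wrapper around the environment; the threshold and penalty coefficient are made up and would need tuning:

```python
import time
import gym


class SlowStepPenalty(gym.Wrapper):
    """Sketch of the reward-shaping idea: subtract a penalty when a simulator
    step takes a long time (a proxy for hard contact / limbs in the ground)."""

    def __init__(self, env, threshold=1.0, coeff=0.1):
        super().__init__(env)
        self.threshold = threshold  # seconds of wall-clock time considered "normal"
        self.coeff = coeff          # penalty per second above the threshold

    def step(self, action):
        start = time.time()
        obs, reward, done, info = self.env.step(action)
        elapsed = time.time() - start
        if elapsed > self.threshold:
            reward -= self.coeff * (elapsed - self.threshold)
        return obs, reward, done, info
```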

rwightman commented 5 years ago

Thanks, some helpful comments. So to summarize:

  1. A high degree of variance in step times is expected, depending on what's happening in the sim. Like mattias, I have also observed some steps taking multiple minutes. That threw me for a loop because it was unexpected.

  2. There is a memory leak that is not related to step variation, but I've likely observed it in combination with the environments that are stepping (slowly) through more complex interactions.

mitkof6 commented 4 years ago

> About the time per step - yes, it will vary a lot depending on the position/velocity of the skeleton. Especially if the model manages to push a foot through the ground, the physics solver seems to go haywire. I have had a single step take over 20 minutes; that's my timeout value, and after that the episode is aborted.

Just to complement this answer: the numerical integration slows down when large forces are applied. For the integration to be accurate, the solver must take small steps to satisfy the accuracy tolerance. This can happen if we apply large torques at the joints. If the model contains muscles, it can happen when the model is in a bad configuration (e.g., knee over-extension). Also, if the model contains passive forces that limit the range of motion, our commands may try to violate those limits, and as a consequence large passive forces are applied. A quick solution is to monitor the applied forces or the passive forces; if their value is large, you can terminate the simulation, since it will probably result in the model performing non-physiological movements.
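A rough sketch of that monitoring idea, assuming the environment exposes a state-description dict with a 'forces' entry (osim-rl's get_state_desc() provides something along those lines); the force names and the 10 kN cutoff are purely illustrative:

```python
import numpy as np

FORCE_LIMIT = 1e4  # N; arbitrary cutoff for "probably non-physiological"


def forces_too_large(env):
    # Assumes env.get_state_desc() returns a dict with a 'forces' entry
    # mapping force names to lists of values; adapt to your env version.
    state_desc = env.get_state_desc()
    for name, values in state_desc.get("forces", {}).items():
        if np.max(np.abs(values)) > FORCE_LIMIT:
            return True
    return False


# Inside the rollout loop:
#   obs, reward, done, info = env.step(action)
#   if forces_too_large(env):
#       done = True  # abort the episode before the solver grinds to a halt
```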

mattiasljungstrom commented 4 years ago

[I don't know the current state of the code and this describes the situation in Oct 2018.]

Your suggestion wouldn't have been possible in our use case, because the model was defined by someone else and we trained an RL agent that tried to find the best motion solution. Terminating our training would (probably) have created an agent incapable of handling certain ranges of motion when trialed later. Adding penalties to training resulted in worse performance (in our agent). However, lowering the timeout value was helpful, since it usually didn't matter whether a step took 10 or 20 minutes to time out.

Again, the worst cases were when the model (a human skeleton) managed to push its feet through the static ground. This could happen after a fall (a long step/jump), or sometimes through muscle forces pushing straight down.