Open · rwightman opened this issue 6 years ago
Note that the vectored setup is based on the OpenAI vectored environments; I've used it in the same form with other simulators and never observed issues like this.
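For anyone unfamiliar, the kind of vectored setup I mean looks roughly like this (a minimal sketch, not my actual training code; it assumes OpenAI baselines' SubprocVecEnv and osim-rl's ProstheticsEnv):

```python
import numpy as np
from baselines.common.vec_env.subproc_vec_env import SubprocVecEnv
from osim.env import ProstheticsEnv

def make_env():
    # Runs inside each worker subprocess; every worker owns one OpenSim instance.
    return ProstheticsEnv(visualize=False)

if __name__ == '__main__':
    num_envs = 8
    vec_env = SubprocVecEnv([make_env for _ in range(num_envs)])
    obs = vec_env.reset()
    for _ in range(100):
        # Random actions just to exercise the stepping path.
        actions = np.stack([vec_env.action_space.sample() for _ in range(num_envs)])
        obs, rewards, dones, infos = vec_env.step(actions)
    vec_env.close()
```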
There were some memory leaks in OpenSim and I'm not sure whether those have been fixed yet; it might be related. @carmichaelong @chrisdembia do you know of anything that could cause these problems? @rwightman do you have some sample code that reproduces the error?
I can't think of anything off the top of my head. We would have to see the code.
@rwightman
When the model is poor and falls right from the starting point, OpenSim completes each step in less than a second.
When the model is better, however, some steps take much longer, several seconds or more. These seem to be the steps in which more objects are colliding, e.g. a leg passing through an obstacle or the ground.
On the memory issue: some of us experienced memory leaks in this setup: https://github.com/stanfordnmbl/osim-rl/issues/10
I have the same issue. There is still a leak that causes each env subprocess to grow to over a GB after running for a day or so. I installed the environment on Scientific Linux 7.3, Ubuntu 16.04, and Arch following the docs, and all of them exhibit the same issue.
Trying to work around it as in #58, i.e. destroying the subprocesses after a while, causes random segfaults. So I ended up wrapping everything in a bash script that restarts the Python script every couple of hours.
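The wrapper itself is trivial; the same idea in Python looks roughly like this (a sketch, not my actual script; train.py, its --resume flag, and the restart interval are placeholders):

```python
import subprocess
import time

RESTART_INTERVAL = 2 * 60 * 60  # restart every couple of hours (placeholder value)

while True:
    # Launch the training script; it is expected to save checkpoints periodically
    # so that a restart can resume roughly where it left off.
    proc = subprocess.Popen(['python', 'train.py', '--resume'])
    try:
        # The script exited on its own (crash or normal finish); the loop restarts it.
        proc.wait(timeout=RESTART_INTERVAL)
    except subprocess.TimeoutExpired:
        # Interval elapsed: kill the process to reclaim the leaked memory, then restart.
        proc.terminate()
        try:
            proc.wait(timeout=30)
        except subprocess.TimeoutExpired:
            proc.kill()
    time.sleep(5)
```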
Running the memleak.py from #10, I can see the process taking up more and more memory. Here is just the start of the output; it keeps growing from there:
Updating Model file from 30000 to latest format...
Loaded model gait14dof22musc_pros from file /home/fg/Projects/NIPS2018/osim-rl/osim/env/../models/gait14dof22musc_pros_20180507.osim
Model 'gait14dof22musc_pros' has subcomponents with duplicate name 'back'.
The duplicate is being renamed to 'back_0'.
Model 'gait14dof22musc_pros' has subcomponents with duplicate name 'pros_foot_r'.
The duplicate is being renamed to 'pros_foot_r_0'.
WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
episode:0 step:98 memory_uasge:129691648
episode:1 step:198 memory_uasge:130232320
episode:2 step:297 memory_uasge:131043328
episode:3 step:397 memory_uasge:131313664
episode:4 step:495 memory_uasge:131854336
episode:5 step:598 memory_uasge:133476352
episode:6 step:698 memory_uasge:133476352
episode:7 step:797 memory_uasge:133746688
episode:8 step:898 memory_uasge:134017024
episode:9 step:998 memory_uasge:141045760
episode:10 step:1098 memory_uasge:141045760
episode:11 step:1197 memory_uasge:141045760
episode:12 step:1296 memory_uasge:141045760
episode:13 step:1396 memory_uasge:141045760
episode:14 step:1496 memory_uasge:141045760
episode:15 step:1596 memory_uasge:141045760
episode:16 step:1696 memory_uasge:141045760
episode:17 step:1795 memory_uasge:141045760
episode:18 step:1896 memory_uasge:141045760
episode:19 step:1995 memory_uasge:153481216
episode:20 step:2095 memory_uasge:153481216
episode:21 step:2193 memory_uasge:153481216
episode:22 step:2295 memory_uasge:153481216
episode:23 step:2397 memory_uasge:153481216
episode:24 step:2496 memory_uasge:153481216
episode:25 step:2596 memory_uasge:153481216
episode:26 step:2697 memory_uasge:153481216
episode:27 step:2795 memory_uasge:153481216
episode:28 step:2894 memory_uasge:153481216
episode:29 step:2992 memory_uasge:153481216
episode:30 step:3092 memory_uasge:153481216
episode:31 step:3191 memory_uasge:153481216
episode:32 step:3291 memory_uasge:153481216
episode:33 step:3392 memory_uasge:153481216
episode:34 step:3493 memory_uasge:153481216
episode:35 step:3593 memory_uasge:153481216
episode:36 step:3693 memory_uasge:153481216
episode:37 step:3793 memory_uasge:153481216
episode:38 step:3894 memory_uasge:153481216
episode:39 step:3993 memory_uasge:174661632
episode:40 step:4093 memory_uasge:174661632
episode:41 step:4193 memory_uasge:174661632
episode:42 step:4292 memory_uasge:174661632
episode:43 step:4392 memory_uasge:174661632
episode:44 step:4492 memory_uasge:174661632
episode:45 step:4594 memory_uasge:174661632
episode:46 step:4692 memory_uasge:174661632
episode:47 step:4794 memory_uasge:174661632
episode:48 step:4894 memory_uasge:174661632
episode:49 step:4994 memory_uasge:174661632
episode:50 step:5093 memory_uasge:174661632
episode:51 step:5192 memory_uasge:174661632
episode:52 step:5292 memory_uasge:174661632
episode:53 step:5390 memory_uasge:174661632
episode:54 step:5489 memory_uasge:174661632
episode:55 step:5588 memory_uasge:174661632
episode:56 step:5688 memory_uasge:174661632
episode:57 step:5789 memory_uasge:174661632
episode:58 step:5889 memory_uasge:174661632
episode:59 step:5989 memory_uasge:174661632
episode:60 step:6087 memory_uasge:174661632
episode:61 step:6185 memory_uasge:174661632
episode:62 step:6286 memory_uasge:174661632
episode:63 step:6385 memory_uasge:174661632
episode:64 step:6484 memory_uasge:174661632
episode:65 step:6583 memory_uasge:174661632
episode:66 step:6683 memory_uasge:174661632
episode:67 step:6782 memory_uasge:174661632
episode:68 step:6881 memory_uasge:174661632
episode:69 step:6981 memory_uasge:174661632
episode:70 step:7081 memory_uasge:174661632
episode:71 step:7181 memory_uasge:174661632
episode:72 step:7280 memory_uasge:174661632
episode:73 step:7381 memory_uasge:174661632
episode:74 step:7481 memory_uasge:174661632
episode:75 step:7582 memory_uasge:174661632
episode:76 step:7681 memory_uasge:174661632
episode:77 step:7781 memory_uasge:174661632
episode:78 step:7881 memory_uasge:174927872
episode:79 step:7979 memory_uasge:220528640
episode:80 step:8079 memory_uasge:220528640
episode:81 step:8179 memory_uasge:220528640
episode:82 step:8277 memory_uasge:220528640
episode:83 step:8378 memory_uasge:220528640
episode:84 step:8476 memory_uasge:220528640
episode:85 step:8577 memory_uasge:220528640
episode:86 step:8676 memory_uasge:220528640
episode:87 step:8775 memory_uasge:220528640
episode:88 step:8874 memory_uasge:220528640
episode:89 step:8974 memory_uasge:220528640
episode:90 step:9074 memory_uasge:220528640
episode:91 step:9173 memory_uasge:220528640
episode:92 step:9271 memory_uasge:220528640
episode:93 step:9370 memory_uasge:220528640
episode:94 step:9469 memory_uasge:220528640
episode:95 step:9569 memory_uasge:220528640
episode:96 step:9670 memory_uasge:220528640
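For reference, the measurement behind a log like this is straightforward; here is a minimal sketch (not the exact memleak.py from #10) assuming psutil and osim-rl's ProstheticsEnv:

```python
import os
import psutil
from osim.env import ProstheticsEnv

env = ProstheticsEnv(visualize=False)
proc = psutil.Process(os.getpid())

total_steps = 0
for episode in range(100):
    env.reset()
    done = False
    while not done:
        # Random actions are enough to show resident memory creeping up.
        _, _, done, _ = env.step(env.action_space.sample())
        total_steps += 1
    print('episode:%d step:%d memory_usage:%d'
          % (episode, total_steps, proc.memory_info().rss))
```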
About the time per step - yes, it will vary a lot depending on the position/velocity of the skeleton. Especially if the model manages to push a foot through the ground, the physics solver seems to go nuts. I have had a single step take over 20 minutes; that's my timeout value, and after that the episode is aborted.
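In case it's useful to others, one way to get this kind of per-step timeout is to run the simulator in a killable worker process; a Python-level signal generally won't interrupt a step that is stuck inside OpenSim's C++ integrator. A minimal sketch (not my actual setup; the class names, restart policy, and 20-minute default are illustrative):

```python
import multiprocessing as mp

def _worker(conn):
    # Runs in its own process so the parent can kill it if a step hangs.
    from osim.env import ProstheticsEnv
    env = ProstheticsEnv(visualize=False)
    while True:
        cmd, arg = conn.recv()
        if cmd == 'reset':
            conn.send(env.reset())
        elif cmd == 'step':
            conn.send(env.step(arg))
        elif cmd == 'close':
            conn.close()
            break

class TimeoutEnv(object):
    """Proxy env that aborts the episode if a single step exceeds timeout_s."""

    def __init__(self, timeout_s=20 * 60):
        self.timeout_s = timeout_s
        self._start()

    def _start(self):
        self.conn, child_conn = mp.Pipe()
        self.proc = mp.Process(target=_worker, args=(child_conn,), daemon=True)
        self.proc.start()

    def reset(self):
        self.conn.send(('reset', None))
        return self.conn.recv()

    def step(self, action):
        self.conn.send(('step', action))
        if self.conn.poll(self.timeout_s):
            return self.conn.recv()
        # The step hung inside the simulator: kill the worker, restart it,
        # and report the episode as done so training can move on.
        self.proc.terminate()
        self.proc.join()
        self._start()
        return None, 0.0, True, {'timeout': True}

    def close(self):
        self.conn.send(('close', None))
        self.proc.join()
```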
Free idea: reward shaping with a penalty for slow steps, so that the model learns not to get stuck with its legs deep in the ground.
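As a sketch of what that could look like: a thin wrapper that subtracts a penalty proportional to wall-clock step time (the threshold, scale, and cap below are arbitrary placeholders, not tuned values):

```python
import time

class StepTimePenalty(object):
    """Wrap an env and subtract a small penalty when a step takes unusually long."""

    def __init__(self, env, threshold_s=1.0, scale=0.1, cap=5.0):
        self.env = env
        self.threshold_s = threshold_s
        self.scale = scale
        self.cap = cap

    def __getattr__(self, name):
        # Delegate everything else (reset, action_space, ...) to the wrapped env.
        return getattr(self.env, name)

    def step(self, action):
        start = time.time()
        obs, reward, done, info = self.env.step(action)
        elapsed = time.time() - start
        if elapsed > self.threshold_s:
            # Long wall-clock steps correlate with bad contact states
            # (e.g. a foot pushed into the ground), so discourage them.
            reward -= min(self.scale * (elapsed - self.threshold_s), self.cap)
        return obs, reward, done, info
```

One caveat: wall-clock time is machine-dependent and noisy, so penalizing a more direct proxy for the bad state (e.g. foot depth below the ground plane taken from the state description) may give a cleaner shaping signal.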
Thanks, some helpful comments. To summarize:
A high degree of variance in step times is expected, depending on what's happening in the sim. Like mattias, I have also observed some steps taking multiple minutes, which threw me for a loop because it was unexpected.
There is a memory leak that is not related to the step-time variation, but I've likely observed it in combination with environments that are stepping (slowly) through more complex interactions.
> About the time per step - yes, it will vary a lot depending on the position/velocity of the skeleton. Especially if the model manages to push a foot through the ground, the physics solver seems to go nuts. I have had a single step take over 20 minutes; that's my timeout value, and after that the episode is aborted.
Just to complement this answer: the numerical integration slows down when large forces are applied. For the integration to remain accurate, the solver must take small steps to satisfy the accuracy tolerance. This can happen if we apply large torques at the joints. If the model contains muscles, it can happen when the model is in a bad configuration (e.g., knee over-extension). Also, if the model contains passive forces that enforce joint range-of-motion limits, it is possible that the commanded actions push against those limits, and as a consequence large passive forces are applied. A quick solution is to monitor the applied forces or the passive forces; if their values get large, you can terminate the simulation, since it will probably result in non-physiological movements anyway.
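A rough illustration of that monitoring idea (a sketch only; it assumes the osim-rl state description from get_state_desc() exposes force values under a 'forces' key, and the 5000 N threshold is arbitrary):

```python
import numpy as np
from osim.env import ProstheticsEnv

FORCE_LIMIT = 5000.0  # arbitrary threshold for this illustration

env = ProstheticsEnv(visualize=False)
env.reset()
done = False
while not done:
    _, reward, done, _ = env.step(env.action_space.sample())
    state = env.get_state_desc()
    # 'forces' maps each force element (muscles, contacts, limit forces, ...)
    # to a list of scalar values; bail out early if any of them blows up.
    max_force = max(np.max(np.abs(v)) for v in state['forces'].values() if len(v) > 0)
    if max_force > FORCE_LIMIT:
        print('Aborting episode: force magnitude %.1f exceeds limit' % max_force)
        break
```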
[I don't know the current state of the code and this describes the situation in Oct 2018.]
Your suggestion wouldn't have been possible in our use case, because the model was defined by someone else and we trained an RL agent that tried to find the best motion solution. Terminating the simulation as suggested would (probably) have produced an agent incapable of handling certain ranges of motion when trialed later. Adding penalties during training resulted in worse performance (for our agent). However, lowering the timeout value was helpful, since it usually didn't matter whether a step took 10 or 20 minutes to time out.
Again, the worst cases were when the model (a human skeleton) managed to push its feet through the static ground. This could happen after a fall (a long step/jump), or sometimes through muscle forces pushing straight down.
I've been noticing some strange behaviour running OpenSim in a synchronous vectored environment.
Initially, the 8-16 environments step evenly, with roughly equal CPU utilization and step times across the environments.
After running a training session like this for some time, usually a significant fraction of a day, once trajectories start getting longer, a few of the processes start to lag significantly on a per-step basis. There are usually one or two 'problem' simulator processes that consistently take longer than all the others per step. They completely obliterate throughput and drag the average step FPS down significantly.
I'm in the process of instrumenting the OpenSim child-process side of this, but I'm curious whether there are trajectories/interactions in the simulator where this slowdown is expected. I've also noticed that the simulator processes exhibiting this behaviour have higher memory consumption than the ones that are still stepping 'normally', and they appear to remain problematic as training continues.
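The instrumentation I have in mind is basically per-step wall-clock time plus resident memory, logged from inside each worker; roughly this (a sketch assuming psutil, not the actual wrapper I'm running):

```python
import os
import time
import psutil

class StepInstrumentation(object):
    """Log slow steps and resident memory from inside a simulator worker."""

    def __init__(self, env, env_id=0, slow_step_s=5.0):
        self.env = env
        self.env_id = env_id
        self.slow_step_s = slow_step_s
        self._proc = psutil.Process(os.getpid())

    def __getattr__(self, name):
        # Delegate reset(), action_space, etc. to the wrapped env.
        return getattr(self.env, name)

    def step(self, action):
        start = time.time()
        result = self.env.step(action)
        elapsed = time.time() - start
        if elapsed > self.slow_step_s:
            rss_mb = self._proc.memory_info().rss / (1024.0 * 1024.0)
            print('[env %d] slow step: %.1fs rss: %.0f MB'
                  % (self.env_id, elapsed, rss_mb))
        return result
```

Wrapping each env factory with this before handing it to the vectored runner should show whether the lagging workers are also the ones whose resident memory keeps growing.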