openai / gym

A toolkit for developing and comparing reinforcement learning algorithms.
https://www.gymlibrary.dev
Other
34.78k stars 8.61k forks source link

Box2d episode 5: Segfault strikes back #143

Closed justheuristic closed 8 years ago

justheuristic commented 8 years ago

A few ticks ago there was an issue about 2 BipedalWalker-v1 in parallel causing SegFault whence the first was reset() after the second got initialized.

Than there was a fix that finally allowed us to reset several Box2D BipedalWalker-v2 's, but unfortunately we still can't play them.

The new bone of contention looks thusly:

import gym
def segfault(env_name = "BipedalWalker-v2"):
    ll0 = gym.make(env_name)
    ll0.reset()
    ll1 = gym.make(env_name)
    ll1.reset()
    ll0.reset()
    ll1.reset()
    ll1.step(ll1.action_space.sample())
    ll0.step(ll1.action_space.sample())
    print "and nothing happened"

As usual, it works okay with LunarLander, but fails with BipedalWalker

>>> segfault("LunarLander-v2")
[2016-05-30 14:42:40,877] Making new env: LunarLander-v2
libGL error: failed to open drm device: No such file or directory
libGL error: failed to load driver: i965
[2016-05-30 14:42:43,056] Making new env: LunarLander-v2
and nothing happened
>>> segfault("BipedalWalker-v2")
[2016-05-30 14:43:07,731] Making new env: BipedalWalker-v2
[2016-05-30 14:43:07,769] Making new env: BipedalWalker-v2
Segmentation fault

A single bipedal walker works nice and stable in the same setup.

jietang commented 8 years ago

@olegklimov, any ideas?

olegklimov commented 8 years ago

I'll look within a few days.

justheuristic commented 8 years ago

@olegklimov any ideas so far?

olegklimov commented 8 years ago

Right now I'm doing the other thing :( The good news is you can start several environments using that other thing, each one in its own process. Maybe this will help you, to be out soon. But I'll return to this bug anyway, it should be fixed, obviously, crashes are not good.

olegklimov commented 8 years ago

Okey, don't ask me why this fixes anything, because I don't know :) But now I can run your code, in a loop.

jietang commented 8 years ago

I don't really understand what the mechanism behind this fix is. It seems like there are some underlying issues behind how _reset() and contactListeners interact. For example it cropped up as https://github.com/openai/gym/issues/114, and it's now manifesting itself in this different way.

If we're able to repro the segfault, can we dig in and try to understand where it's coming from? Perhaps we're not supposed to directly modify the contactlistener attribute but to use the public interface on b2World instead (http://nullege.com/codes/show/src@b@l@bloodyhell-HEAD@bloodyhell@world@__init__.py/37/Box2D.b2World)?

olegklimov commented 8 years ago

Well, the fix makes life cycle of contactListener longer. The docs here says "be sure your listener remains in scope while the world object exists". It doesn't work directly this way, because we don't destroy world every reset(), but somehow still works.

And the same things don't happen in LunarLander because initially there are no contacts, so nothing happens at all.

olegklimov commented 8 years ago

And mechanism for crash is around this lines: some events enqueue towards contactListener, it gets unreferenced by world and disappears, events cause callbacks in contactListener. Sometimes it gets fancier than this, unreferenced object immediately gets replaced by other unrelated object at same memory location, callback becomes strange unrelated method call.

olegklimov commented 8 years ago

...and they store reference to contactListener in python object in your example, see self._contact_listener = ContactListener()

olegklimov commented 8 years ago

I just found CarRacing also has this bug (fixed), reproducible exactly like in OP.

gdb commented 8 years ago

Is this now closable?

olegklimov commented 8 years ago

Sure. I've observed crash go away. @justheuristic please post more bugs, please reopen this one if it breaks again.

justheuristic commented 8 years ago

Sorry for slow reaction. The bugs are gone indeed. Thank you again for fixing the issue!