opensistemas-hub / osbrain

osBrain - A general-purpose multi-agent system module written in Python
https://osbrain.readthedocs.io/en/stable/
Apache License 2.0

Nameserver crashes unexpectedly #318

Open ocaballeror opened 5 years ago

ocaballeror commented 5 years ago

The nameserver crashed on shutdown and I could not restart it because it was left hanging, waiting for a rogue agent to shut down, which apparently is the expected behavior.

Surprisingly enough the error message shown was:

TimeoutError: Chances are [] were not shutdown after 10.0 s!

So it would appear that the agent was still alive after the call to async_kill_agents, but it effectively died in the milliseconds between us checking whether it was alive and the TimeoutError being raised just after that. I find this very strange, especially considering that we set a default timeout of 10 seconds, which should be plenty for any kind of agent to shut down.

It probably has something to do with the agent being unresponsive and having broken the connection between it and the nameserver, but it's hard to know for sure until we can get a reproducible case.
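
To make the suspected race concrete, here is a minimal sketch of how a check-then-report shutdown loop can end up printing an empty list. This is not osBrain's actual implementation: FakeRegistry, racy_shutdown and snapshot_shutdown are hypothetical names invented for illustration, and the agent's "late death" is simulated by counting queries.

```python
import time


class FakeRegistry:
    """Hypothetical stand-in for the name server's list of agents."""

    def __init__(self, agents, dies_on_check):
        self._agents = list(agents)
        self._checks = 0
        self._dies_on_check = dies_on_check

    def alive(self):
        # Simulate an agent that dies right before the N-th query.
        self._checks += 1
        if self._checks >= self._dies_on_check:
            self._agents = []
        return list(self._agents)


def racy_shutdown(registry, timeout):
    """Check liveness, then query again to build the error message."""
    deadline = time.monotonic() + timeout
    while registry.alive():  # first query: agents still alive
        if time.monotonic() >= deadline:
            # Second, separate query: the state may have changed meanwhile.
            raise TimeoutError(
                'Chances are %s were not shutdown after %s s!'
                % (registry.alive(), timeout)
            )
        time.sleep(0.01)


def snapshot_shutdown(registry, timeout):
    """Take one snapshot per iteration and report exactly that snapshot."""
    deadline = time.monotonic() + timeout
    while True:
        remaining = registry.alive()
        if not remaining:
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(
                'Chances are %s were not shutdown after %s s!'
                % (remaining, timeout)
            )
        time.sleep(0.01)
```

With an agent that disappears between the first and second query, racy_shutdown raises a TimeoutError whose message lists no agents, while snapshot_shutdown names the agent it actually saw. If the real code follows the racy pattern, reusing a single snapshot would at least make the error message consistent with the check that triggered it.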

ocaballeror commented 5 years ago

We'll have to experiment by making the agents crash in different ways until we find a situation that we can reproduce.

Peque commented 5 years ago

Maybe unrelated, but I was able to reproduce a similar crash (except that the list of agents was not empty) in my pypy branch with:

tox -e pypy3 -- -xsv -k close_ipc_socket_agent_blocked

ocaballeror commented 5 years ago

To be fair, there are quite a few things that don't seem to work with pypy, so I'm not sure if this counts as "reproducing" the error.

My guess from a few minutes of running this is that pypy must handle threads in a different way than what we are used to. The ContextTerminated errors that pop up when running this test certainly look like the context is being terminated before we expected.

What is happening on pypy reminds me of this other test I wrote when I first tried to reproduce the error. The agent ends up in a very wrong state, and the output looks kind of similar:

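import time

from osbrain import run_agent, run_nameserver
from osbrain.helper import agent_dies


def test_agent_break():
    def break_internals(agent):
        # Terminate the agent's ZeroMQ context from inside the agent
        agent._context.term()

    ns = run_nameserver()
    agent = run_agent('agent')
    agent.set_method(break_internals)
    # Schedule the breakage to run right away inside the agent
    agent.after(0, 'break_internals')
    time.sleep(.1)
    ns.shutdown(3)

    assert agent_dies('agent', ns)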

I still haven't found a way to reproduce the original "Chances are [] were not shutdown" error. There could be many factors involved, but what exactly happened is still beyond me.