opensistemas-hub / osbrain

osBrain - A general-purpose multi-agent system module written in Python
https://osbrain.readthedocs.io/en/stable/
Apache License 2.0

Socket file removal tests randomly failing on Travis #287

Closed. ocaballeror closed this issue 6 years ago.

ocaballeror commented 6 years ago

https://travis-ci.org/opensistemas-hub/osbrain/jobs/362494498 https://travis-ci.org/ocaballeror/osbrain/jobs/362021572 https://travis-ci.org/ocaballeror/osbrain/jobs/362012690

Occurrences

| Occurrences | Test |
| ----------- | ---- |
| 1 | test_agent_close_ipc_socket_agent_blocked_nameserver_shutdown |
| 1 | test_agent_close_ipc_socket_agent_crash_nameserver_shutdown |
| 3 | test_agent_close_ipc_socket_agent_kill |
Peque commented 6 years ago

Yeah, I just found another one. Added a table to keep track of occurrence counts.

For now it seems the failure is always when the agent is killed or crashes (if the agent is blocked, then the name server will actually kill it, so it is the same).

ocaballeror commented 6 years ago

I also checked my ~/.osbrain_ipc/ folder and it's full of socket files, so something is definitely wrong (and has been for a while).

Peque commented 6 years ago

@ocaballeror When were they last modified? We used to have that problem before d76a1ea88c615f45f2e1156684f02176a0a1d007.

Otherwise it might mean you can actually reproduce what we are seeing in Travis. :smile:
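
Something like this (plain stdlib, nothing osBrain-specific, and assuming the default `~/.osbrain_ipc/` directory) should tell us quickly:

```python
from datetime import datetime
from pathlib import Path

# List leftover IPC socket files, oldest first, with their modification times.
ipc_dir = Path.home() / '.osbrain_ipc'
for path in sorted(ipc_dir.iterdir(), key=lambda p: p.stat().st_mtime):
    mtime = datetime.fromtimestamp(path.stat().st_mtime)
    print(f'{mtime:%Y-%m-%d %H:%M:%S}  {path.name}')
```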

ocaballeror commented 6 years ago

I have a few from February 2-7, and the next ones were created on March 27th, so that could be a good clue about where to look for a possible bug.

ocaballeror commented 6 years ago

I just realized I can get osbrain to keep the socket files by killing the main process with ^C before it has time to shut down cleanly. That makes sense, but it's something we should address in #249 before it's merged.

That is probably the reason why I had so many leftover files on my system, and it also makes me think that the one socket file I supposedly managed to keep on purpose by running the tests over and over was actually the result of a manually killed process. That's bad news, because it means I have not been able to reproduce the error locally, and that this issue will be a pain to debug :sweat:
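
For the record, the mechanism is easy to see with plain pyzmq (not osBrain's API; the socket path below is made up for the demo): a process that binds an IPC socket and is killed before it can close it leaves the backing file on disk, simply because the cleanup code never runs.

```python
import os
import signal
import time
from multiprocessing import Process

import zmq

PATH = '/tmp/ipc_demo.sock'  # arbitrary path, just for the demo


def run():
    ctx = zmq.Context()
    sock = ctx.socket(zmq.PUSH)
    sock.bind('ipc://' + PATH)  # binding creates the socket file on disk
    time.sleep(60)              # pretend the process is busy or blocked
    sock.close()                # never reached if the process is killed
    ctx.term()


if __name__ == '__main__':
    proc = Process(target=run)
    proc.start()
    time.sleep(1)                      # give the child time to bind
    os.kill(proc.pid, signal.SIGKILL)  # hard kill, like a crash or kill -9
    proc.join()
    print('socket file left behind?', os.path.exists(PATH))  # True
```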

ocaballeror commented 6 years ago

I left the tests running overnight, with 4 simultaneous instances of pytest running agent_close_ipc_socket_agent_kill over and over on both my home PC and my VPS (a single-core machine, so 4 instances of pytest should be enough to overload the CPU), and after a total of more than 300,000 runs, not a single execution failed :sob: :sob:. I'm kinda giving up on reproducing the error, at least by conventional means.
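
In case anyone wants to repeat the experiment, the stress run was roughly this (just a sketch; the test selector and worker count are the ones mentioned above):

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

TEST = 'test_agent_close_ipc_socket_agent_kill'
WORKERS = 4


def loop(worker):
    """Run the selected test over and over until one run fails."""
    runs = 0
    while True:
        runs += 1
        result = subprocess.run(['pytest', '-x', '-k', TEST])
        if result.returncode != 0:
            return worker, runs


with ThreadPoolExecutor(max_workers=WORKERS) as pool:
    for worker, runs in pool.map(loop, range(WORKERS)):
        print(f'worker {worker} failed after {runs} runs')
```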

I also tried different values of linger, but none of them seemed to "work": the socket files were deleted when the name server was shut down, no matter how much extra linger time was set.
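
For reference, this is the kind of tweak that was tried (shown here with plain pyzmq). Per the ZMQ docs, LINGER only controls how long pending outgoing messages are kept in memory after the socket is closed, which may be why it made no difference to when the files disappear.

```python
import zmq

ctx = zmq.Context()
sock = ctx.socket(zmq.PUSH)
sock.bind('ipc:///tmp/linger_demo.sock')

# LINGER is in milliseconds and only affects undelivered outgoing messages;
# it says nothing about the IPC socket file itself.
sock.setsockopt(zmq.LINGER, 2000)
sock.close()  # pyzmq also accepts close(linger=2000) directly
ctx.term()
```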

The only modification I can think of is to add an extra check after closing the socket, to see if the file is still there, and remove it if it is. Something like (pseudo code):

import os


def remove_socket_file(path):
    # Remove the IPC socket file if closing the socket left it behind.
    if os.path.exists(path):
        os.unlink(path)


class Agent:
    ...
    def close_all(self):
        ...
        linger = get_linger()
        sock.close(linger=linger)
        # `socket_path` stands for however we resolve the file that backs
        # this IPC socket; schedule the cleanup once the linger period ends.
        self.after(linger, remove_socket_file, socket_path)
Peque commented 6 years ago

I was able to reproduce it: #293

Working on a fix...