Closed ocaballeror closed 6 years ago
Yeah, I just found another one. Added a table to keep track of occurrence counts.
For now it seems the failure is always when the agent is killed or crashes (if the agent is blocked, then the name server will actually kill it, so it is the same).
I also checked my ~/.osbrain_ipc/
folder and it's full of socket files, so something is definitely wrong (and has been for a while).
@ocaballeror When were they last modified? We used to have that problem before d76a1ea88c615f45f2e1156684f02176a0a1d007.
Otherwise it might mean you can actually reproduce what we are seeing in Travis. :smile:
I have a few ones from February 2-7, and the next ones were created on March 27th, so that could be a good clue of where to find a possible bug.
I just realized I can get osbrain to keep the socket files by killing the main process with ^C
before it has time to be shut down cleanly, which makes sense, but it's something we should address in #249 before it's merged.
That is probably the reason why I had so many leftover files in my system, and that also makes me think that the one socket file I supposedly got to keep on purpose by running the tests over and over was actually the result of a manually killed process. That's bad news, because it means I have not been able to reproduce the error locally, and that this issue will be a pain to debug :sweat:
I left the tests running overnight, with 4 simultaneously instances of pytest
running agent_close_ipc_socket_agent_kill
over and over in both my home PC and my VPS (single core machine, 4 instances of pytest should be enough to overload the CPU), and after a total of more than 300,000 runs, not a single execution failed :sob: :sob: . I'm kinda giving up on reproducing the error, at least by conventional means.
I also tried with different values of linger
, but none of them seemed to "work", since the socket files were deleted when the name server is shut down, no matter the extra linger time that was set.
The only modification I can think of is to add an extra check after closing the socket, to see if the file is still there, and remove it if it is. Something like (pseudo code):
def remove_socket_file(socket):
if os.path.exists("whatever the path of the socket file is"):
os.unlink("whatever the path of the socket file is")
class Agent:
...
def close_all(self):
...
sock.close(linger=get_linger())
self.after(linger, remove_socket_file, sock)
I was able to reproduce it: #293
Working on a fix...
https://travis-ci.org/opensistemas-hub/osbrain/jobs/362494498 https://travis-ci.org/ocaballeror/osbrain/jobs/362021572 https://travis-ci.org/ocaballeror/osbrain/jobs/362012690
Occurrences
test_agent_close_ipc_socket_agent_blocked_nameserver_shutdown
test_agent_close_ipc_socket_agent_crash_nameserver_shutdown
test_agent_close_ipc_socket_agent_kill