Closed by MarkKoz 1 year ago
I can reproduce this consistently by running
    while True:
        with run_gunicorn():
            ...
and waiting a bit. It seems like the with run_gunicorn() block can finish before the actual process is dead, so a new process can be created while the port is still in use by the previous one, which causes an error.
run_gunicorn() terminates the process here:
https://github.com/python-discord/snekbox/blob/9804a10a598b678225d299178113210f74a25392/tests/gunicorn_utils.py#L80
Adding a proc.join() after this seems to fix the issue.
I also tried proc.join with a timeout, but afterwards I couldn't reliably check whether it had timed out or the process had actually finished terminating. The docs suggest that you can check exitcode, but in some cases this indicated the process hadn't finished even though proc.join() returned before the timeout duration.
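For reference, the pattern the multiprocessing docs describe (join with a timeout, then inspect exitcode) could be sketched as below. The helper name terminate_and_wait and the choice to raise TimeoutError are assumptions for illustration, not code from snekbox, and as noted above the exitcode check did not behave reliably in this particular setup.

```python
import multiprocessing


def terminate_and_wait(proc: multiprocessing.Process, timeout: float) -> None:
    """Terminate proc and raise if it has not exited within timeout seconds.

    Hypothetical helper: the name and the TimeoutError are illustrative only.
    """
    proc.terminate()
    # join() returns None both when the process exited and when the timeout
    # elapsed, so its return value cannot tell the two cases apart.
    proc.join(timeout)
    # Per the docs, exitcode is None while the process has not yet terminated.
    if proc.exitcode is None:
        raise TimeoutError(f"process did not exit within {timeout} seconds")
```

Raising here would make a hung shutdown fail loudly in the test rather than surfacing later as a port-in-use error.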
If we do want a timeout (to raise an error if it hangs while trying to terminate), I'm not sure of a good way of implementing it. A couple of somewhat hacky ideas for waiting for the process to exit with a timeout:
- Checking is_alive() or exitcode on the process in a loop
- Calling proc.join() and working out if it timed out by checking how long it took using time.time()
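The two ideas above might look roughly like this. Both function names, the polling interval, and the boolean return convention are assumptions made for this sketch; it also uses time.monotonic() instead of time.time() so that system clock adjustments cannot skew the measurement.

```python
import time


def wait_by_polling(proc, timeout: float, interval: float = 0.05) -> bool:
    """Poll is_alive() until proc exits; return False if the deadline passes first."""
    deadline = time.monotonic() + timeout
    while proc.is_alive():
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
    return True


def wait_by_timing_join(proc, timeout: float) -> bool:
    """Call join(timeout) and infer from the elapsed time whether it timed out."""
    start = time.monotonic()
    proc.join(timeout)
    # If join() returned before the full timeout elapsed, assume the process
    # exited. A process that exits right at the deadline is misreported.
    return time.monotonic() - start < timeout
```

Either helper returns True when the process is believed to have exited, so the caller can raise an error on False. The second one is ambiguous near the deadline, which is part of why it is described as hacky.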
Thank you for investigating this. We should have a timeout to prevent the test from running forever. What will happen if it times out and we do not raise an error (because we cannot determine whether it timed out)? Will it then fail with the original error described by this issue?
I've looked into it a bit more and tried a few things. For each test I just ran
    while True:
        with run_gunicorn():
            pass
Test 1: proc.join
    proc.terminate()
    s = time.time()
    proc.join(40)
    print(time.time() - s)
proc.join mostly takes less than 0.5s, but occasionally takes just over 30s, sometimes along with a warning saying a gunicorn worker was killed. 30s is the default gunicorn worker timeout, which suggests there's some sort of race condition in the gunicorn worker shutdown, although it could also be something else...
If the proc.join timeout is less than 30 seconds, the error does appear when it times out.
Test 2: proc.join with a sleep before terminating
    time.sleep(0.2)  # Added this
    proc.terminate()
    s = time.time()
    proc.join(40)
    print(time.time() - s)
proc.join always seems to take less than 0.5 seconds. Seems to "fix" the race condition in the previous test.
If I only include the time.sleep but not the proc.join, the error starts appearing again.
Test 3: proc.kill()
This is probably the simplest solution, although it's probably not ideal; I'm not really sure.
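One way to combine the approaches tried above would be to attempt a graceful terminate first and fall back to kill only if the process is still alive after a timeout. This was not proposed in the thread; the helper name stop_with_fallback and the 5 second default are assumptions chosen for the sketch.

```python
import multiprocessing


def stop_with_fallback(proc: multiprocessing.Process, timeout: float = 5.0) -> bool:
    """Terminate proc, wait up to timeout seconds, and kill it if still alive.

    Hypothetical helper. Returns True if terminate() alone was enough,
    False if kill() was needed.
    """
    proc.terminate()
    proc.join(timeout)
    if not proc.is_alive():
        return True
    # kill() sends SIGKILL on Unix, which the process cannot catch, so any
    # graceful shutdown logic is skipped.
    proc.kill()
    # Reap the process so its resources are released before returning.
    proc.join()
    return False
```

The return value lets the caller log or fail the test when the forced kill was required, which would keep the occasional slow shutdown visible instead of hiding it.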
I uploaded the outputs for test 1, test 1 but with a 20s timeout, and test 2, to https://gist.github.com/wookie184/6f92d13a77c24e94efe14218adf28924. The code in test 2 (but probably with a shorter timeout) seems like it could be an alright solution, although I'm not completely sure. Thoughts?
The integration test failed in the job that ran when #173 was merged to main. We may not be cleaning up test resources properly.
From the logs: