Integration test failing

MarkKoz commented 1 year ago

Integration test failed in the job that run when #173 was merged to main. We may not be cleaning up test resources properly.

From the logs:

2023-05-09T21:12:56.6642706Z ....................2023-05-09 21:12:56,663 |    33 |                 gunicorn.error |    ERROR | Connection in use: ('0.0.0.0', 8060)
2023-05-09T21:12:56.6656639Z 2023-05-09 21:12:56,665 |    33 |                 gunicorn.error |    ERROR | Retrying in 1 second.
2023-05-09T21:12:57.6723942Z /root/.local/lib/python3.11/site-packages/gunicorn/sock.py:-1: ResourceWarning: unclosed <socket.socket fd=12, family=2, type=1, proto=0, laddr=('0.0.0.0', 0)>
2023-05-09T21:12:57.6724595Z ResourceWarning: Enable tracemalloc to get the object allocation traceback
2023-05-09T21:12:57.6728675Z 2023-05-09 21:12:57,672 |    33 |                 gunicorn.error |    ERROR | Connection in use: ('0.0.0.0', 8060)
2023-05-09T21:12:57.6734120Z 2023-05-09 21:12:57,672 |    33 |                 gunicorn.error |    ERROR | Retrying in 1 second.
2023-05-09T21:12:58.6740294Z 2023-05-09 21:12:58,673 |    33 |                 gunicorn.error |    ERROR | Connection in use: ('0.0.0.0', 8060)
2023-05-09T21:12:58.6740788Z 2023-05-09 21:12:58,673 |    33 |                 gunicorn.error |    ERROR | Retrying in 1 second.
2023-05-09T21:12:59.6747465Z 2023-05-09 21:12:59,674 |    33 |                 gunicorn.error |    ERROR | Connection in use: ('0.0.0.0', 8060)
2023-05-09T21:12:59.6748104Z 2023-05-09 21:12:59,674 |    33 |                 gunicorn.error |    ERROR | Retrying in 1 second.
2023-05-09T21:13:00.6754516Z 2023-05-09 21:13:00,674 |    33 |                 gunicorn.error |    ERROR | Connection in use: ('0.0.0.0', 8060)
2023-05-09T21:13:00.6755618Z 2023-05-09 21:13:00,675 |    33 |                 gunicorn.error |    ERROR | Retrying in 1 second.
2023-05-09T21:13:01.6759062Z 2023-05-09 21:13:01,675 |    33 |                 gunicorn.error |    ERROR | Can't connect to ('0.0.0.0', 8060)
2023-05-09T21:13:25.6810318Z E..............................................
2023-05-09T21:13:25.6810671Z ======================================================================
2023-05-09T21:13:25.6811289Z ERROR: test_memory_limit_separate_per_process (tests.test_integration.IntegrationTests.test_memory_limit_separate_per_process)
2023-05-09T21:13:25.6855942Z Each NsJail process should have its own memory limit.
2023-05-09T21:13:25.6856539Z ----------------------------------------------------------------------
2023-05-09T21:13:25.6856830Z Traceback (most recent call last):
2023-05-09T21:13:25.6857166Z   File "/snekbox/tests/test_integration.py", line 42, in test_memory_limit_separate_per_process
2023-05-09T21:13:25.6857468Z     with run_gunicorn():
2023-05-09T21:13:25.6857739Z   File "/usr/local/lib/python3.11/contextlib.py", line 137, in __enter__
2023-05-09T21:13:25.6858004Z     return next(self.gen)
2023-05-09T21:13:25.6858202Z            ^^^^^^^^^^^^^^
2023-05-09T21:13:25.6858709Z   File "/snekbox/tests/gunicorn_utils.py", line 76, in run_gunicorn
2023-05-09T21:13:25.6859071Z     raise RuntimeError(f"Gunicorn terminated unexpectedly with code {proc.exitcode}.")
2023-05-09T21:13:25.6859436Z RuntimeError: Gunicorn terminated unexpectedly with code 1.
2023-05-09T21:13:25.6859625Z 
2023-05-09T21:13:25.6859862Z ----------------------------------------------------------------------
2023-05-09T21:13:25.6860319Z Ran 67 tests in 29.605s
2023-05-09T21:13:25.6860451Z 
2023-05-09T21:13:25.6860532Z FAILED (errors=1)
2023-05-09T21:13:25.6868264Z 
2023-05-10T03:12:51.7214357Z ##[error]The operation was canceled.

wookie184 commented 1 year ago

I can reproduce this consistently by running

while True:
    with run_gunicorn():
        ...

and waiting a bit. It seems like with run_gunicorn() can finish before the actual process is dead, allowing a new process to be created which causes an error as the port is still in use by the last process.

run_gunicorn() terminates the process here: https://github.com/python-discord/snekbox/blob/9804a10a598b678225d299178113210f74a25392/tests/gunicorn_utils.py#L80

Adding a proc.join() after this seems to fix the issue.

I also tried proc.join with a timeout, but couldn't reliably check if it timed out or actually finished terminating afterwards. The docs suggest that you can check exitcode, but in some cases this was indicating the process hadn't finished even though proc.join() returned before the timeout duration.

If we do want a timeout (to raise an error if it hangs while trying to terminate) I'm not sure of a good way of implementing it. A couple of somewhat hacky ideas of waiting for the process to exit with a timeout:

Checking is_alive() or exitcode on the process in a loop
Using proc.join() and working out if it timed out by checking how long it took using time.time()

MarkKoz commented 1 year ago

Thank you for investigating this. We should have a timeout to prevent the test from running forever. What will happen if it times out and we do not raise an error (because we cannot determine whether it timed out)? Will it then fail with the original error described by this issue?

wookie184 commented 1 year ago

I've looked into it a bit more and tried a few things. For each test i just ran

while True:
    with run_gunicorn():
        pass

Test 1 - just `proc.join`

        proc.terminate()
        s = time.time()
        proc.join(40)
        print(time.time() - s)

The error never seems to appear.
proc.join mostly takes less than 0.5s, but occasionally takes just over 30s, sometimes along with a warning saying a gunicorn worker was killed. 30s is the default gunicorn worker timeout which suggests there's some sort of race condition with the gunicorn worker shutdown, although it could also be something else...

If the proc.join timeout is less than 30 seconds, when it times out the error does appear.

Test 2 - sleep before terminating and `proc.join`

        time.sleep(0.2)  # Added this
        proc.terminate()
        s = time.time()
        proc.join(40)
        print(time.time() - s)

The error never seems to appear.
proc.join always seems to take less than 0.5 seconds. Seems to "fix" the race condition in the previous test.

If I only include the time.sleep but not the proc.join the error starts appearing again.

Test 3 - kill instead of terminate

proc.kill()

The error never seems to appear.

This is probably the simplest solution, although it's probably not ideal, not really sure.

I uploaded the outputs for test 1, test 1 but with a 20s timeout, and test 2, to https://gist.github.com/wookie184/6f92d13a77c24e94efe14218adf28924. The code in test 2 (but with a shorter timeout probably) seems like it could be an alright solution, although i'm not compeltely sure. Thoughts?

python-discord / snekbox