python-discord / snekbox

Easy, safe evaluation of arbitrary Python code
https://pythondiscord.com
MIT License
214 stars 39 forks source link

Integration test failing #177

Closed MarkKoz closed 1 year ago

MarkKoz commented 1 year ago

Integration test failed in the job that run when #173 was merged to main. We may not be cleaning up test resources properly.

From the logs:

2023-05-09T21:12:56.6642706Z ....................2023-05-09 21:12:56,663 |    33 |                 gunicorn.error |    ERROR | Connection in use: ('0.0.0.0', 8060)
2023-05-09T21:12:56.6656639Z 2023-05-09 21:12:56,665 |    33 |                 gunicorn.error |    ERROR | Retrying in 1 second.
2023-05-09T21:12:57.6723942Z /root/.local/lib/python3.11/site-packages/gunicorn/sock.py:-1: ResourceWarning: unclosed <socket.socket fd=12, family=2, type=1, proto=0, laddr=('0.0.0.0', 0)>
2023-05-09T21:12:57.6724595Z ResourceWarning: Enable tracemalloc to get the object allocation traceback
2023-05-09T21:12:57.6728675Z 2023-05-09 21:12:57,672 |    33 |                 gunicorn.error |    ERROR | Connection in use: ('0.0.0.0', 8060)
2023-05-09T21:12:57.6734120Z 2023-05-09 21:12:57,672 |    33 |                 gunicorn.error |    ERROR | Retrying in 1 second.
2023-05-09T21:12:58.6740294Z 2023-05-09 21:12:58,673 |    33 |                 gunicorn.error |    ERROR | Connection in use: ('0.0.0.0', 8060)
2023-05-09T21:12:58.6740788Z 2023-05-09 21:12:58,673 |    33 |                 gunicorn.error |    ERROR | Retrying in 1 second.
2023-05-09T21:12:59.6747465Z 2023-05-09 21:12:59,674 |    33 |                 gunicorn.error |    ERROR | Connection in use: ('0.0.0.0', 8060)
2023-05-09T21:12:59.6748104Z 2023-05-09 21:12:59,674 |    33 |                 gunicorn.error |    ERROR | Retrying in 1 second.
2023-05-09T21:13:00.6754516Z 2023-05-09 21:13:00,674 |    33 |                 gunicorn.error |    ERROR | Connection in use: ('0.0.0.0', 8060)
2023-05-09T21:13:00.6755618Z 2023-05-09 21:13:00,675 |    33 |                 gunicorn.error |    ERROR | Retrying in 1 second.
2023-05-09T21:13:01.6759062Z 2023-05-09 21:13:01,675 |    33 |                 gunicorn.error |    ERROR | Can't connect to ('0.0.0.0', 8060)
2023-05-09T21:13:25.6810318Z E..............................................
2023-05-09T21:13:25.6810671Z ======================================================================
2023-05-09T21:13:25.6811289Z ERROR: test_memory_limit_separate_per_process (tests.test_integration.IntegrationTests.test_memory_limit_separate_per_process)
2023-05-09T21:13:25.6855942Z Each NsJail process should have its own memory limit.
2023-05-09T21:13:25.6856539Z ----------------------------------------------------------------------
2023-05-09T21:13:25.6856830Z Traceback (most recent call last):
2023-05-09T21:13:25.6857166Z   File "/snekbox/tests/test_integration.py", line 42, in test_memory_limit_separate_per_process
2023-05-09T21:13:25.6857468Z     with run_gunicorn():
2023-05-09T21:13:25.6857739Z   File "/usr/local/lib/python3.11/contextlib.py", line 137, in __enter__
2023-05-09T21:13:25.6858004Z     return next(self.gen)
2023-05-09T21:13:25.6858202Z            ^^^^^^^^^^^^^^
2023-05-09T21:13:25.6858709Z   File "/snekbox/tests/gunicorn_utils.py", line 76, in run_gunicorn
2023-05-09T21:13:25.6859071Z     raise RuntimeError(f"Gunicorn terminated unexpectedly with code {proc.exitcode}.")
2023-05-09T21:13:25.6859436Z RuntimeError: Gunicorn terminated unexpectedly with code 1.
2023-05-09T21:13:25.6859625Z 
2023-05-09T21:13:25.6859862Z ----------------------------------------------------------------------
2023-05-09T21:13:25.6860319Z Ran 67 tests in 29.605s
2023-05-09T21:13:25.6860451Z 
2023-05-09T21:13:25.6860532Z FAILED (errors=1)
2023-05-09T21:13:25.6868264Z 
2023-05-10T03:12:51.7214357Z ##[error]The operation was canceled.
wookie184 commented 1 year ago

I can reproduce this consistently by running

while True:
    with run_gunicorn():
        ...

and waiting a bit. It seems like with run_gunicorn() can finish before the actual process is dead, allowing a new process to be created which causes an error as the port is still in use by the last process.

run_gunicorn() terminates the process here: https://github.com/python-discord/snekbox/blob/9804a10a598b678225d299178113210f74a25392/tests/gunicorn_utils.py#L80

Adding a proc.join() after this seems to fix the issue.

I also tried proc.join with a timeout, but couldn't reliably check if it timed out or actually finished terminating afterwards. The docs suggest that you can check exitcode, but in some cases this was indicating the process hadn't finished even though proc.join() returned before the timeout duration.

If we do want a timeout (to raise an error if it hangs while trying to terminate) I'm not sure of a good way of implementing it. A couple of somewhat hacky ideas of waiting for the process to exit with a timeout:

MarkKoz commented 1 year ago

Thank you for investigating this. We should have a timeout to prevent the test from running forever. What will happen if it times out and we do not raise an error (because we cannot determine whether it timed out)? Will it then fail with the original error described by this issue?

wookie184 commented 1 year ago

I've looked into it a bit more and tried a few things. For each test i just ran

while True:
    with run_gunicorn():
        pass

Test 1 - just proc.join

        proc.terminate()
        s = time.time()
        proc.join(40)
        print(time.time() - s)

If the proc.join timeout is less than 30 seconds, when it times out the error does appear.

Test 2 - sleep before terminating and proc.join

        time.sleep(0.2)  # Added this
        proc.terminate()
        s = time.time()
        proc.join(40)
        print(time.time() - s)

If I only include the time.sleep but not the proc.join the error starts appearing again.

Test 3 - kill instead of terminate

proc.kill()

This is probably the simplest solution, although it's probably not ideal, not really sure.


I uploaded the outputs for test 1, test 1 but with a 20s timeout, and test 2, to https://gist.github.com/wookie184/6f92d13a77c24e94efe14218adf28924. The code in test 2 (but with a shorter timeout probably) seems like it could be an alright solution, although i'm not compeltely sure. Thoughts?