openHPI / poseidon

Scalable task execution orchestrator for CodeOcean
MIT License
8 stars 1 forks source link

Fix idle runner being memory leaked #601

Closed mpass99 closed 4 months ago

mpass99 commented 4 months ago

when its allocation is restarted by Nomad.

Related to #591. Fix logic created in 354c16cc.

We have noticed that some runners continue sending monitoring data even if their inactivity timer must have been passed. This results in a memory leak.
The logs of the affected runners show the pattern that the runners are restarted while still being in the idleRunner state.

In the code, we can verify that new Runners are created when Nomad events restart idle runners, but the old ones are just unlinked but not destroyed.

Logs

``` 10-1707ab95-0740-11ef-b832-fa163e7afdf8 time="2024-04-30T22:22:12.778722Z" level=debug msg="Runner started" package=runner runner_id=10-1707ab95-0740-11ef-b832-fa163e7afdf8 startupDuration=992.568518ms time="2024-05-01T01:36:30.970661Z" level=debug msg="Runner stopped" package=runner runner_id=10-1707ab95-0740-11ef-b832-fa163e7afdf8 time="2024-05-01T01:36:31.194869Z" level=debug msg="Runner started" package=runner runner_id=10-1707ab95-0740-11ef-b832-fa163e7afdf8 startupDuration=224.194539ms time="2024-05-01T01:36:37.639731Z" level=debug msg="Runner stopped" package=runner runner_id=10-1707ab95-0740-11ef-b832-fa163e7afdf8 time="2024-05-01T01:36:39.293207Z" level=debug msg="Runner started" package=runner runner_id=10-1707ab95-0740-11ef-b832-fa163e7afdf8 startupDuration=1.653457979s time="2024-05-01T02:29:36.008450Z" level=debug msg="Runner stopped" package=runner runner_id=10-1707ab95-0740-11ef-b832-fa163e7afdf8 time="2024-05-01T02:29:36.271641Z" level=debug msg="Runner started" package=runner runner_id=10-1707ab95-0740-11ef-b832-fa163e7afdf8 startupDuration=263.170034ms time="2024-05-01T02:29:42.646686Z" level=debug msg="Runner stopped" package=runner runner_id=10-1707ab95-0740-11ef-b832-fa163e7afdf8 time="2024-05-01T02:29:44.289026Z" level=debug msg="Runner started" package=runner runner_id=10-1707ab95-0740-11ef-b832-fa163e7afdf8 startupDuration=1.642315797s time="2024-05-01T05:46:47.704503Z" level=debug code=204 duration=109.500811ms method=PATCH path=/api/v1/runners/10-1707ab95-0740-11ef-b832-fa163e7afdf8/files time="2024-05-01T05:46:47.812686Z" level=debug code=200 duration="66.446µs" method=POST path=/api/v1/runners/10-1707ab95-0740-11ef-b832-fa163e7afdf8/execute time="2024-05-01T05:46:47.833931Z" level=info msg="Running execution" environment_id=10 executionID=9eb1f845-9351-4d3b-b453-a76bc41a3747 package=api runner_id=10-1707ab95-0740-11ef-b832-fa163e7afdf8 time="2024-05-01T05:46:48.308444Z" level=info msg="Execution returned" environment_id=10 package=api runner_id=10-1707ab95-0740-11ef-b832-fa163e7afdf8 time="2024-05-01T05:46:48.308503Z" level=debug code=200 duration=474.645284ms method=GET path=/api/v1/runners/10-1707ab95-0740-11ef-b832-fa163e7afdf8/websocket user_agent= time="2024-05-01T05:46:48.432376Z" level=debug code=200 duration=114.368078ms method=GET path=/api/v1/runners/10-1707ab95-0740-11ef-b832-fa163e7afdf8/files time="2024-05-01T05:47:48.230770Z" level=debug code=204 duration=114.325427ms method=PATCH path=/api/v1/runners/10-1707ab95-0740-11ef-b832-fa163e7afdf8/files time="2024-05-01T05:47:48.333332Z" level=debug code=200 duration="105.811µs" method=POST path=/api/v1/runners/10-1707ab95-0740-11ef-b832-fa163e7afdf8/execute time="2024-05-01T05:47:48.355094Z" level=info msg="Running execution" environment_id=10 executionID=185e6e1e-81ab-4d5e-9381-108e698ac5e3 package=api runner_id=10-1707ab95-0740-11ef-b832-fa163e7afdf8 time="2024-05-01T05:47:48.819423Z" level=info msg="Execution returned" environment_id=10 package=api runner_id=10-1707ab95-0740-11ef-b832-fa163e7afdf8 time="2024-05-01T05:47:48.819475Z" level=debug code=200 duration=464.476081ms method=GET path=/api/v1/runners/10-1707ab95-0740-11ef-b832-fa163e7afdf8/websocket user_agent= time="2024-05-01T05:47:48.943934Z" level=debug code=200 duration=119.057936ms method=GET path=/api/v1/runners/10-1707ab95-0740-11ef-b832-fa163e7afdf8/files time="2024-05-01T05:48:07.120758Z" level=debug code=204 duration=105.314738ms method=PATCH path=/api/v1/runners/10-1707ab95-0740-11ef-b832-fa163e7afdf8/files time="2024-05-01T05:48:07.230021Z" level=debug code=200 duration="65.564µs" method=POST path=/api/v1/runners/10-1707ab95-0740-11ef-b832-fa163e7afdf8/execute time="2024-05-01T05:48:07.255425Z" level=info msg="Running execution" environment_id=10 executionID=95f078fa-de72-407a-97cb-abd06b2e37e5 package=api runner_id=10-1707ab95-0740-11ef-b832-fa163e7afdf8 time="2024-05-01T05:48:22.257766Z" level=info msg="Execution returned" environment_id=10 package=api runner_id=10-1707ab95-0740-11ef-b832-fa163e7afdf8 time="2024-05-01T05:48:25.257865Z" level=info msg="the execution did not stop after SIGQUIT" package=runner runner_id=10-1707ab95-0740-11ef-b832-fa163e7afdf8 time="2024-05-01T05:48:25.258122Z" level=debug msg="Destroying Runner" destroy_reason="the execution did not stop after SIGQUIT" package=runner runner_id=10-1707ab95-0740-11ef-b832-fa163e7afdf8 time="2024-05-01T05:48:25.263306Z" level=debug msg="Execution canceled by context" environment_id=10 package=nomad runner_id=10-1707ab95-0740-11ef-b832-fa163e7afdf8 time="2024-05-01T05:48:25.263334Z" level=debug msg="Execution canceled by context" environment_id=10 package=nomad runner_id=10-1707ab95-0740-11ef-b832-fa163e7afdf8 time="2024-05-01T05:48:27.500283Z" level=debug code=200 duration=20.244971308s method=GET path=/api/v1/runners/10-1707ab95-0740-11ef-b832-fa163e7afdf8/websocket user_agent= time="2024-05-01T05:48:35.684094Z" level=debug code=410 duration="33.474µs" method=DELETE path=/api/v1/runners/10-1707ab95-0740-11ef-b832-fa163e7afdf8 time="2024-05-01T05:49:32.284637Z" level=debug code=410 duration="36.641µs" method=PATCH path=/api/v1/runners/10-1707ab95-0740-11ef-b832-fa163e7afdf8/files 10-42634265-073b-11ef-b832-fa163e7afdf8 time="2024-04-30T21:47:38.133840Z" level=debug msg="Runner started" package=runner runner_id=10-42634265-073b-11ef-b832-fa163e7afdf8 startupDuration=1.090064648s time="2024-05-01T02:29:35.622070Z" level=debug msg="Runner stopped" package=runner runner_id=10-42634265-073b-11ef-b832-fa163e7afdf8 time="2024-05-01T02:29:36.004532Z" level=debug msg="Runner started" package=runner runner_id=10-42634265-073b-11ef-b832-fa163e7afdf8 startupDuration=382.443508ms time="2024-05-01T02:29:42.720641Z" level=debug msg="Runner stopped" package=runner runner_id=10-42634265-073b-11ef-b832-fa163e7afdf8 time="2024-05-01T02:29:44.291501Z" level=debug msg="Runner started" package=runner runner_id=10-42634265-073b-11ef-b832-fa163e7afdf8 startupDuration=1.57083017s time="2024-05-01T03:14:06.486337Z" level=debug code=204 duration=115.412113ms method=PATCH path=/api/v1/runners/10-42634265-073b-11ef-b832-fa163e7afdf8/files time="2024-05-01T03:14:06.611415Z" level=debug code=200 duration="100.001µs" method=POST path=/api/v1/runners/10-42634265-073b-11ef-b832-fa163e7afdf8/execute time="2024-05-01T03:14:06.636978Z" level=info msg="Running execution" environment_id=10 executionID=caee40de-54ce-404a-b556-71ebd1adfaf7 package=api runner_id=10-42634265-073b-11ef-b832-fa163e7afdf8 time="2024-05-01T03:14:07.129213Z" level=info msg="Execution returned" environment_id=10 package=api runner_id=10-42634265-073b-11ef-b832-fa163e7afdf8 time="2024-05-01T03:14:07.129493Z" level=debug code=200 duration=492.582544ms method=GET path=/api/v1/runners/10-42634265-073b-11ef-b832-fa163e7afdf8/websocket user_agent= time="2024-05-01T03:14:07.258599Z" level=debug code=200 duration=121.62938ms method=GET path=/api/v1/runners/10-42634265-073b-11ef-b832-fa163e7afdf8/files time="2024-05-01T03:16:27.873300Z" level=debug code=204 duration=114.17337ms method=PATCH path=/api/v1/runners/10-42634265-073b-11ef-b832-fa163e7afdf8/files time="2024-05-01T03:16:27.973027Z" level=debug code=200 duration="132.132µs" method=POST path=/api/v1/runners/10-42634265-073b-11ef-b832-fa163e7afdf8/execute time="2024-05-01T03:16:28.000494Z" level=info msg="Running execution" environment_id=10 executionID=e0ca0d60-3bed-4e23-80eb-a639c4c07748 package=api runner_id=10-42634265-073b-11ef-b832-fa163e7afdf8 time="2024-05-01T03:16:28.468902Z" level=info msg="Execution returned" environment_id=10 package=api runner_id=10-42634265-073b-11ef-b832-fa163e7afdf8 time="2024-05-01T03:16:28.468959Z" level=debug code=200 duration=468.524963ms method=GET path=/api/v1/runners/10-42634265-073b-11ef-b832-fa163e7afdf8/websocket user_agent= time="2024-05-01T03:16:28.590227Z" level=debug code=200 duration=113.610798ms method=GET path=/api/v1/runners/10-42634265-073b-11ef-b832-fa163e7afdf8/files time="2024-05-01T03:17:03.295903Z" level=debug code=204 duration=107.03668ms method=PATCH path=/api/v1/runners/10-42634265-073b-11ef-b832-fa163e7afdf8/files time="2024-05-01T03:17:03.406403Z" level=debug code=200 duration="64.443µs" method=POST path=/api/v1/runners/10-42634265-073b-11ef-b832-fa163e7afdf8/execute time="2024-05-01T03:17:03.426941Z" level=info msg="Running execution" environment_id=10 executionID=afebd292-ee87-4778-aa2f-503f6ac1d454 package=api runner_id=10-42634265-073b-11ef-b832-fa163e7afdf8 time="2024-05-01T03:17:03.942834Z" level=info msg="Execution returned" environment_id=10 package=api runner_id=10-42634265-073b-11ef-b832-fa163e7afdf8 time="2024-05-01T03:17:03.942885Z" level=debug code=200 duration=516.001604ms method=GET path=/api/v1/runners/10-42634265-073b-11ef-b832-fa163e7afdf8/websocket user_agent= time="2024-05-01T03:17:04.062606Z" level=debug code=200 duration=113.708876ms method=GET path=/api/v1/runners/10-42634265-073b-11ef-b832-fa163e7afdf8/files time="2024-05-01T03:20:03.949353Z" level=debug msg="Destroying Runner" destroy_reason="runner inactivity timeout exceeded" package=runner runner_id=10-42634265-073b-11ef-b832-fa163e7afdf8 time="2024-05-01T03:20:03.961675Z" level=info msg="Returning runner due to inactivity timeout" package=runner runner_id=10-42634265-073b-11ef-b832-fa163e7afdf8 time="2024-05-01T03:21:56.732775Z" level=debug code=410 duration=1.104425ms method=PATCH path=/api/v1/runners/10-42634265-073b-11ef-b832-fa163e7afdf8/files 10-65c81476-03c0-11ef-b832-fa163e7afdf8 time="2024-04-26T11:30:35.367481Z" level=debug msg="Runner started" package=runner runner_id=10-65c81476-03c0-11ef-b832-fa163e7afdf8 startupDuration=470.69834ms time="2024-04-26T11:34:18.670189Z" level=warning msg="Loading and synchronizing Runners failed! Retrying..." error="synchronize runners failed: nomad Event Stream failed!: error receiving events: unexpected EOF" package=environment time="2024-04-26T11:34:19.672280Z" level=debug msg="Destroying Runner" destroy_reason="the destruction should not cause external changes" package=runner runner_id=10-65c81476-03c0-11ef-b832-fa163e7afdf8 time="2024-04-26T11:34:19.672288Z" level=debug msg="Runner destroyed locally" destroy_reason="the destruction should not cause external changes" package=runner runner_id=10-65c81476-03c0-11ef-b832-fa163e7afdf8 time="2024-04-26T11:34:21.730977Z" level=debug msg="Recovered Allocation" jobID=10-65c81476-03c0-11ef-b832-fa163e7afdf8 package=nomad status=running time="2024-04-26T11:34:21.757387Z" level=debug msg="Recovered Runner" isUsed=false package=runner runner_id=10-65c81476-03c0-11ef-b832-fa163e7afdf8 time="2024-04-26T11:34:56.840117Z" level=debug msg="Runner stopped" package=runner runner_id=10-65c81476-03c0-11ef-b832-fa163e7afdf8 time="2024-04-26T11:34:56.992281Z" level=debug msg="Runner started" package=runner runner_id=10-65c81476-03c0-11ef-b832-fa163e7afdf8 startupDuration=152.138704ms time="2024-04-26T11:35:04.226674Z" level=debug code=204 duration=119.948017ms method=PATCH path=/api/v1/runners/10-65c81476-03c0-11ef-b832-fa163e7afdf8/files time="2024-04-26T11:35:04.319473Z" level=debug code=200 duration="121.733µs" method=POST path=/api/v1/runners/10-65c81476-03c0-11ef-b832-fa163e7afdf8/execute time="2024-04-26T11:35:04.343181Z" level=info msg="Running execution" environment_id=10 executionID=1bc9e9fd-34e6-4790-beae-4533e2f8546b package=api runner_id=10-65c81476-03c0-11ef-b832-fa163e7afdf8 time="2024-04-26T11:35:04.912407Z" level=info msg="Execution returned" environment_id=10 package=api runner_id=10-65c81476-03c0-11ef-b832-fa163e7afdf8 time="2024-04-26T11:35:04.912469Z" level=debug code=200 duration=569.389067ms method=GET path=/api/v1/runners/10-65c81476-03c0-11ef-b832-fa163e7afdf8/websocket user_agent= time="2024-04-26T11:35:05.039795Z" level=debug code=200 duration=121.01919ms method=GET path=/api/v1/runners/10-65c81476-03c0-11ef-b832-fa163e7afdf8/files time="2024-04-26T11:35:26.208897Z" level=debug msg="Runner started" package=runner runner_id=10-65c81476-03c0-11ef-b832-fa163e7afdf8 startupDuration=763.967014ms time="2024-04-26T11:35:26.209023Z" level=error msg="Started Runner is already in use" package=runner runner_id=10-65c81476-03c0-11ef-b832-fa163e7afdf8 states="map[default-task:0xc003da0990]" time="2024-04-26T11:35:26.254551Z" level=debug msg="Runner stopped" package=runner runner_id=10-65c81476-03c0-11ef-b832-fa163e7afdf8 time="2024-04-26T11:35:26.254676Z" level=debug msg="Destroying Runner" destroy_reason="the allocation completed" package=runner runner_id=10-65c81476-03c0-11ef-b832-fa163e7afdf8 time="2024-04-26T11:35:39.721194Z" level=debug code=410 duration="45.778µs" method=PATCH path=/api/v1/runners/10-65c81476-03c0-11ef-b832-fa163e7afdf8/files time="2024-04-26T11:49:44.610704Z" level=warning msg="Prewarming Pool Alert. Reloading environment" environment_id=10 package=runner ```

codecov[bot] commented 4 months ago

Codecov Report

Attention: Patch coverage is 78.57143% with 3 lines in your changes are missing coverage. Please review.

Project coverage is 77.12%. Comparing base (b9e978c) to head (253ce97). Report is 2 commits behind head on main.

Files Patch % Lines
internal/runner/nomad_manager.go 80.00% 1 Missing and 1 partial :warning:
internal/environment/aws_environment.go 0.00% 1 Missing :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #601 +/- ## ========================================== - Coverage 77.15% 77.12% -0.04% ========================================== Files 41 41 Lines 3493 3497 +4 ========================================== + Hits 2695 2697 +2 - Misses 586 587 +1 - Partials 212 213 +1 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.