openHPI / poseidon

Scalable task execution orchestrator for CodeOcean

Instant Nomad Allocation Restart leads to Runner Memory Leak #602

Closed: mpass99 closed this issue 1 month ago

mpass99 commented 4 months ago

Related to #591

In the case of runner 10-0331c7d8-03c1-11ef-b832-fa163e7afdf8, we see that the runner is started twice and deleted directly after the second creation.

time="2024-04-26T11:34:59.570546Z" level=debug msg="Runner started" package=runner runner_id=10-0331c7d8-03c1-11ef-b832-fa163e7afdf8 startupDuration=570.731307ms
time="2024-04-26T11:35:26.198631Z" level=debug msg="Runner started" package=runner runner_id=10-0331c7d8-03c1-11ef-b832-fa163e7afdf8 startupDuration=705.586839ms
time="2024-04-26T11:35:26.198780Z" level=debug msg="Destroying Runner" destroy_reason="the runner will be destroyed and replaced: the destruction should not cause external changes" package=runner runner_id=10-0331c7d8-03c1-11ef-b832-fa163e7afdf8
time="2024-04-26T11:35:26.198923Z" level=debug msg="Runner destroyed locally" destroy_reason="the runner will be destroyed and replaced: the destruction should not cause external changes" package=runner runner_id=10-0331c7d8-03c1-11ef-b832-fa163e7afdf8
time="2024-04-26T11:35:26.239789Z" level=debug msg="Runner stopped" package=runner runner_id=10-0331c7d8-03c1-11ef-b832-fa163e7afdf8

When inspecting the InfluxDB Nomad Allocation Events, we see that the Nomad allocation restart at 11:35:26.285 happens instantly.

| _measurement | _time | client_status | event_type | id | job_id | nomad_agent |
|---|---|---|---|---|---|---|
| poseidon_nomad_allocations | 2024-04-26T11:34:59.274Z | pending | creation | 84a734a1-5573-6116-5678-86060ce4c479 | 10-0331c7d8-03c1-11ef-b832-fa163e7afdf8 | nomad-agent-terraform-4 |
| poseidon_nomad_allocations | 2024-04-26T11:35:00.275Z | running | creation | 84a734a1-5573-6116-5678-86060ce4c479 | 10-0331c7d8-03c1-11ef-b832-fa163e7afdf8 | nomad-agent-terraform-4 |
| poseidon_nomad_allocations | 2024-04-26T11:35:25.279Z | running | creation | 84a734a1-5573-6116-5678-86060ce4c479 | 10-0331c7d8-03c1-11ef-b832-fa163e7afdf8 | nomad-agent-terraform-4 |
| poseidon_nomad_allocations | 2024-04-26T11:35:26.285Z | running | creation | 28e08715-38a9-42b3-8f77-0a14ee68b482 | 10-0331c7d8-03c1-11ef-b832-fa163e7afdf8 | nomad-agent-terraform-1 |
| poseidon_nomad_allocations | 2024-04-26T11:35:26.285Z | complete | deletion | 84a734a1-5573-6116-5678-86060ce4c479 | 10-0331c7d8-03c1-11ef-b832-fa163e7afdf8 | nomad-agent-terraform-4 |
| poseidon_nomad_allocations | 2024-04-26T11:35:26.285Z | pending | creation | 28e08715-38a9-42b3-8f77-0a14ee68b482 | 10-0331c7d8-03c1-11ef-b832-fa163e7afdf8 | nomad-agent-terraform-1 |
| poseidon_nomad_allocations | 2024-04-26T11:35:26.285Z | complete | creation | 84a734a1-5573-6116-5678-86060ce4c479 | 10-0331c7d8-03c1-11ef-b832-fa163e7afdf8 | nomad-agent-terraform-4 |

The Poseidon logs show that the second creation event was handled first, and only afterwards the stop of the previous allocation.

This leads to (a) the first started runner never being destroyed, causing a memory leak, and (b) one idle runner too few being available.

To fix this, we might tie the runner more closely to the allocation ID than to the runner ID. This can be done either by (1) adding the allocation ID to the Nomad Runner object or (2) extending the allocation ID handling in the Event Stream Handling.
Both have their up- and downsides: (1) might create the impression that multiple runners could have the same runner ID, which should not happen; (2) might increase the complexity of the already complex Event Stream Handling.
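For illustration, a minimal sketch in Go of what option (1) could look like; the `NomadJob` struct, its fields, and the `BelongsToAllocation` method are hypothetical and only show the idea, not Poseidon's actual implementation:

```go
// Sketch of option (1): store the backing allocation ID on the runner object.
// All names and fields are illustrative, not Poseidon's actual types.
package runner

// NomadJob represents a runner backed by a Nomad allocation.
type NomadJob struct {
	id           string // runner ID, e.g. "10-0331c7d8-03c1-11ef-b832-fa163e7afdf8"
	allocationID string // ID of the Nomad allocation currently backing this runner
}

// BelongsToAllocation reports whether an allocation event refers to the
// allocation that currently backs this runner. Events of a stale allocation
// (such as the delayed stop of a replaced allocation) could then be ignored.
func (r *NomadJob) BelongsToAllocation(allocationID string) bool {
	return r.allocationID == allocationID
}
```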

MrSerth commented 4 months ago

Thanks for identifying this issue and providing two suggestions on how to resolve it. What would be your recommendation? I see that both approaches have disadvantages, and I find it difficult to decide on one. My main concern with the first one would be that it could lead to data inconsistencies (with the runner management), but I cannot say how difficult the changes to the Event Stream Handling would be.

mpass99 commented 1 month ago

One solution via the Event Stream Handling would be to track the node: complete events would only be handled if they originate from the node that backed the runner directly before.
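A rough sketch of this node-tracking idea, assuming a handler that receives Nomad allocation updates; apart from the Nomad API constants and fields, all names are illustrative and not Poseidon's actual event stream code:

```go
// Sketch: only handle "complete" events from the node that currently backs the runner.
package nomad

import "github.com/hashicorp/nomad/api"

// lastNodeOfRunner remembers, per job (runner) ID, the node of its most recent running allocation.
var lastNodeOfRunner = map[string]string{}

func handleAllocationEvent(alloc *api.Allocation) {
	switch alloc.ClientStatus {
	case api.AllocClientStatusRunning:
		// Remember which node currently backs the runner.
		lastNodeOfRunner[alloc.JobID] = alloc.NodeID
	case api.AllocClientStatusComplete:
		// Ignore complete events from a node that no longer backs the runner,
		// i.e. the stop of an allocation that has already been replaced.
		if node, ok := lastNodeOfRunner[alloc.JobID]; ok && node != alloc.NodeID {
			return
		}
		// ... destroy the runner here ...
	}
}
```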

However, another solution (at least for this specific situation) is to destroy the runner already when an allocation desires to stop. Previously, we removed the runner only once the allocation had stopped completely. In the event history of this specific case, there was enough time between the first allocation announcing its stop and the second allocation being fully running. Therefore, the change fixes the failure in this case.
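A minimal sketch of this approach, assuming the event stream handler can inspect the allocation's desired status; the `shouldDestroyRunner` helper is hypothetical:

```go
// Sketch: react to the allocation announcing its stop (DesiredStatus) instead of
// waiting for the client to report it as complete (ClientStatus).
package nomad

import "github.com/hashicorp/nomad/api"

// shouldDestroyRunner decides whether an allocation update should trigger the
// destruction of the associated runner.
func shouldDestroyRunner(alloc *api.Allocation) bool {
	// Previously, the runner was removed only once the allocation had fully
	// stopped (ClientStatus "complete"). Reacting to the pending stop
	// (DesiredStatus "stop") happens before the replacement allocation is
	// fully running and therefore avoids the event ordering race.
	return alloc.DesiredStatus == api.AllocDesiredStatusStop ||
		alloc.ClientStatus == api.AllocClientStatusComplete
}
```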

I decided to go with the second solution to introduce no further complexity and keep the code changes minimal. @MrSerth Do you agree with this solution?

Either way, we should observe the Nomad behavior closely in the near future.

MrSerth commented 1 month ago

Yes, I am fine with the solution you proposed; this makes sense -- thank you! Let's monitor Nomad, as you suggested (we need to do so anyway, I think, due to the vast number of changes).

MrSerth commented 1 month ago

We didn't notice any other occurrence, and the memory footprint has improved, so we are closing this issue as completed.