API Server and Agents need to handle agents restart gracefully #7

tricorder-observability / Starship

Starship: next-generation Observability platform built with eBPF+WASM

GNU Affero General Public License v3.0


Right now, when agents restart, they lose all of their deployed modules (BCC eBPF programs are unloaded automatically by the kernel, and the WASM module is destroyed along with the process), and the API server does not redeploy those modules.

The goal of this issue is to implement the logic so that:

When an agent restarts, its previously deployed modules are redeployed.

We thought about adding SQLite to the agent, but realized that it's not necessary: the API server should already have the desired state, so we can avoid the extra complexity.

One approach could be for agents to report their state upon establishing a connection with the API server. The API server, based on its stored desired state of the modules, should then inform the agents (i.e., let agents know the desired states of the modules), and agents should react accordingly.
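To make the idea concrete, here is a minimal sketch of the comparison the API server could run when an agent connects and reports its state. All of the names (`ModuleState`, `Action`, `Reconcile`) are hypothetical and are not taken from the Starship codebase; the sketch also assumes module state can be reduced to deployed / not deployed.

```go
package main

import (
	"fmt"
	"sort"
)

// ModuleState is a hypothetical two-value state used only in this sketch.
type ModuleState int

const (
	Undeployed ModuleState = iota // zero value: also what a missing map key yields
	Deployed
)

// Action is one instruction the API server would send to the agent.
type Action struct {
	ModuleID string
	Deploy   bool // true: deploy the module, false: undeploy it
}

// Reconcile compares the desired state stored by the API server with the
// state reported by an agent, and returns the actions needed to make the
// agent match the desired state.
func Reconcile(desired, reported map[string]ModuleState) []Action {
	var actions []Action
	// Modules that should be running but are not on the agent.
	for id, want := range desired {
		if want == Deployed && reported[id] != Deployed {
			actions = append(actions, Action{ModuleID: id, Deploy: true})
		}
	}
	// Modules running on the agent that should not be.
	for id, have := range reported {
		if have == Deployed && desired[id] != Deployed {
			actions = append(actions, Action{ModuleID: id, Deploy: false})
		}
	}
	// Map iteration order is random, so sort for a deterministic result.
	sort.Slice(actions, func(i, j int) bool {
		return actions[i].ModuleID < actions[j].ModuleID
	})
	return actions
}

func main() {
	desired := map[string]ModuleState{"m1": Deployed, "m2": Deployed, "m3": Undeployed}
	// A freshly restarted agent reports no deployed modules.
	reported := map[string]ModuleState{}
	for _, a := range Reconcile(desired, reported) {
		fmt.Printf("%s deploy=%v\n", a.ModuleID, a.Deploy)
	}
}
```

With this shape, a restarted agent simply reports an empty state and gets back a deploy action for every module the API server still wants running, so no agent-side storage is needed.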

One way I can think of (I'm not really familiar with eBPF/BCC/WASM):

Separate the agent and the module at the process level:

The agent launches a new process when it deploys a module.
The agent communicates with the module by RPC (or another method).

The agent checks module health by heartbeat (sent from the module to the agent).
The agent collects module data by RPC (or shared memory).

If the agent restarts, it won't kill the module, so the module won't be destroyed. After the agent restarts successfully, the module will try to reconnect to the agent (by retrying automatically), so everything goes back to normal.

Problems

What if the agent never restarts? How should the module be handled (exit after retries are exhausted?)
How long should the agent wait for a module to reconnect? (The agent needs to know whether it has to launch a new module, based on the module instance.)
...
