tricorder-observability / Starship

Starship: next-generation Observability platform built with eBPF+WASM
GNU Affero General Public License v3.0

Agent gracefully reconnecting API Server after streaming channel is broken #141

Open nascentcore-eng opened 1 year ago

nascentcore-eng commented 1 year ago

Is your feature request related to a problem? Please describe. Right now, when the gRPC streaming channel between the API Server and the agent breaks, the agent crashes. This is actually OK, as a broken gRPC streaming channel usually means the API Server restarted, and the agent crashing and restarting acts as a sort of poor man's reconnect logic. Steps to reproduce:

git clone git@github.com:tricorder-observability/Starship.git
cd Starship
minikube start -p ${USER} --cpus=8 --memory=8196
skaffold run -f tools/skaffold/skaffold.yaml -n skaffold-tricorder
kubectl delete pod <api-server-pod>

After the API Server pod is deleted, the agent crashes and restarts.

Describe the solution you'd like The agent should wrap StartModuleDeployLoop() in an outer for loop and keep invoking it whenever the streaming channel breaks, instead of crashing.

Are you on Kubernetes Yes

Kernel version N/A

Describe alternatives you've considered N/A

Additional context N/A

feuyeux commented 1 year ago

Mark: reconnect to the next endpoint from the service list, with timeout and retry ...