openzipkin / zipkin

Zipkin is a distributed tracing system
https://zipkin.io/
Apache License 2.0

Document /health endpoints, useful for livenessProbe, readinessProbe in k8s deployments. #3688

Open Mistobaan opened 6 years ago

Mistobaan commented 6 years ago

I am trying to deploy the Docker image gcr.io/stackdriver-trace-docker/zipkin-collector inside a Kubernetes cluster. I was wondering what would be the best endpoint to use for the livenessProbe / readinessProbe of a Kubernetes pod.

Mistobaan commented 6 years ago

I just noticed in the startup logs that /health is enabled

bogdandrutu commented 6 years ago

What can I do to help?

negz commented 6 years ago

Just chiming in - I run the zipkin collector in a Kubernetes cluster and I'm using the /health endpoint for liveness and readiness probes without issue. It would be nice if the endpoint were documented.

Mistobaan commented 6 years ago

exactly, the endpoint is there but is not documented. I changed the title of the issue.

codefromthecrypt commented 8 months ago

So the health endpoint is mentioned here (linked below), but yeah, it is not well documented. We also use it for the HEALTHCHECK directive in Docker. I'll move this issue to the main repo, noting that there is an emerging https://github.com/openzipkin/zipkin-helm which should become the primary source of info on k8s stuff.

https://github.com/openzipkin/zipkin/tree/master/zipkin-server#endpoints

codefromthecrypt commented 8 months ago

So, in summary, we should probably converge on a practice before documenting one, but gut feel is that adapting the one from our Dockerfile is not a bad start. We probably need some advice to clarify the lack of startup and liveness probes in the ecosystem, e.g. whether that's a feature or a bug. cc @mfordjody @optional303


To begin, /health is a composite status of the health of zipkin's dependencies. For example, if zipkin is configured for stackdriver or kafka and either connection doesn't work, /health will return a non-200 status code.

Here is the text on our HEALTHCHECK in Docker, which I acknowledge is disabled for k8s, but it carries the same basic info:

# We use start period of 30s to avoid marking the container unhealthy on slow or contended CI hosts.
#
# If in production, you have a 30s startup, please report to https://gitter.im/openzipkin/zipkin
# including the values of the /health and /info endpoints as this would be unexpected.
HEALTHCHECK --interval=5s --start-period=30s --timeout=5s CMD ["docker-healthcheck"]

https://github.com/openzipkin/zipkin/blob/master/docker/Dockerfile#L66-L70

From https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/

A common pattern for liveness probes is to use the same low-cost HTTP endpoint as for readiness probes, but with a higher failureThreshold.
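As a rough sketch of that pattern applied to zipkin, both probes could point at /health on port 9411, with the liveness probe tolerating more consecutive failures. The periodSeconds and failureThreshold values here are illustrative assumptions, not tested recommendations:

```yaml
# Container-level fragment; thresholds are assumptions for illustration only.
readinessProbe:
  httpGet:
    path: /health
    port: 9411
  periodSeconds: 5
  failureThreshold: 3   # marked not ready after ~15s of consecutive failures
livenessProbe:
  httpGet:
    path: /health
    port: 9411
  periodSeconds: 5
  failureThreshold: 12  # restarted only after ~60s of consecutive failures
```

One thing to weigh: because /health is a composite of dependency health, a liveness probe on it would restart zipkin when, say, kafka is unreachable, which may or may not be what you want.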

The incubating zipkin-helm chart in this org seems to have defined readiness, but not liveness, adapted from the docker HEALTHCHECK and notably missing the start period:

          readinessProbe:
            httpGet:
              path: /health
              port: 9411
            initialDelaySeconds: 5
            periodSeconds: 5

https://github.com/openzipkin/zipkin-helm/blob/475ebd84f98992d423ef5d417756c388c9cdfb68/charts/zipkin/templates/deployment.yaml#L71-L76
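One possible way to carry the 30s start period from the Docker HEALTHCHECK over to k8s is a startupProbe. This is only a sketch: the 5s period mirrors the HEALTHCHECK interval, and the failureThreshold of 6 is an assumption chosen so the total allowance comes to 30s.

```yaml
# Container-level fragment; values are assumptions derived from the Docker HEALTHCHECK.
startupProbe:
  httpGet:
    path: /health
    port: 9411
  periodSeconds: 5
  failureThreshold: 6  # 6 x 5s = up to 30s for the container to start
```

Kubernetes holds off liveness and readiness checks until the startup probe succeeds, so this would replace a fixed initialDelaySeconds rather than add to it.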

The setup above is consistent with a few other helm charts including https://github.com/radius-project/radius/blob/21b25ddf265f0464e4641b8c79cff61a4f9badd0/deploy/monitoring/zipkin-mem.yaml#L19 and https://github.com/apache/dubbo-kubernetes/blob/c4f2898e4eacd978c780bec79989a465f3e5a9dd/deploy/kubernetes/zipkin.yaml#L91-L96

That said, Financial Times is using a TCP socket for liveness and /health for readiness, defaulting the startup delay (initialDelaySeconds) to 200:

          livenessProbe:
            initialDelaySeconds: {{ .Values.ui.probeStartupDelay }}
            tcpSocket:
              port: {{ .Values.ui.queryPort }}
          readinessProbe:
            initialDelaySeconds: {{ .Values.ui.probeStartupDelay }}
            httpGet:
              path: /health
              port: {{ .Values.ui.queryPort }}

https://github.com/Financial-Times/zipkin-helm/blob/0ae00a2e2be986b58de30475601ea5dc686ea0fd/templates/zipkin-ui.yaml#L29-L37

codefromthecrypt commented 7 months ago

Spring Boot usage is an internal detail, so we wouldn't use their mappings or rely on how they do things, like via an event bus, which is TMI. That said, the fact that Boot explicitly uses different HTTP paths for liveness and readiness is interesting and useful research: https://spring.io/blog/2020/03/25/liveness-and-readiness-probes-with-spring-boot