`/health` endpoint improvements

jszwedko commented 3 years ago

Opening this as a tracking issue for improving the response of the (as yet unreleased) healthcheck API.

As part of the ongoing work to expose information from running Vector processes via an API, an initial /health endpoint was added. Right now it just returns a 200 OK.

This issue should be expanded to document exactly what the healthcheck API should return, but should serve as a placeholder until we focus on it.

[ ] #10556
[x] #9460
[x] #4249
[ ] #10555
[x] #9160
[x] #9469

afoninsky commented 3 years ago

Additional feedback from early adopters: it would be useful to specify a list of things to check, like "/health?inputs=kafka-test,xxx"

Use cases:

- Sometimes a health check is needed only for specific sources/sinks, especially with complex configs: the service shouldn't be reported as failed just because one specific downstream service is down.
- In k8s it's usually good practice to have different checks for readiness and liveness probes.
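
A rough Python sketch of how the `?inputs=` idea could be evaluated. This is not how Vector works today; the `inputs` parameter comes from the suggestion above, while `requested_components`, `selective_health` and the `component_status` mapping are hypothetical names used only for illustration.

```python
from urllib.parse import urlparse, parse_qs


def requested_components(url: str) -> list[str]:
    """Return the component IDs listed in a hypothetical ?inputs=a,b parameter."""
    query = parse_qs(urlparse(url).query)
    names: list[str] = []
    for value in query.get("inputs", []):
        # Each value may itself be a comma-separated list of component IDs.
        names.extend(part.strip() for part in value.split(",") if part.strip())
    return names


def selective_health(url: str, component_status: dict[str, bool]) -> bool:
    """Healthy only if every requested component is healthy.

    Without an ?inputs= parameter, every known component is checked.
    Unknown component IDs count as unhealthy (an assumption of this sketch).
    """
    wanted = requested_components(url) or list(component_status)
    return all(component_status.get(name, False) for name in wanted)
```

With this shape, a probe that only cares about `kafka-test` is unaffected by an unrelated sink being down, which is the first use case listed above.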

leebenson commented 3 years ago

For users seeing this for the first time: /health is currently enabled with the following in vector.toml:

[api]
  enabled = true

By default, this runs on http://localhost:8686/health. You can override that by setting bind = <ip>:<port> in the [api] section.

The full response (including headers, showing a JSON content-type) is:

HTTP/1.1 200 OK
content-type: application/json
content-length: 11
date: Fri, 02 Oct 2020 10:09:20 GMT

{"ok":true}
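
For anyone scripting against this, a minimal Python sketch of a client-side check. It assumes the default address and the `{"ok":true}` body shown above; `is_healthy` and `probe` are illustrative helper names, not part of Vector.

```python
import json
import urllib.request


def is_healthy(status_code: int, body: bytes) -> bool:
    """Interpret a /health response: HTTP 200 plus a JSON body of {"ok": true}."""
    if status_code != 200:
        return False
    try:
        payload = json.loads(body)
    except ValueError:
        # Not valid JSON (or not valid UTF-8): treat as unhealthy.
        return False
    return isinstance(payload, dict) and payload.get("ok") is True


def probe(url: str = "http://localhost:8686/health", timeout: float = 2.0) -> bool:
    """Fetch the endpoint; any network or HTTP error counts as unhealthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return is_healthy(resp.status, resp.read())
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False
```

Calling `probe()` against a host where the API is not enabled simply returns `False`, since the connection is refused.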

There are also two other healthchecks available via the GraphQL endpoint (/graphql), which can be executed in the built-in GraphQL playground that runs on http://localhost:8686/playground:

{
  health
}

Returning:

{
  "data": {
    "health": true
  }
}
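
Outside the playground, the same query can be sent as a plain HTTP request. The Python sketch below assumes the endpoint accepts the conventional GraphQL-over-HTTP form (a POST with a JSON body containing a `query` field); `build_request` and `health_from_response` are illustrative names.

```python
import json
import urllib.request

HEALTH_QUERY = "{ health }"


def build_request(url: str = "http://localhost:8686/graphql") -> urllib.request.Request:
    """Build a POST request carrying the health query as JSON."""
    body = json.dumps({"query": HEALTH_QUERY}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def health_from_response(body: bytes) -> bool:
    """Return True only if the response carries data.health == true and no errors."""
    payload = json.loads(body)
    if payload.get("errors"):
        return False
    return (payload.get("data") or {}).get("health") is True
```

To run it against a live instance, pass the request to `urllib.request.urlopen` and feed the response body to `health_from_response`.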

And:

subscription {
  heartbeat {
    utc
  }
}

Returning streaming heartbeats, by default every second:

{
  "data": {
    "heartbeat": {
      "utc": "2020-10-02T10:15:22.029304+00:00"
    }
  }
}
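
A consumer of the heartbeat stream could use the `utc` field to notice when heartbeats stop arriving. A small Python sketch, assuming messages shaped like the one above; the 5 second threshold and the helper names are arbitrary choices for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def heartbeat_age(message: dict, now: Optional[datetime] = None) -> timedelta:
    """Time elapsed since the heartbeat's utc timestamp."""
    now = now or datetime.now(timezone.utc)
    sent = datetime.fromisoformat(message["data"]["heartbeat"]["utc"])
    return now - sent


def is_stale(
    message: dict,
    now: Optional[datetime] = None,
    max_age: timedelta = timedelta(seconds=5),
) -> bool:
    """True if the most recent heartbeat is older than max_age."""
    return heartbeat_age(message, now) > max_age
```

Both timestamps must be timezone-aware for the subtraction to work; the `+00:00` offset in the payload takes care of that on the heartbeat side.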

I think the intention is for these to ultimately evolve into a range of health checks that can be probed against individual sources and sinks. The GraphQL interface provides a type-safe and broadly compatible HTTP interface, since it's just JSON requests and responses. There are also many GraphQL client libs for a range of platforms that can assert that queries match the schema at compile time or runtime, to provide additional safety to the caller.

If there's motivation for a simpler interface using HTTP GET URLs and simpler response codes, we could augment this with a similar interface for probing the same checks. I think the ground needs to settle in GraphQL first, though, since whatever interface we use there will be the canonical interface for any additional entrypoints we bolt on later.

We could also potentially negotiate on the request content type and serialise to formats other than JSON, if there's a use-case for it. With serde, that should just be a case of choosing another encoder.

jszwedko commented 3 years ago

Popped up in discord again: https://discord.com/channels/742820443487993987/746076283074773150/822033677805420566

Oloremo commented 2 years ago

It would be great to have the health check endpoint exposed without enabling the API endpoint for security reasons.

zamazan4ik commented 1 year ago

> It would be great to have the health check endpoint exposed without enabling the API endpoint for security reasons.

@jszwedko What do you think about this idea? We have the same concerns regarding healthchecks: we want to be able to enable the healthcheck endpoint without the API endpoint for security reasons. I guess adding a separate setting like:

[healthcheck]
enabled = true

would be a good starting point. Later it could be extended with other properties like liveness = true, readiness = true (see the information about it below).

Another possible improvement: separate the healthcheck into a liveness check and a readiness check (according to https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/).
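
To make the liveness/readiness distinction concrete, here is a hedged Python sketch of the decision logic only. Nothing here exists in Vector; `ProcessState`, `liveness` and `readiness` are hypothetical names, and returning 503 for "not ready" is an assumption of this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class ProcessState:
    """Hypothetical snapshot of what a health endpoint could know."""
    running: bool = True
    component_ok: dict[str, bool] = field(default_factory=dict)


def liveness(state: ProcessState) -> int:
    """Liveness: is the process itself up? Downstream failures are ignored."""
    return 200 if state.running else 503


def readiness(state: ProcessState, required: list[str]) -> int:
    """Readiness: up, and every required component currently healthy."""
    if not state.running:
        return 503
    ready = all(state.component_ok.get(name, False) for name in required)
    return 200 if ready else 503
```

Under this split, a failing sink would take an instance out of rotation via the readiness probe without causing the liveness probe to restart it.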

xenbyte commented 1 year ago

It would be nice if the healthcheck endpoint didn't return a plain 200 after a reload failed because of a misconfiguration. Instead, it could return something that tells us Vector is up and running but the last reload hit a misconfiguration.

This would help in setups where config files are generated and added to Vector in an automated fashion, and a misconfiguration on reload should not go unnoticed.
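
One possible shape for such a response, sketched in Python. The field names `last_reload_ok` and `last_reload_error` are made up for illustration and are not part of the current `{"ok":true}` response; which HTTP status should accompany a failed reload is deliberately left open here.

```python
import json
from typing import Optional


def health_body(process_up: bool, last_reload_error: Optional[str]) -> str:
    """Build a JSON body that separates "process is up" from "last reload worked"."""
    payload = {
        "ok": process_up,
        # Hypothetical field: None means the last reload (if any) succeeded.
        "last_reload_ok": last_reload_error is None,
    }
    if last_reload_error is not None:
        payload["last_reload_error"] = last_reload_error
    return json.dumps(payload)
```

A caller that only looks at `ok` keeps working unchanged, while automation that pushes generated configs can inspect the reload fields.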

itkovian commented 1 year ago

I would be very interested in a check where I can (at least) see if all the sinks are working properly.

mehta-ankit commented 7 months ago

@jszwedko any ETA on when we can have a /health endpoint in Vector?

bruceg commented 7 months ago

The /health endpoint actually already exists. It is referenced here, though admittedly the documentation does not give specifics of the response contents. See the issues referenced in the top post for more details.

vectordotdev / vector

`/health` endpoint improvements #4250