tidwall / tile38

Real-time Geospatial and Geofencing
https://tile38.com
MIT License

Follower Readiness Probe #608

Open iwpnd opened 3 years ago

iwpnd commented 3 years ago

Is your feature request related to a problem? Please describe.
We use Tile38 in Kubernetes and autoscale the followers as needed. This works like a charm, until it doesn't.

My current problem is the readinessProbe for the followers. If a follower receives traffic before it has caught up, it errors out. It hasn't been an issue so far, but the more data we process, the longer new followers take to catch up with the leader.

One solution is described here. But under no circumstances do I want to maintain a custom image alongside yours and execute Python scripts in the readinessProbe.

Describe the solution you'd like
If I wanted to stick to the image you provide, I would have to do something like:

readinessProbe:
  exec:
    command: ["sh", "-c", "wget -qO- http://follower-url/server | grep -q '\"caught_up\":true'"]
  initialDelaySeconds: 60

This will work as long as I do not screw up the url for wget.
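The exec check above boils down to: fetch the follower's SERVER stats over HTTP and succeed only if the JSON contains `"caught_up":true`. Below is a minimal sketch with that test factored into a function. It assumes, like the `grep` in the probe, that the JSON is compact (no space after the colon); `follower_caught_up`, `probe`, and the `FOLLOWER_URL` default are hypothetical names, not part of Tile38.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit 0) only if the SERVER stats JSON
# passed as $1 reports "caught_up":true. Assumes compact JSON, exactly
# like the grep-based readinessProbe above.
follower_caught_up() {
  printf '%s' "$1" | grep -q '"caught_up":true'
}

# Hypothetical default; point this at the follower's HTTP port.
FOLLOWER_URL=${FOLLOWER_URL:-http://localhost:9852}

# Fetch /server and apply the check. If wget fails (bad URL, follower
# down), return non-zero so the probe reports "not ready".
probe() {
  body=$(wget -qO- "$FOLLOWER_URL/server") || return 1
  follower_caught_up "$body"
}
```

Splitting the string match from the fetch makes the fragile part (the exact JSON text being grepped for) easy to check on its own, which is also what makes the exec approach brittle compared with a dedicated endpoint.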

Kubernetes also provides a readinessProbe with httpGet that would make this a little safer. If followers returned 200 on an endpoint only when "caught_up": true, Kubernetes could poll that endpoint periodically. Kubernetes treats any status code from 200 up to (but not including) 400 as success, and only then starts routing requests to the pod.

readinessProbe:
  httpGet:
    scheme: HTTP
    path: /healthz
    port: 9851
  initialDelaySeconds: 60
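To make the httpGet rule concrete, here is a small sketch that mirrors the kubelet's success check for a status code and shows how to fetch only the status code with curl. The function names are hypothetical, and the host/port in the commented example are assumptions taken from the probe above.

```shell
#!/bin/sh
# Kubernetes' httpGet probe counts any status in [200, 400) as success.
# probe_success mirrors that rule for a numeric status code in $1.
probe_success() {
  [ "$1" -ge 200 ] && [ "$1" -lt 400 ]
}

# Print only the HTTP status code of a URL; the body is discarded.
# curl prints 000 when it gets no response at all, which probe_success
# rejects like any other failure.
status_of() {
  curl -s -o /dev/null -w '%{http_code}' "$1"
}

# Example (assumed host/port): evaluate the endpoint the way the kubelet would.
# probe_success "$(status_of http://localhost:9851/healthz)" && echo ready
```

Because the decision is based purely on the status code, an endpoint that answers 200 when caught up and an error status otherwise needs no string matching on the probe side.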

Something similar can also be done in docker-compose using dokku/wait.
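Outside Kubernetes, the same idea is a polling loop: retry a health check until it passes or a limit is reached. A minimal sketch, assuming a container with `sh` and `curl` available; `wait_until` is a hypothetical helper, not something provided by Tile38 or dokku/wait.

```shell
#!/bin/sh
# Hypothetical helper: retry a command until it succeeds or attempts run out.
# Usage: wait_until MAX_TRIES SLEEP_SECONDS command [args...]
wait_until() {
  max=$1
  pause=$2
  shift 2
  i=1
  while [ "$i" -le "$max" ]; do
    if "$@"; then
      return 0
    fi
    # Skip the sleep after the final failed attempt.
    [ "$i" -lt "$max" ] && sleep "$pause"
    i=$((i + 1))
  done
  return 1
}

# Example (service name and port assumed): poll once per second, up to
# 120 times, until the follower's health response contains "ok":true.
# Checking the body works whether "not ready" is signalled by an error
# status (curl -f then fails) or by "ok":false in a 200 response.
# wait_until 120 1 sh -c 'curl -fsS http://tile38-follower:9852/healthz | grep -q "\"ok\":true"'
```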

Do you think it would make sense to include this in Tile38 and if so, can I give it a try? :)

Edit: /healthz would be better, agreed @stevelacy

stephenlacy commented 3 years ago

(Note: I'm the author of that post.) In my case I wanted to check a specific metric, num_objects; you would only need to check whether the follower is "caught up". I would suggest a standard /healthz path that returns 200 when online/ready and 500 when not.

tidwall commented 3 years ago

Makes sense to me.

tidwall commented 3 years ago

See PR #609.

tidwall commented 3 years ago

I just pushed a new release that adds the new HEALTHZ command.

iwpnd commented 3 years ago

Thanks again for pushing this so fast @tidwall! Today I finally got around to testing it, but failed miserably. I guess I should've tested it locally first. :)

So now I tested it locally with this docker-compose.yml:

version: "3"

services:
  tile38-leader:
    image: tile38/tile38:1.24.2
    container_name: tile38-leader
    command: tile38-server -vv -p 9851
    ports:
      - 9851:9851

  tile38-follower:
    image: tile38/tile38:1.24.2
    container_name: tile38-follower
    command: >
      /bin/sh -c 'mkdir -p tmp/data && \
                  echo "{\"follow_host\": \"tile38-leader\",\"follow_port\":9851}" > tmp/data/config && \
                  tile38-server -d tmp/data -vv -p 9852'
    ports:
      - 9852:9852

The logs tell me:

tile38-follower    | 2021/06/09 14:11:48 [INFO] caught up

But HEALTHZ returns

# follower
curl http://localhost:9852/HEALTHZ
>> {"ok":false,"err":"not caught up","elapsed":"170.2µs"}
# leader
curl http://localhost:9851/HEALTHZ
>> {"ok":true,"elapsed":"170.2µs"}
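When debugging a mismatch like this, it helps to reduce the HEALTHZ body to a one-line verdict that includes the `err` field. A rough sketch using `grep`/`sed` instead of a JSON parser, so it assumes compact JSON and an `err` string without escaped quotes; `healthz_summary` is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: summarize a HEALTHZ JSON body passed as $1.
# Prints "ready" (exit 0) when it contains "ok":true; otherwise prints
# "not ready: <err>" (exit 1), falling back to "unknown" if no err field.
healthz_summary() {
  if printf '%s' "$1" | grep -q '"ok":true'; then
    echo "ready"
    return 0
  fi
  err=$(printf '%s' "$1" | sed -n 's/.*"err":"\([^"]*\)".*/\1/p')
  echo "not ready: ${err:-unknown}"
  return 1
}
```

Applied to the two responses shown above, it prints `ready` for the leader and `not ready: not caught up` for the follower.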

Am I missing something here?

tidwall commented 3 years ago

I just fixed the issue and pushed a new build.

iwpnd commented 3 years ago

Tested, it works. Thanks a lot! 👍 🚀

tidwall commented 3 years ago

You're welcome :)