pelias / pelias

Pelias is a modular open-source geocoder using Elasticsearch.
https://pelias.io
MIT License

pelias api docker image not able to recover from elasticsearch connection issues #925

Open InteNs opened 2 years ago

InteNs commented 2 years ago

Describe the bug: The Pelias API can't recover from an Elasticsearch cluster reboot and keeps returning bad-request errors on API searches. We have nightly server updates, and one or more servers may reboot at 02:45 AM. Every time this happens, the Pelias API fails to recover, either because it loses its connection to Elasticsearch or because it is restarted and initializes before Elasticsearch is up.

Because the process doesn't exit, Docker Swarm can't restart it, so the API is unusable until it is restarted manually.

Steps to Reproduce:

  1. Run the Pelias API Docker image and Elasticsearch
  2. Restart Elasticsearch
  3. Query the API and see the error

Expected behavior: The container fails and is restarted, so it keeps retrying the Elasticsearch endpoint.

Pastebin/Screenshots image

Additional context: a snippet from our Pelias config:

{
  "logger": {
    "level": "debug",
    "timestamp": false
  },
  "esclient": {
    "apiVersion": "7.5",
    "hosts": [
      { "host": "***redacted***" }
    ]
  },
  "elasticsearch": {
    "settings": {
      "index": {
        "refresh_interval": "10s",
        "number_of_replicas": "0",
        "number_of_shards": "1"
      }
    }
  }
}

References

https://github.com/pelias/api/issues/1419
https://github.com/pelias/api/issues/1591
https://github.com/pelias/docker/issues/49

orangejulius commented 2 years ago

Hi @InteNs,

Interesting, thanks for the report.

Are you able to reproduce this on our standard docker project? I would love to see the error messages in that case.

Here's what I tried, and everything went as I expected:

# start all the docker containers safely
$ pelias elastic start
$ pelias elastic wait
$ pelias compose up

# query the API, no errors
$ curl -s "localhost:4000/v1/autocomplete?text=portland" | jq .geocoding.errors
null

# shut down elasticsearch
$ pelias compose kill elasticsearch

# query errors as expected
$ curl -s "localhost:4000/v1/autocomplete?text=portland" | jq .geocoding.errors
[
  "No Living connections"
]

# start elasticsearch again and query. no errors
$ pelias elastic start
Starting pelias_elasticsearch ... done
$ pelias elastic wait
waiting for elasticsearch service to come up
.......Elasticsearch up!
$ curl -s "localhost:4000/v1/autocomplete?text=portland" | jq .geocoding.errors
null

Is it possible the only error you're seeing is the one from https://github.com/pelias/api/issues/1591 (the type mapping discovery)? Or maybe there is something different about Docker Swarm (I've never used it, but I would guess it's very similar to using Docker locally)?

The API failing to correctly handle Elasticsearch queries after a momentary connection disruption would indeed be a big problem, so we definitely want to help you figure out what's going on if the issue is as you describe.

InteNs commented 2 years ago

Hmm, it might indeed be related to type mapping. Multiple servers may reboot simultaneously when certain updates are made.

In that case https://github.com/pelias/api/issues/1591 is the actual issue. Was there a resolution for that?

The difference with this setup in Docker Swarm is that it's a bit more complicated to set up the "wait for elastic" part of the fix described in 1591.

That would involve a custom entrypoint with a shell script, which would require us to build our own image extending the official Pelias API image.

I'd much rather keep using the official image :)

orangejulius commented 2 years ago

There's no resolution for https://github.com/pelias/api/issues/1591 yet. Where we see it, we've been working around it with pelias elastic wait or similar logic.
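As a rough illustration of what such wait logic could look like in Node, here is a minimal sketch that polls a health check until it succeeds or a retry limit is reached. The names waitFor and elasticsearchIsUp and the ES_URL environment variable are hypothetical and not part of Pelias; the example check assumes Node 18+ (for the global fetch) and an Elasticsearch node reachable over HTTP.

```javascript
// Hypothetical helper, not part of Pelias: poll an async check until it
// succeeds, waiting delayMs between attempts, and give up after `retries`.
async function waitFor(check, { retries = 30, delayMs = 2000 } = {}) {
  for (let attempt = 1; attempt <= retries; attempt++) {
    try {
      await check();
      return attempt; // how many attempts it took
    } catch (err) {
      if (attempt === retries) {
        throw new Error(`gave up after ${retries} attempts: ${err.message}`);
      }
      // Sleep before the next attempt.
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}

// Example check (assumption: Node 18+ global fetch; ES_URL is a made-up
// variable name). Rejects if Elasticsearch is unreachable or unhealthy.
async function elasticsearchIsUp(baseUrl = process.env.ES_URL || 'http://localhost:9200') {
  const res = await fetch(`${baseUrl}/_cluster/health`);
  if (!res.ok) {
    throw new Error(`unexpected status ${res.status}`);
  }
}

// Possible usage, where startApi is a placeholder for whatever starts the service:
//   waitFor(elasticsearchIsUp).then(startApi);
```

Taking the check as a parameter keeps the polling loop independent of how Elasticsearch is actually probed.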

While the issue suggests adding retry logic, I was actually thinking another (possibly better?) solution would be to throw a fatal error and shut down the API if the type mapping discovery call fails.

Most systems people use to run the API these days (Kubernetes, Docker, etc) are happy to restart an API process that has shut down completely. And because the type mapping discovery check is only made once at API startup, there's no risk of something like an intermittent connection issue causing a cascading failure later on.

If you are not using any custom data, then you can probably ignore this error. But if you do need that discovery call to succeed, then maybe try adding some error handling around it that shuts down the API (process.exit(1) should be fine).
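A minimal sketch of that fail-fast idea, assuming the discovery call can be wrapped in a promise-returning function. The names runOrExit and discoverTypeMapping are hypothetical; this is not the actual Pelias API code, only an illustration of exiting non-zero so that Docker Swarm or a similar orchestrator restarts the container.

```javascript
// Hypothetical wrapper, not the actual Pelias code: run a one-time startup
// check and terminate the process with a non-zero code if it fails.
// `exit` and `log` are injectable so the behavior can be exercised in tests.
async function runOrExit(startupCheck, { exit = process.exit, log = console.error } = {}) {
  try {
    return await startupCheck();
  } catch (err) {
    log(`fatal: startup check failed: ${err.message}`);
    exit(1); // a non-zero exit lets the orchestrator restart the container
  }
}

// Possible usage, where discoverTypeMapping is a placeholder for the
// type mapping discovery call made once at API startup:
//   runOrExit(discoverTypeMapping);
```

Because the check runs only once at startup, a later intermittent connection problem would not trigger this exit path.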

I think we'd be happy to accept a PR for that if it works well for you.