Open InteNs opened 2 years ago
Hi @InteNs,
Interesting, thanks for the report.
Are you able to reproduce this on our standard docker project? I would love to see the error messages in that case.
Here's what I tried, and everything went as I expected:
# start all the docker containers safely
$ pelias elastic start
$ pelias elastic wait
$ pelias compose up
# query the API, no errors
$ curl -s "localhost:4000/v1/autocomplete?text=portland" | jq .geocoding.errors
null
# shut down elasticsearch
$ pelias compose kill elasticsearch
# query errors as expected
$ curl -s "localhost:4000/v1/autocomplete?text=portland" | jq .geocoding.errors
[
"No Living connections"
]
# start elasticsearch again and query. no errors
$ pelias elastic start
Starting pelias_elasticsearch ... done
$ pelias elastic wait
waiting for elasticsearch service to come up
.......Elasticsearch up!
$ curl -s "localhost:4000/v1/autocomplete?text=portland" | jq .geocoding.errors
null
Is it possible the only error you're seeing is the one from https://github.com/pelias/api/issues/1591 (the type mapping discovery)? Or maybe there is something different about Docker Swarm (I've never used it, but I would guess it's very similar to using Docker locally)?
The API failing to correctly handle Elasticsearch queries after a momentary connection disruption would indeed be a big problem, so we definitely want to help you figure out what's going on if the issue is as you describe.
Hmm, it might indeed be related to the type mapping: multiple servers may reboot simultaneously when certain updates are made.
In that case https://github.com/pelias/api/issues/1591 is the actual issue. Was there a resolution for that?
The difference with this setup in Docker Swarm is that it's a bit more complicated to set up the "wait for elastic" part of the fix described in 1591.
That would involve a custom entrypoint with a shell script, which would require us to build our own image extending the official Pelias API image.
I'd much rather keep using the official image :)
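For readers wondering what such a custom entrypoint might look like, here is a minimal sketch, not something taken from the Pelias images. It polls the Elasticsearch cluster health endpoint a bounded number of times, exits non-zero if Elasticsearch never comes up (so an orchestrator can restart the container), and otherwise hands over to whatever command it was given. The names `ES_URL`, `MAX_ATTEMPTS`, `DELAY_SECONDS`, `wait_for` and `es_ready` are hypothetical, the default URL is a guess at a typical service name, and the sketch assumes `curl` is available inside the container.

```shell
#!/bin/sh
# Hypothetical entrypoint wrapper: wait for Elasticsearch, then start the API.
# ES_URL, MAX_ATTEMPTS and DELAY_SECONDS are illustrative names, not Pelias settings.
ES_URL="${ES_URL:-http://elasticsearch:9200}"
MAX_ATTEMPTS="${MAX_ATTEMPTS:-30}"
DELAY_SECONDS="${DELAY_SECONDS:-2}"

# wait_for ATTEMPTS DELAY COMMAND...
# Runs COMMAND until it succeeds, sleeping DELAY seconds after each failure.
# Returns 0 on the first success, 1 after ATTEMPTS failures.
wait_for() {
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Succeeds once the cluster reports at least "yellow" health.
# Assumes curl exists in the image; -f turns HTTP errors into a failure.
es_ready() {
  curl -sf -o /dev/null --max-time 5 \
    "$ES_URL/_cluster/health?wait_for_status=yellow&timeout=1s"
}

# Only act when a command to run was passed as arguments.
if [ "$#" -gt 0 ]; then
  if ! wait_for "$MAX_ATTEMPTS" "$DELAY_SECONDS" es_ready; then
    echo "Elasticsearch not reachable at $ES_URL, exiting" >&2
    exit 1
  fi
  # Replace this shell with the real process so signals reach it directly.
  exec "$@"
fi
```

The wrapper would be invoked with the API's normal start command as its arguments. That command is not given in this thread, so it is left to the reader. Whether the script can be mounted into the official image rather than built into a derived one depends on the deployment.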
There's no resolution for https://github.com/pelias/api/issues/1591 yet. We've been working around it where we see it with pelias elastic wait or similar logic.
While the issue suggests adding retry logic, I was actually thinking another (possibly better?) solution would be to throw a fatal error and shut down the API if the type mapping discovery call fails.
Most systems people use to run the API these days (Kubernetes, Docker, etc) are happy to restart an API process that has shut down completely. And because the type mapping discovery check is only made once at API startup, there's no risk of something like an intermittent connection issue causing a cascading failure later on.
If you are not using any custom data, then you can probably ignore this error, but if you do need that discovery call to succeed, then maybe try adding some error handling here that shuts down the API (process.exit(1) should be fine).
I think we'd be happy to accept a PR for that if it works well for you.
Describe the bug The Pelias API can't recover from an Elasticsearch cluster reboot and keeps throwing bad request errors on API searches. We have nightly server updates, and one or more servers may reboot at 02:45 AM. Every time this happens, the Pelias API fails to recover, either because it loses its connection to Elasticsearch or because it is restarted and initialises before Elasticsearch is up.
Because the process doesn't exit, Docker Swarm can't restart it, and thus the API is unusable until a manual restart.
Expected behavior the container fails and restarts -> keeps retrying the elasticsearch endpoint
Additional context snippet from pelias config:
References
https://github.com/pelias/api/issues/1419 https://github.com/pelias/api/issues/1591 https://github.com/pelias/docker/issues/49