pelias / docker

Run the Pelias geocoder in docker containers, including example projects.

"docker compose up" leads to incorrect state - Pelias API depends on Elasticsearch #312

Closed langbein-daniel closed 1 year ago

langbein-daniel commented 1 year ago

Describe the bug

I have created a project based on the Germany Docker Compose project.

The error occurs when starting the services, or is at least related to the startup order.

If I start the project with docker compose up, all services start at the same time, and e.g. the API container might be running before the Elasticsearch container. When I then query the Pelias API, not all of the (previously imported) data is there: in my example, two layers are missing, no matter how long I keep the services running.

If I first start Elasticsearch and then run docker compose up, all data is there.

I do not know where the error comes from. It could be that the API container, when it starts before Elasticsearch, doesn't see all the layers and assumes that only some default layers exist.

Steps to Reproduce

The following gives a more detailed explanation of the discovered bug.

I have created a git repository containing the modified pelias.json, a docker-compose.yml with fewer services (to narrow the problem down) and a README file. Please view the full README to see the exact steps to reproduce the error: https://github.com/langbein-daniel/pelias-docker-bug.

The README contains a bash script that imports the data (a GTFS feed into two custom layers, stop and station) and shows that by first starting Elasticsearch and then running docker compose up, everything works:

# set -e
# Start only Elasticsearch in the background.
sudo docker compose up -d elasticsearch

# Wait until Elasticsearch reports a green cluster status.
while ! curl --fail 'http://localhost:9200/_cluster/health?wait_for_status=green&timeout=1s' &>/dev/null; do
  echo "Waiting for Elasticsearch ..."
  sleep 2s
done

# Create the Elasticsearch index.
sudo docker compose run --rm schema ./bin/create_index
# This imports data into two new layers, stop and station.
sudo docker compose run --rm gtfs   ./start.sh
sleep 10s
# Start the remaining services.
sudo docker compose up

# Then open http://localhost:4000/v1/search?text=n%C3%BCrnberg&lang=en&layers=stop,station
#   No error about the layers. Just about missing services
#   which were removed from docker-compose.yml to minimize the example.
#    "errors": [
#      "getaddrinfo ENOTFOUND placeholder"
#    ],

And it shows that by starting all services at once, the custom layers are reported missing:

sudo docker compose up
# Then open http://localhost:4000/v1/search?text=n%C3%BCrnberg&lang=en&layers=stop,station
#    "errors": [
#      "'stop' is an invalid layers parameter. Valid options: coarse,address,venue,street,country,macroregion,region,county,localadmin,locality,borough,neighbourhood,continent,empire,dependency,macrocounty,macrohood,microhood,disputed,postalcode,ocean,marinearea",
#      "'station' is an invalid layers parameter. Valid options: coarse,address,venue,street,country,macroregion,region,county,localadmin,locality,borough,neighbourhood,continent,empire,dependency,macrocounty,macrohood,microhood,disputed,postalcode,ocean,marinearea"
#    ],

Expected behavior

If the problem is that the API server has started too early, it should wait until Elasticsearch is up and then properly detect all layers.

Environment (please complete the following information):

Workaround

As a workaround, one can add this to the Elasticsearch service:

    healthcheck:
      # mark Elasticsearch as healthy only once the cluster status is green
      test: curl --fail 'http://localhost:9200/_cluster/health?wait_for_status=green&timeout=1s' || exit 1
      interval: 10s
      retries: 5
      start_period: 20s
      timeout: 5s

and this to the API service:

    depends_on:
      elasticsearch:
        condition: service_healthy

With these two changes, one can start the project with just docker compose up and all data/layers are detected.
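
For context, here is roughly how the two fragments fit together in the compose file (existing service definitions abbreviated):

    services:
      elasticsearch:
        # ... existing elasticsearch service definition ...
        healthcheck:
          test: curl --fail 'http://localhost:9200/_cluster/health?wait_for_status=green&timeout=1s' || exit 1
          interval: 10s
          retries: 5
          start_period: 20s
          timeout: 5s
      api:
        # ... existing api service definition ...
        depends_on:
          elasticsearch:
            condition: service_healthy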

If you like the two changes, I'm happy to create a pull request and add this to the sample projects.

But maybe it is better to first find out whether it really is the API service that causes the problem, and to fix its startup procedure to wait for Elasticsearch.

langbein-daniel commented 1 year ago

Regarding the workaround: according to the Compose file format documentation, the long syntax for depends_on was removed with v3 and reintroduced with v3.9. So it is probably too early to use it in all examples. But I could test whether a simple depends_on: elasticsearch on the API service is enough.
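
For reference, the short syntax would look like this; note that it only guarantees that the Elasticsearch container has been started before the API container, not that Elasticsearch is actually ready to accept requests:

    services:
      api:
        # ... existing api service definition ...
        depends_on:
          - elasticsearch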

missinglink commented 1 year ago

This is related to the auto_discover flag: the pelias/api instance(s) will query Elasticsearch upon startup for a list of sources/layers, so that any custom data can be detected.

In your case there is a race condition: once auto-detect has been run, the result is cached so that features such as negative filters (i.e. all layers except one) can be performant.

It doesn't indicate that the data is missing from the index, just that the warmup wasn't complete at the time it was queried.

I don't recall the specifics, but you should be able to search for it in the pelias/api codebase, there is a config flag to disable it.

https://github.com/pelias/config/pull/127
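
If I remember correctly it is the api.targets.auto_discover flag, set in pelias.json roughly like this (sketch, other settings omitted):

    {
      "api": {
        "targets": {
          "auto_discover": true
        }
      }
    }

Setting it to false should skip the startup auto-detection, although custom sources/layers would then presumably need to be configured explicitly.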

Now, all that being said, the pelias elastic wait command exists for this purpose: it blocks until Elasticsearch is running and ready to accept traffic.

IIRC docker compose doesn't have an HTTP 'healthcheck' feature, so I'm not sure if it can be solved in a way that docker compose up 'just works'.

Simply checking that the process spun up and is running isn't sufficient to determine that it can accept traffic, so be careful that depends_on does what you expect it to.

missinglink commented 1 year ago

Oh sorry I just read your comment again and apparently it is possible now, so that's cool.

missinglink commented 1 year ago

One of the difficulties of maintaining a ~9yo project used by lots of different systems is maintaining backwards compatibility.

Can we improve this without breaking things for existing users?

I guess that question mainly asks whether we can stay on compose format v2 and whether depends_on and healthcheck are available for all versions greater than the min versions we publish on the readme.

langbein-daniel commented 1 year ago

Thanks for the detailed information about the auto_discover flag and the pelias elastic wait command. The pelias CLI is definitely helpful for importing data and starting up the Pelias services.

But there are still two things that come to mind:

1)

In the README of this project, there is a section https://github.com/pelias/docker#quickstart-build-script showing the commands to build and start it. When I first read it, I thought that pelias elastic wait was only needed for the subsequent pelias import all (as this executes scripts in Docker containers that import into the database). But if one stops all services (e.g. during a server restart) and then starts them with pelias compose up, the same race problem occurs.

So I think it would be good to either add a short section to the README on how to start the project again (without re-importing data) via pelias elastic start && pelias elastic wait && pelias compose up, or to adjust the pelias compose up command: it could check whether Elasticsearch is already running and, if not, run pelias elastic start && pelias elastic wait before running docker compose up.

2)

I was thinking about a setup using Docker "data-images". On a build machine, one runs the download/prepare/import steps and then creates "data-images" for each Pelias service, containing just the data required during normal operation. These can be uploaded to a container registry. On the production machine, only a Docker Compose installation and a docker-compose.yml file are required. The compose file is similar to those of the examples, but shorter, as only the services used during normal operation are included (not those needed only for download/prepare/import/DB setup), and the image tags are replaced with those of the data-images (see the sketch at the end of this comment).

This way a simple docker compose up --pull always on the production server is enough to pull the latest data-images and restart the services.

... If there wasn't the race problem. A simple solution to this is to restart the API service after some seconds. So it's not that big a problem. But having the healthcheck for Elasticsearch and the depends_on elasticsearch at the API service sounds cleaner to me.


Do you think that an example that builds Docker "data-images" (after running the usual pelias CLI commands), with an additional compose file for deployment (e.g. on a different server), would be a good addition to this project? I will try to implement this for a small region and would be happy to share.
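
Roughly, the deployment compose file I have in mind would look something like this (the registry and image names for the data-images are hypothetical placeholders; only two of the runtime services are shown):

    services:
      elasticsearch:
        # hypothetical data-image containing the already-populated index,
        # built and pushed from the build machine
        image: registry.example.com/pelias-data/elasticsearch:<data-version>
        healthcheck:
          test: curl --fail 'http://localhost:9200/_cluster/health?wait_for_status=green&timeout=1s' || exit 1
          interval: 10s
          retries: 5
      api:
        image: pelias/api:master
        ports: [ "4000:4000" ]
        depends_on:
          elasticsearch:
            condition: service_healthy
      # ... the other runtime services (placeholder, libpostal, ...) would follow the same pattern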

langbein-daniel commented 1 year ago

I guess that question mainly asks whether we can stay on compose format v2 and whether depends_on and healthcheck are available for all versions greater than the min versions we publish on the readme.

I will have a look into this.

missinglink commented 1 year ago

Do you think that an example that builds Docker "data-images" (after running the usual pelias CLI commands), with an additional compose file for deployment (e.g. on a different server), would be a good addition to this project?

It sounds like a good idea, but I fear that this would encourage developers to think that running docker compose in a production environment is a good idea. This workflow is simply provided for development.

For a production deployment I would recommend Kubernetes; we have the scripts available in another repo. For Elasticsearch snapshots people's preferences differ, but we use the S3 Elasticsearch plugin to save/load to Amazon S3.

missinglink commented 1 year ago

I'm also not sure how well suited a container registry is for storing very large data volumes. I guess people do this, but I don't have any first-hand experience with it, so I can't comment on how good or bad an idea that is in practice.

orangejulius commented 1 year ago

@langbein-daniel you might be interested in the api.targets.auto_discover_required config option added in https://github.com/pelias/api/pull/1624.

For backwards compatibility, it defaults to disabled. However, when turned on, the API will quit if it can't connect to Elasticsearch and fetch the type mapping on startup. Because docker-compose generally restarts services that quit, eventually Elasticsearch will come up, the API will successfully load the type mapping, and all will be well.
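
In pelias.json that would look roughly like this (other settings omitted):

    {
      "api": {
        "targets": {
          "auto_discover": true,
          "auto_discover_required": true
        }
      }
    }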

langbein-daniel commented 1 year ago

Thank you @missinglink for the further feedback. I agree that Kubernetes is better for production than a docker-compose file. But by creating data containers one gets closer to that setup without the requirement of a Kubernetes cluster. So I'll still give that setup a try ;)

Since the examples explain a Pelias setup in a practical way, it might be better to restrict them to the essentials as much as possible. But the offer to share my setup as a separate example still stands.

And thanks @orangejulius for pointing me to that config option. After enabling it, pelias compose up brought everything up and running. (The API service restarted about 4 times until Elasticsearch was up, but it works!)

I'll close this as completed as multiple ways to solve it have been discussed. Many thanks again for the help!


Does anything speak against setting api.targets.auto_discover_required to true in the examples that already use a recent enough version of the API service? The versions used in e.g. the Germany example are already recent enough.

Or what do you think about adding some notes about the startup to the README (pelias elastic start && pelias elastic wait && pelias compose up)?