openstreetmap / operations

OSMF Operations Working Group issue tracking
https://operations.osmfoundation.org/

Switch to osm2pgsql expiry and osm2pgsql-replication #987

Closed pnorman closed 1 year ago

pnorman commented 1 year ago

One of the common rendering complaints is due to not dirtying tiles when relations change. Switching to osm2pgsql expiry will fix this, as well as allow us to use osm2pgsql-replication, significantly simplifying tile replication scripts. I was setting up a client machine and I was astonished how simple it is these days.

osm2pgsql expiry for the pgsql backend runs in hybrid mode, which expires every tile inside a multipolygon smaller than full_area_limit and only the tiles along the boundary of a larger one. The default value for full_area_limit is 20000, but this can be changed with --expire-bbox-size. This means that if someone makes a tag edit to the US boundary, all the tiles along the edge will be expired, but not the whole of the US.
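The hybrid decision can be sketched in a few lines of shell. This is only an illustration, not osm2pgsql's code: the function name expire_strategy is hypothetical, and it assumes the limit is compared against the width and height of the polygon's bounding box (as the option name --expire-bbox-size suggests), with all values being integers in the same units as full_area_limit.

```shell
# Hypothetical helper illustrating hybrid expiry (not part of osm2pgsql).
# $1 = bounding-box width, $2 = bounding-box height,
# $3 = limit (assumed default 20000, matching full_area_limit).
expire_strategy() {
    width=$1
    height=$2
    limit=${3:-20000}
    # Assumption: a polygon counts as "large" if either side exceeds the limit.
    if [ "$width" -gt "$limit" ] || [ "$height" -gt "$limit" ]; then
        echo boundary-only
    else
        echo full-area
    fi
}
```

Under these assumptions, expire_strategy 5000 8000 prints full-area, while a country-sized box such as expire_strategy 5000000 3000000 prints boundary-only.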

osm2pgsql comes with osm2pgsql-replication, which stores all state in the database.

The setup command is just osm2pgsql-replication init -d gis and it will determine the date from the data in the database. On a new import it would be osm2pgsql-replication init -d gis --osm-file planet-latest.osm.pbf.

We need a script to take the tile list and touch the relevant files to indicate they need re-rendering. Switch2OSM documents this. Adapting their example slightly, the following would be /usr/local/bin/expire-tiles

#!/bin/sh
set -e
render_expired --map=default --touch-from=13 --min-zoom=13 --max-zoom=19 -s /var/run/renderd/renderd.sock < /var/lib/replicate/dirty_tiles.txt
rm /var/lib/replicate/dirty_tiles.txt

The key is --touch-from. When expiring tiles at that zoom or higher, render_expired touches them to indicate they are stale. We should consider adjusting these zoom settings later.
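When revisiting the zoom settings, it may help to see how many tiles are expired at each zoom. The following is a minimal sketch, assuming the dirty tile list has one tile per line in zoom/x/y form; the function name count_tiles_by_zoom is hypothetical.

```shell
# Hypothetical helper: count expired tiles per zoom level.
# $1 = path to a tile list with one "zoom/x/y" entry per line.
count_tiles_by_zoom() {
    # Split on "/", ignore malformed lines, tally by the zoom field,
    # then sort numerically so zoom levels appear in order.
    awk -F/ 'NF == 3 { count[$1]++ } END { for (z in count) print z, count[z] }' "$1" | sort -n
}
```

It would have to run before the expire script removes the list, for example count_tiles_by_zoom /var/lib/replicate/dirty_tiles.txt, and prints one "zoom count" pair per line.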

Ignoring error handling and logging, the command that needs to be run is

osm2pgsql-replication update -d gis --post-processing /usr/local/bin/expire-tiles -- --log-progress=false --number-processes=1 --expire-tiles=13-19 --expire-output=/var/lib/replicate/dirty_tiles.txt

Settings like hstore, multi-geometry, flat nodes, and style are all stored in the DB by osm2pgsql and are not required when running with --append.

This will download up to 500MB of diffs (or another limit if --max-diff-size is set), apply them to the DB, store the tile list, and run post-processing.

To run this on a regular schedule, Switch2OSM recommends a cron job, but osm2pgsql recommends a systemd service. A systemd service is the better fit for this.

For that we'd create /etc/systemd/system/osm2pgsql-update.service

[Unit]
Description=Keep osm2pgsql database up-to-date

[Service]
WorkingDirectory=/tmp
ExecStart=osm2pgsql-replication update -d gis --post-processing /usr/local/bin/expire-tiles -- --log-progress=false --number-processes=1 --expire-tiles=13-19 --expire-output=/var/lib/replicate/dirty_tiles.txt
StandardOutput=append:/var/log/osm2pgsql-updates.log
User=_renderd
Type=simple
Restart=on-failure
RestartSec=5min

And a timer in /etc/systemd/system/osm2pgsql-update.timer

[Unit]
Description=Trigger an osm2pgsql database update

[Timer]
OnBootSec=10
OnUnitActiveSec=30s

[Install]
WantedBy=timers.target

Note for anyone doing this themselves: you'd also have to enable the timer and start it the first time. See the osm2pgsql docs.
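Enabling and starting the timer could look roughly like the following. This is a sketch rather than the documented procedure: the function name enable_update_timer and the SYSTEMCTL override (useful for a dry run) are hypothetical, and the commands need root privileges on a real machine.

```shell
# Hypothetical helper: reload unit files, then enable and start the timer.
# SYSTEMCTL can be overridden (e.g. SYSTEMCTL=echo) to preview the commands.
enable_update_timer() {
    ctl=${SYSTEMCTL:-systemctl}
    # Pick up the newly written .service and .timer files.
    "$ctl" daemon-reload
    # --now enables the timer at boot and starts it immediately.
    "$ctl" enable --now osm2pgsql-update.timer
}
```

Running systemctl list-timers afterwards is one way to check that the timer is scheduled.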

This will eliminate replicate.erb, expire-tiles.erb, expire.rb, and expire-tiles-single.

Possible issues and notes

tomhughes commented 1 year ago

I think we also need to include --multi-geometry, --hstore and --tag-transform-script=/srv/tile.openstreetmap.org/styles/default/openstreetmap-carto.lua in the osm2pgsql options as none of those seem to be preserved in the properties table in the database?

lonvia commented 1 year ago

That's correct. All arguments that are only for the pgsql output still need to be given manually to osm2pgsql-replication.
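Putting the two comments above together, the update command with the pgsql-output options passed explicitly might be assembled as below. This is a sketch built only from the flags and paths quoted in this thread, not the configuration that was actually deployed; the function name update_command is hypothetical and it only prints the command instead of running it.

```shell
# Hypothetical helper: print the update command with the pgsql-output
# options (not stored in the database) passed explicitly after "--".
update_command() {
    style_script=/srv/tile.openstreetmap.org/styles/default/openstreetmap-carto.lua
    echo osm2pgsql-replication update -d gis \
        --post-processing /usr/local/bin/expire-tiles -- \
        --log-progress=false --number-processes=1 \
        --multi-geometry --hstore \
        "--tag-transform-script=$style_script" \
        --expire-tiles=13-19 \
        --expire-output=/var/lib/replicate/dirty_tiles.txt
}
```

Everything after the bare -- is handed through to osm2pgsql, which is why the extra options go there.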

tomhughes commented 1 year ago

This has now been deployed on balerion, and comparison with bowser shows that it has increased the amount of rendering, which is expected now that we're dirtying more tiles.

pnorman commented 1 year ago

balerion's CPU usage is about 2x that of bowser, with more high peaks when some relations get touched. There is not a huge impact on disk utilization. IO pressure is approximately double, but it's <2% in both cases, so the disks are nowhere near maxed out.

I think all the servers have sufficient capacity to take the increased load.

tomhughes commented 1 year ago

This has now been rolled out across all eight servers.