openstreetmap / chef

Chef configuration management repo for configuring & maintaining the OpenStreetMap servers.
Apache License 2.0

Recluster rendering servers #86

Closed pnorman closed 8 years ago

pnorman commented 8 years ago

According to @cquest the OSM FR servers gain 25-50% rendering throughput when they recluster after about a year, by reducing table and index bloat. This can be done without a full outage, only stopping updates, but we probably want to wait until #79 is done to increase capacity. The reclustering depends on database IO and CPU, which are not maxed out.

The overall plan is to create a new copy of each table, build new indexes, then replace the old table. Because update frequency is more important for osm.org than for other hosts, I'd recommend doing it slightly differently: instead of reclustering all the tables at once, do one table, resume updates and let them catch up, do another, etc.

Starting with the points table and progressing by table size minimizes the disk usage. I think there's enough free space that it doesn't matter, but this is a best practice.
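
To get that order, something like the following should list the four rendering tables smallest-first (a sketch; the size shown includes indexes and TOAST):

    SELECT relname,
           pg_size_pretty(pg_total_relation_size(relid)) AS total_size
      FROM pg_stat_user_tables
     WHERE relname IN ('planet_osm_point', 'planet_osm_roads',
                       'planet_osm_line', 'planet_osm_polygon')
     ORDER BY pg_total_relation_size(relid);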


Process

My recommendation is that the following be done on both servers, starting with whichever has gone the longest since its initial import.

  1. Record the results of \dt+ and \di+. It would also be useful to have the results of the following SQL for future planning purposes

    SELECT CORR(page,geohash)
     FROM (
       SELECT 
           (ctid::text::point)[0] AS page,
           rank() OVER (ORDER BY St_GeoHash(st_transform(way,4326))) AS geohash
         FROM planet_osm_point
       ) AS s; -- area server result .93, takes 461s
    SELECT CORR(page,geohash)
     FROM (
       SELECT
           (ctid::text::point)[0] AS page,
           rank() OVER (ORDER BY St_GeoHash(st_transform(way,4326))) AS geohash
         FROM planet_osm_roads
       ) AS s; -- area server result .58, takes 119s
    SELECT CORR(page,geohash)
     FROM (
       SELECT
           (ctid::text::point)[0] AS page,
           rank() OVER (ORDER BY St_GeoHash(st_transform(way,4326))) AS geohash
         FROM planet_osm_line
       ) AS s;
    SELECT CORR(page,geohash)
     FROM (
       SELECT 
           (ctid::text::point)[0] AS page,
           rank() OVER (ORDER BY St_GeoHash(st_transform(way,4326))) AS geohash
         FROM planet_osm_polygon
       ) AS s;
  2. Stop updates and make a backup of the state file.
  3. Start by creating a schema to do work in

    CREATE SCHEMA IF NOT EXISTS recluster;
  4. Starting with the smallest table, recluster it into the new schema.

    \timing
    SET search_path TO recluster,"$user",public;
    CREATE TABLE planet_osm_point AS
     SELECT * FROM public.planet_osm_point
       ORDER BY ST_GeoHash(ST_Transform(ST_Envelope(way),4326),10) COLLATE "C";
  5. Create indexes. The indexes here are the recommended ones for OpenStreetMap Carto. If you want to use others you can.

    \timing
    SET search_path TO recluster,"$user",public;
    CREATE INDEX planet_osm_point_place
     ON planet_osm_point USING GIST (way)
     WHERE place IS NOT NULL AND name IS NOT NULL;
    CREATE INDEX planet_osm_point_index
     ON planet_osm_point USING GIST (way);
    
    CREATE INDEX planet_osm_point_pkey
     ON planet_osm_point (osm_id);
  6. Replace the table in the public schema in a transaction, keeping the old one

    CREATE SCHEMA IF NOT EXISTS backup;
    BEGIN;
    ALTER TABLE public.planet_osm_point
     SET SCHEMA backup;
    ALTER TABLE recluster.planet_osm_point
     SET SCHEMA public;
    COMMIT;
  7. Verify that tiles are still rendering
  8. Drop the old table

    DROP TABLE backup.planet_osm_point;
  9. Resume updates. When updates have caught up, repeat for the other three rendering tables
  10. For planet_osm_roads

    \timing
    SET search_path TO recluster,"$user",public;
    CREATE TABLE planet_osm_roads AS
      SELECT * FROM public.planet_osm_roads
        ORDER BY ST_GeoHash(ST_Transform(ST_Envelope(way),4326),10) COLLATE "C";
    
    CREATE INDEX planet_osm_roads_admin
      ON planet_osm_roads USING GIST (way)
      WHERE boundary = 'administrative';
    CREATE INDEX planet_osm_roads_roads_ref
      ON planet_osm_roads USING GIST (way)
      WHERE highway IS NOT NULL AND ref IS NOT NULL;
    CREATE INDEX planet_osm_roads_admin_low
      ON planet_osm_roads USING GIST (way)
      WHERE boundary = 'administrative' AND admin_level IN ('0', '1', '2', '3', '4');
    CREATE INDEX planet_osm_roads_index
      ON planet_osm_roads USING GIST (way);
    
    CREATE INDEX planet_osm_roads_pkey
      ON planet_osm_roads (osm_id);
    
    BEGIN;
    ALTER TABLE public.planet_osm_roads
      SET SCHEMA backup;
    ALTER TABLE recluster.planet_osm_roads
      SET SCHEMA public;
    COMMIT;

    Test, then

    \timing
    DROP TABLE backup.planet_osm_roads;
  11. For planet_osm_line, resume updates, wait for updates to catch up, then

    \timing
    SET search_path TO recluster,"$user",public;
    CREATE TABLE planet_osm_line AS
      SELECT * FROM public.planet_osm_line
        ORDER BY ST_GeoHash(ST_Transform(ST_Envelope(way),4326),10) COLLATE "C";
    
    CREATE INDEX planet_osm_line_ferry
      ON planet_osm_line USING GIST (way)
      WHERE route = 'ferry';
    CREATE INDEX planet_osm_line_river
      ON planet_osm_line USING GIST (way)
      WHERE waterway = 'river';
    CREATE INDEX planet_osm_line_name
      ON planet_osm_line USING GIST (way)
      WHERE name IS NOT NULL;
    CREATE INDEX planet_osm_line_index
      ON planet_osm_line USING GIST (way);
    
    CREATE INDEX planet_osm_line_pkey
      ON planet_osm_line (osm_id);
    
    BEGIN;
    ALTER TABLE public.planet_osm_line
      SET SCHEMA backup;
    ALTER TABLE recluster.planet_osm_line
      SET SCHEMA public;
    COMMIT;

    Test, then

    DROP TABLE backup.planet_osm_line;
  12. Polygons will take the longest. Resume updates and let them catch up, then stop them and

    \timing
    SET search_path TO recluster,"$user",public;
    CREATE TABLE planet_osm_polygon AS
      SELECT * FROM public.planet_osm_polygon
        ORDER BY ST_GeoHash(ST_Transform(ST_Envelope(way),4326),10) COLLATE "C";
    
    CREATE INDEX planet_osm_polygon_military
      ON planet_osm_polygon USING GIST (way)
      WHERE landuse = 'military';
    CREATE INDEX planet_osm_polygon_nobuilding
      ON planet_osm_polygon USING GIST (way)
      WHERE building IS NULL;
    CREATE INDEX planet_osm_polygon_name
      ON planet_osm_polygon USING GIST (way)
      WHERE name IS NOT NULL;
    CREATE INDEX planet_osm_polygon_way_area_z6
      ON planet_osm_polygon USING GIST (way)
      WHERE way_area > 59750;
    
    CREATE INDEX planet_osm_polygon_index
      ON planet_osm_polygon USING GIST (way);
    
    CREATE INDEX planet_osm_polygon_pkey
      ON planet_osm_polygon (osm_id);
    
    BEGIN;
    ALTER TABLE public.planet_osm_polygon
      SET SCHEMA backup;
    ALTER TABLE recluster.planet_osm_polygon
      SET SCHEMA public;
    COMMIT;

    Test, then

    \timing
    DROP TABLE backup.planet_osm_polygon;
  13. Resume rendering and clean up with

    DROP SCHEMA recluster;
    DROP SCHEMA backup;
  14. Record the results of \dt+ and \di+ again.
  15. Verify that there is a speed increase, then do the other server.
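
    One SQL-side check (a sketch): re-run the correlation query from step 1 and compare with the values recorded there; a freshly reclustered table should come out close to 1.0. For example

    SELECT CORR(page,geohash)
     FROM (
       SELECT
           (ctid::text::point)[0] AS page,
           rank() OVER (ORDER BY St_GeoHash(st_transform(way,4326))) AS geohash
         FROM planet_osm_point
       ) AS s;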

Ref: http://paulnorman.ca/blog/2016/06/improving-speed-with-reclustering/

Other options

In case of a problem, a rollback can be done by restoring the old table from the backup schema.
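
A minimal sketch of that rollback, reversing the swap from step 6 (shown for the point table; substitute whichever table was being replaced):

    BEGIN;
    ALTER TABLE public.planet_osm_point
      SET SCHEMA recluster;
    ALTER TABLE backup.planet_osm_point
      SET SCHEMA public;
    COMMIT;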

If diffs are mistakenly restarted early, the state file needs to be reset and the diffs re-run.

Why not wait for a reimport?

The OpenStreetMap Carto Lua branch, which will require a reimport with hstore, is still in development. We have a few open issues to resolve before we can merge, and we are short on developer time for them. Once the Lua branch is merged we will still be releasing 2.x releases which work with the old database, to allow time to change over.

Doing a reimport with the current settings is in some ways better, but it requires either a full outage of the server, a fair amount of database disk space, or accepting the possibility of updates being down for an extended time[1], plus the certainty that updates will be stopped for about a day.

[1] If the old DB slim tables are dropped this saves room, but it stops any updates on the old DB

Time required

I'm running a test on the server for testing old-style multipolygons. It's got faster single-threaded performance and absurdly faster drives, but it should give an indication. I'll add times when it's done.

tomhughes commented 8 years ago

This seems to be in the wrong place, unless you're seriously suggesting that we turn all that into a chef recipe...

Personally I really don't want to get involved with doing something that insanely complicated and risky.

We have a third tile server coming hopefully in a matter of weeks at which point my plan was to reimport everything on all three servers anyway. I was assuming that would be with the new style but if it has to be with the old one then fine.

pnorman commented 8 years ago

Personally I really don't want to get involved with doing something that insanely complicated and risky.

Reclustering and reindexing is standard maintenance. In the future, will we be able to do it regularly?

zerebubuth commented 8 years ago

It sounds, from the above instructions, like this can be done "online" (apart from stopped diffs) and doesn't interrupt rendering, so could it potentially be automated?

If it's possible to write a script which does this, and incorporate that into a quarterly / bi-annual / annual run, then that would be great. It sounds like otherwise it's a lot of manual CLI work interspersed with periods of waiting.

Also, doing each table like this means it only requires the overhead of duplicating the largest table, right?

pnorman commented 8 years ago

It sounds, from the above instructions, like this can be done "online" (apart from stopped diffs) and doesn't interrupt rendering, so could it potentially be automated?

Yes, doing it offline is much simpler.

Also, doing each table like this means it only requires the overhead of duplicating the largest table, right?

Yes, if there's enough room and IO capacity I'd do all the tables at once in parallel, but this way reduces the extra space required.
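
A rough sketch of what that extra space works out to with the one-table-at-a-time approach (the largest rendering table plus its indexes; pg_total_relation_size includes both):

    SELECT pg_size_pretty(max(pg_total_relation_size(relid))) AS approx_headroom
      FROM pg_stat_user_tables
     WHERE relname IN ('planet_osm_point', 'planet_osm_roads',
                       'planet_osm_line', 'planet_osm_polygon');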

tomhughes commented 8 years ago

Well, there's no way that something that complicated and long-running can be run directly from chef, so the only way it could be done is by writing a script that chef installs and which runs from cron.

But it's so intricate that I'm really reluctant to run it automatically with no human supervision. I haven't read it all, but the fact that there are 13 steps and vast amounts of custom SQL scares the pants off me. What are the recovery measures if any step goes wrong? That would be my first question. What is the risk of breaking something in a way such that there is no way to back out?

In any case there are all sorts of non-automatable things in there like "check tiles are still rendering" and "wait for updates to complete".

pnorman commented 8 years ago

But it's so intricate that I'm really reluctant to run it automatically with no human supervision. I haven't read it all, but the fact that there are 13 steps and vast amounts of custom SQL scares the pants off me.

Most of the SQL is taken from openstreetmap-carto.

What are the recovery measures if any step goes wrong? That would be my first question. What is the risk of breaking something in a way such that there is no way to back out?

Recovery steps are there, but not clear enough.

Based on feedback, I'll write a script that handles the SQL. I think I can even handle locking in case updates were not stopped.

zerebubuth commented 8 years ago

Based on feedback, I'll write a script that handles the SQL. I think I can even handle locking in case updates were not stopped.

Great! Thanks.

We do run many "long-running" processes on various servers; database dumps, planet file conversions, etc... in an automated, regular fashion. So we just need to try and make sure that it interferes with the regular rendering as little as possible, and doesn't wedge itself too often.

I was wondering if it's possible to wait for osm2pgsql to finish, and stop it from running, by watching and writing its lock file? That would mean there's no need to disable its cron, which is what I usually do when I'm stopping it manually.

cquest commented 8 years ago

The first time I did a reclustering, it reduced I/O by 80% :)

Then I can see I/O slowly going up again because of fragmented data in postgresql pages. So I'm doing that every 6 months and usually get I/O divided by 2.

I'm now stopping the updates, reclustering one table, letting updates run again, then doing another one, etc... so the whole process is done online and updates are not stopped for too long.

Regarding slim tables, there is a huge benefit to reindexing planet_osm_ways... its GIN index gets larger than the data part!
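
A quick way to see the imbalance (a sketch; pg_indexes_size counts all indexes on the table, which for planet_osm_ways is dominated by the GIN index on the nodes column):

    SELECT pg_size_pretty(pg_relation_size('planet_osm_ways')) AS table_size,
           pg_size_pretty(pg_indexes_size('planet_osm_ways'))  AS index_size;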

I'm also planning to test pg_repack, which is supposed to do the same as CLUSTER but without requiring an ACCESS EXCLUSIVE lock on the table.

Don't forget the GRANT on the new tables ;)
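
For example (a sketch; "render" is just a placeholder for whatever role the rendering stack connects as):

    GRANT SELECT ON planet_osm_point, planet_osm_line,
                    planet_osm_roads, planet_osm_polygon TO render;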

One more step is an ext4 defrag of the postgresql files... it helps the kernel merge I/Os into chunks larger than 8KB pages, which even for many SSDs are very small units where they are not so efficient. This is theoretical, I've not checked the real benefit.

zerebubuth commented 8 years ago

Thanks! That's the first time I've heard of pg_repack and it looks really useful. It sounds like it might be an excellent option for regular online re-clustering. Please let us know how the testing goes.

pnorman commented 8 years ago

I was wondering if it's possible to wait for osm2pgsql to finish, and stop it from running, by watching and writing its lock file? That would mean there's no need to disable its cron, which is what I usually do when I'm stopping it manually.

I thought updates were a daemon, not a cron job?

Writing the lock file would stop it if you write it between runs. If you write it during a run, it'll do nothing since it'll clean up at the end.

Thanks! That's the first time I've heard of pg_repack and it looks really useful.

pg_repack won't work for the rendering tables as it requires a UNIQUE index on a NOT NULL column. It would be useful for the slim tables, and on the API database.

zerebubuth commented 8 years ago

I thought updates were a daemon, not a cron job?

Yup, you're right, it is. Although it's very simple so would be easy to convert to cron if we wanted. But it seems like there's no reason to at the moment.

Writing the lock file would stop it if you write it between runs. If you write it during a run, it'll do nothing since it'll clean up at the end.

What I had in mind was something like:

  1. Try to create osm2pgsql.lock. If it already exists, wait on it being deleted (with inotify or something) and retry until we're the one that created it.
  2. Do work, safe in the knowledge that osm2pgsql won't run.
  3. Delete the lock on exit.

The only, rather major, flaw is that this isn't how we run osm2pgsql at all! So it looks like table-level exclusive locking would work better, assuming that osm2pgsql will just wait for the lock rather than error.
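
If osm2pgsql does simply queue behind the lock, the locking side could be as small as this (a sketch; EXCLUSIVE mode still allows plain SELECTs, so rendering keeps reading while writes wait):

    BEGIN;
    LOCK TABLE public.planet_osm_point, public.planet_osm_line,
               public.planet_osm_roads, public.planet_osm_polygon
      IN EXCLUSIVE MODE;
    -- osm2pgsql writes to these tables now block until COMMIT/ROLLBACK;
    -- do the reclustering work for the table being replaced in this same transaction
    COMMIT;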

pg_repack won't work for the rendering tables as it requires a UNIQUE index on a NOT NULL column.

That's a shame. I'm almost tempted to see how long an ALTER TABLE ADD COLUMN tmp_id bigserial would take, but the answer is probably "too long". And dropping it afterwards might just leave a bunch of bloat in its place.
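
For reference, the hypothetical change being discussed would be something like this (a sketch; adding a bigserial column forces a full table rewrite because of its default, which is why it's probably "too long"):

    -- hypothetical: give pg_repack the UNIQUE NOT NULL key it requires
    ALTER TABLE planet_osm_polygon ADD COLUMN tmp_id bigserial;
    CREATE UNIQUE INDEX planet_osm_polygon_tmp_id ON planet_osm_polygon (tmp_id);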