omniscale / imposm3

Imposm imports OpenStreetMap data into PostGIS
http://imposm.org/docs/imposm3/latest/
Apache License 2.0
711 stars 156 forks

Run Analyze between create geohash index and clustering itself #166

Closed lubojr closed 4 years ago

lubojr commented 6 years ago

Context

This is not a bug but a suggestion for an improvement. When imposm -optimize is used, a geohash index is created and the table is clustered on that index. I think it would be worthwhile to also run ANALYZE between the index creation and the clustering itself. ANALYZE is a short-duration operation, so running it a second time is easily justified for large tables like osm_buildings.

After the clustering is done and after a deploy (rotation of tables), I check index usage with this query:

```sql
SELECT indexrelname, idx_scan, idx_tup_read, idx_tup_fetch
FROM pg_stat_user_indexes WHERE schemaname = 'public';
```

In my case, all the smaller tables had their geohash index used once, with the number of tuples read and fetched equal to the number of rows in the table:

```
osm_waterways_geom_geohash  | 1 | 13836389 | 13836389
osm_places_geom_geohash     | 1 |  3728712 |  3728712
osm_landusages_geom_geohash | 1 | 22426817 | 22426817
```

But the indexes on the two largest tables were not used:

```
osm_buildings_geom_geohash  | 0 | 0 | 0
osm_minorroads_geom_geohash | 0 | 0 | 0
```

The log at /var/log/postgresql shows no errors during the optimize run. So a sequential scan was used for clustering instead of an index scan, even though I have random_page_cost = 1.1 in postgresql.conf.

I then manually dropped the index, ran ANALYZE, created the index again, ran ANALYZE, and clustered the table on that index:

```sql
CREATE INDEX osm_minorroads_geom_geohash
    ON osm_minorroads (ST_GeoHash(ST_Transform(ST_SetSRID(Box2D(geometry), 3857), 4326)));
ANALYZE osm_minorroads;
CLUSTER VERBOSE osm_minorroads_geom_geohash ON osm_minorroads;
```

This time the index scan was used, and it was about 4 times faster than the sequential scan and sort on osm_minorroads (50 minutes vs. 4 hours).

Possible Fix

Run the code that analyzes the table also between the CREATE INDEX and CLUSTER steps in database/postgis/postgis.go.
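As a minimal sketch of the proposed statement order (this is not imposm3's actual code; the function and the table/index naming scheme are illustrative, and the statements are returned as strings so the sketch needs no database connection):

```go
package main

import "fmt"

// optimizeStatements returns the SQL that -optimize could run for one
// table: create the geohash index, ANALYZE so the planner has fresh
// statistics for the new expression index, then cluster on it.
func optimizeStatements(table string) []string {
	idx := table + "_geom_geohash"
	return []string{
		fmt.Sprintf("CREATE INDEX %s ON %s (ST_GeoHash(ST_Transform(ST_SetSRID(Box2D(geometry), 3857), 4326)))", idx, table),
		// the ANALYZE in the middle is the change this issue suggests
		fmt.Sprintf("ANALYZE %s", table),
		fmt.Sprintf("CLUSTER %s ON %s", idx, table),
	}
}

func main() {
	for _, s := range optimizeStatements("osm_buildings") {
		fmt.Println(s)
	}
}
```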

Steps to Reproduce

  1. Full planet import of osm_buildings or osm_minorroads with -write -optimize and the example mapping
  2. -deploy
  3. Query: SELECT indexrelname, idx_scan, idx_tup_read, idx_tup_fetch FROM pg_stat_user_indexes WHERE schemaname = 'public';
  4. Check whether the largest tables were read sequentially (index not used). In my case they were. Then check whether an ANALYZE in between changes anything:
  5. DROP INDEX osm_minorroads_geom_geohash;
  6. ANALYZE osm_minorroads;
  7. CREATE INDEX osm_minorroads_geom_geohash ON osm_minorroads (ST_GeoHash(ST_Transform(ST_SetSRID(Box2D(geometry), 3857), 4326)));
  8. ANALYZE osm_minorroads;
  9. CLUSTER VERBOSE osm_minorroads_geom_geohash ON osm_minorroads; does it now use an index scan?

    Context

    This could speed up the optimize part of imposm on large tables.

    Your Environment

    • Version used: own build of the latest commit before the 0.6.0 alpha release, with LevelDB >1.21 support
    • PostgreSQL 9.5.12 with PostGIS 2.2.1
    • Operating System and version + spec: Ubuntu Linux 16.04, 64GB RAM, 2TB HDD (500GB occupied), 8 core processor
    • PostgreSQL settings: maintenance_work_mem = 16GB, random_page_cost = 1.1
    • Data size: osm_buildings: 52 GB (279 million rows), osm_minorroads: 28 GB (103 million rows), osm_landusages: 12 GB (22 million rows)

I could prepare a PR for this if desired.

olt commented 5 years ago

Sorry for the silence so far. I haven't found time to work on this but I'd like to keep this issue open.

lubojr commented 5 years ago

Sorry for closing then.

cquest commented 4 years ago

Hi @lubojr , I don't see why ANALYZE would change the CLUSTER process.

During an import where both write and optimize are done, the optimization step currently runs after all indexes have been created. This makes CLUSTER much slower, as reordering the table forces PostgreSQL to update all the existing indexes at the same time.

I suggest (when both write and optimize are done):

  1. CREATE the geohash INDEX + maybe ANALYZE if it really improves the clustering step
  2. CLUSTER the table
  3. DROP the geohash INDEX (this drop step is currently missing as well; keep the index only if you want to recluster later, for example with pg_repack)
  4. CREATE all other INDEX

This also reduces the required storage during CLUSTER, since CLUSTER normally makes a copy of the data AND all the indexes.
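Concretely, for one table the proposed order would look roughly like this (a SQL sketch only; osm_buildings and the geohash expression are taken from the discussion above, and the ANALYZE line is the optional step this issue is about):

```sql
-- 1. create only the geohash index needed for clustering
CREATE INDEX osm_buildings_geom_geohash
    ON osm_buildings (ST_GeoHash(ST_Transform(ST_SetSRID(Box2D(geometry), 3857), 4326)));
ANALYZE osm_buildings;  -- optional: give the planner fresh statistics
-- 2. rewrite the table in geohash order; only this one index has to be maintained
CLUSTER osm_buildings_geom_geohash ON osm_buildings;
-- 3. drop the clustering index (keep it only if you plan to recluster, e.g. with pg_repack)
DROP INDEX osm_buildings_geom_geohash;
-- 4. now create all the other indexes (unique id, geometry GiST, ...)
```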

cquest commented 4 years ago

I did a PR https://github.com/omniscale/imposm3/pull/211 to reorder CLUSTER/CREATE INDEX...

My extract import time is down from 3'45 to 2'40

lubojr commented 4 years ago

@cquest This is a great catch and a big improvement in time spent! Unfortunately I no longer have access to the server mentioned above, so I can't try the improvement on a full planet import (mainly the osm_buildings table), but it will probably be huge too. We shall see if this gets integrated into master soon.

cquest commented 4 years ago

I have a full planet import on my workstation, and just launched a comparison on the index effect on CLUSTER. Regarding reclustering, pg_repack is an option to consider. It does the same thing as CLUSTER (or VACUUM FULL) but online, with no lock. I'll give it a try ;)

The benefit of having data geographically clustered is real... it was the subject of one of my presentations at SOTM 2014 in Buenos Aires!

cquest commented 4 years ago

On my planet import, here is the difference when doing a CLUSTER on the osm_buildings table with and without index:

I think we can expect the same improvement on a full import.

cquest commented 4 years ago

More timings on a full planet osm_buildings table...

One more possible improvement is creating a table's indexes in parallel... this takes advantage of the data already being in cache. Creating indexes sequentially on each (large) table causes all the table data to be read from disk once per index (there are always at least two: the UNIQUE one and the geometry GiST one). I'll check how to implement that... but I'm really not familiar with Go (and its multithreading).

ANALYZE could be done after clustering and indexing (I've implemented it, but not tested yet).

lubojr commented 4 years ago

Solved by #211 - commit 3b6e6b38f4529ac468580040dff25a0b47a037c9