omniscale / imposm3

Imposm imports OpenStreetMap data into PostGIS
http://imposm.org/docs/imposm3/latest/
Apache License 2.0
711 stars 156 forks

Run Analyze between create geohash index and clustering itself #166

Closed lubojr closed 4 years ago

lubojr commented 6 years ago

Context

This is not a bug but a suggestion for an improvement. When imposm -optimize is used, a geohash index is created and the table is clustered on that index. I think it would be worthwhile to also run ANALYZE between the index creation and the clustering itself. ANALYZE is a short-duration operation, so running it a second time is easily justified for large tables like osm_buildings.

After the clustering is done and after a deploy (rotation of tables), I check index usage with this query:

```sql
SELECT indexrelname, idx_scan, idx_tup_read, idx_tup_fetch
FROM pg_stat_user_indexes WHERE schemaname = 'public';
```

In my case, all the smaller tables had their geohash index used once, with the number of tuples read and fetched equal to the number of rows in the table:

```
osm_waterways_geom_geohash  | 1 | 13836389 | 13836389
osm_places_geom_geohash     | 1 |  3728712 |  3728712
osm_landusages_geom_geohash | 1 | 22426817 | 22426817
```

But the indexes on the two largest tables were not used:

```
osm_buildings_geom_geohash  | 0 | 0 | 0
osm_minorroads_geom_geohash | 0 | 0 | 0
```

The log at /var/log/postgresql shows no errors during the optimize run. So a sequential scan was used for clustering instead of an index scan, even though I have random_page_cost = 1.1 in postgresql.conf.

I then manually dropped the index, ran ANALYZE, created the index again, ran ANALYZE, and clustered the table on that index:

```sql
CREATE INDEX osm_minorroads_geom_geohash
    ON osm_minorroads (ST_GeoHash(ST_Transform(ST_SetSRID(Box2D(geometry), 3857), 4326)));
ANALYZE osm_minorroads;
CLUSTER VERBOSE osm_minorroads_geom_geohash ON osm_minorroads;
```

This time the index scan was used, and it was about 4 times faster than the sequential scan and sort on osm_minorroads (50 minutes vs. 4 hours).

Possible Fix

Run the code that analyzes the table also between the CREATE INDEX and CLUSTER steps in database/postgis/postgis.go.
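As a minimal sketch of the proposed statement order (this is not imposm3's actual code; the function and the table/index naming scheme are illustrative, and the statements are returned as strings so the sketch needs no database connection):

```go
package main

import "fmt"

// optimizeStatements returns the SQL that -optimize could run for one
// table: create the geohash index, ANALYZE so the planner has fresh
// statistics for the new expression index, then cluster on it.
func optimizeStatements(table string) []string {
	idx := table + "_geom_geohash"
	return []string{
		fmt.Sprintf("CREATE INDEX %s ON %s (ST_GeoHash(ST_Transform(ST_SetSRID(Box2D(geometry), 3857), 4326)))", idx, table),
		// the ANALYZE in the middle is the change this issue suggests
		fmt.Sprintf("ANALYZE %s", table),
		fmt.Sprintf("CLUSTER %s ON %s", idx, table),
	}
}

func main() {
	for _, s := range optimizeStatements("osm_buildings") {
		fmt.Println(s)
	}
}
```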

Steps to Reproduce

  1. Full planet import of osm_buildings or osm_minorroads with -write -optimize and the example mapping
  2. -deploy
  3. Query: SELECT indexrelname, idx_scan, idx_tup_read, idx_tup_fetch FROM pg_stat_user_indexes WHERE schemaname = 'public';
  4. Check whether the largest tables were read sequentially (index not used). In my case they were. Then check whether an ANALYZE in between changes anything:
  5. DROP INDEX osm_minorroads_geom_geohash;
  6. ANALYZE osm_minorroads;
  7. CREATE INDEX osm_minorroads_geom_geohash ON osm_minorroads (ST_GeoHash(ST_Transform(ST_SetSRID(Box2D(geometry), 3857), 4326)));
  8. ANALYZE osm_minorroads;
  9. CLUSTER VERBOSE osm_minorroads_geom_geohash ON osm_minorroads; does it now use an index scan?

    Context

    This could speed up the optimize part of imposm on large tables.

    Your Environment

    • Version used: own build of the latest commit before the 0.6.0 alpha release, with LevelDB >1.21 support
    • PostgreSQL 9.5.12 with PostGIS 2.2.1
    • Operating System and version + spec: Ubuntu Linux 16.04, 64GB RAM, 2TB HDD (500GB occupied), 8 core processor
    • PostgreSQL settings: maintenance_work_mem = 16GB, random_page_cost = 1.1
    • Data size: osm_buildings: 52 GB (279 million rows), osm_minorroads: 28 GB (103 million rows), osm_landusages: 12 GB (22 million rows)

I could prepare a PR for this if desired.

olt commented 5 years ago

Sorry for the silence so far. I haven't found time to work on this but I'd like to keep this issue open.

lubojr commented 5 years ago

Sorry for closing then.

cquest commented 4 years ago

Hi @lubojr , I don't see why ANALYZE would change the CLUSTER process.

During an import where both write and optimize are done, the optimization step currently runs after all indexes have been created. This makes CLUSTER much slower, as reordering the table forces PostgreSQL to update all the existing indexes at the same time.

I suggest (when both write and optimize are done):

  1. CREATE the geohash INDEX + maybe ANALYZE if it really improves the clustering step
  2. CLUSTER the table
  3. DROP the geohash INDEX (this drop step is currently missing as well; keep the index only if you want to recluster later, for example with pg_repack)
  4. CREATE all other INDEX

This also reduces the required storage during CLUSTER, since CLUSTER normally makes a copy of the data AND all the indexes.
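Concretely, for one table the proposed order would look roughly like this (a SQL sketch only; osm_buildings and the geohash expression are taken from the discussion above, and the ANALYZE line is the optional step this issue is about):

```sql
-- 1. create only the geohash index needed for clustering
CREATE INDEX osm_buildings_geom_geohash
    ON osm_buildings (ST_GeoHash(ST_Transform(ST_SetSRID(Box2D(geometry), 3857), 4326)));
ANALYZE osm_buildings;  -- optional: give the planner fresh statistics
-- 2. rewrite the table in geohash order; only this one index has to be maintained
CLUSTER osm_buildings_geom_geohash ON osm_buildings;
-- 3. drop the clustering index (keep it only if you plan to recluster, e.g. with pg_repack)
DROP INDEX osm_buildings_geom_geohash;
-- 4. now create all the other indexes (unique id, geometry GiST, ...)
```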

cquest commented 4 years ago

I did a PR https://github.com/omniscale/imposm3/pull/211 to reorder CLUSTER/CREATE INDEX...

My extract import time is down from 3'45 to 2'40

lubojr commented 4 years ago

@cquest This is a great catch and a big improvement in time spent! Unfortunately I no longer have access to the server mentioned above, so I can't try the improvement on a full planet import (mainly the osm_buildings table), but it will probably be huge too. We shall see if this gets integrated into master soon.

cquest commented 4 years ago

I have a full planet import on my workstation, and just launched a comparison on the index effect on CLUSTER. Regarding reclustering, pg_repack is an option to consider. It does the same thing as CLUSTER (or VACUUM FULL) but online, with no lock. I'll give it a try ;)

The benefit of having data geographically clustered is real... it was the subject of one of my presentations at SOTM 2014 in Buenos Aires!

cquest commented 4 years ago

On my planet import, here is the difference when doing a CLUSTER on the osm_buildings table with and without index:

I think we can expect the same improvement on a full import.

cquest commented 4 years ago

More timings on a full planet osm_buildings table...

One more possible improvement is creating a table's indexes in parallel... this takes advantage of the data already being in cache. Creating indexes sequentially on each (large) table causes all the table data to be read from disk once per index (there are always at least two: the UNIQUE one and the geometry GiST one). I'll check how to implement that... but I'm really not familiar with Go (and its multithreading).

ANALYZE could be done after clustering and indexing (I've implemented it, but not tested yet).

lubojr commented 4 years ago

Solved by #211 - commit 3b6e6b38f4529ac468580040dff25a0b47a037c9