Closed lubojr closed 4 years ago
Sorry for the silence so far. I haven't found time to work on this but I'd like to keep this issue open.
Sorry for closing then.
Hi @lubojr , I don't see why ANALYZE would change the CLUSTER process.
During an import where both write and optimize are done, the optimization step currently runs after all indexes have been created. This makes CLUSTER much slower, because reordering the tables forces PG to rebuild all the existing indexes at the same time.
I suggest (when both write and optimize are done):
This also reduces the storage required during CLUSTER, since CLUSTER normally makes a copy of the data AND of all the indexes.
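The reordering can be sketched in Go as below. This is only an illustration of the statement order, not imposm3's actual code: the `TablePlan` type, its fields and the example index definitions are hypothetical, and table and index names are assumed to need no quoting.

```go
package main

import "fmt"

// TablePlan describes one imported table. The type and its fields are
// hypothetical and only illustrate the ordering discussed above.
type TablePlan struct {
	Name         string   // table name, e.g. "osm_buildings"
	ClusterName  string   // name of the index used for clustering
	ClusterIndex string   // DDL that creates the clustering index
	OtherIndexes []string // DDL for the remaining indexes
}

// Statements returns the SQL in the suggested order: create only the
// clustering index, run CLUSTER, then create the remaining indexes,
// so that CLUSTER has a single index to rebuild.
func (t TablePlan) Statements() []string {
	stmts := []string{
		t.ClusterIndex,
		fmt.Sprintf("CLUSTER %s USING %s", t.Name, t.ClusterName),
	}
	return append(stmts, t.OtherIndexes...)
}

func main() {
	plan := TablePlan{
		Name:         "osm_buildings",
		ClusterName:  "osm_buildings_geom_geohash",
		ClusterIndex: "CREATE INDEX osm_buildings_geom_geohash ON osm_buildings (ST_GeoHash(ST_Transform(ST_SetSRID(Box2D(geometry), 3857), 4326)))",
		// Hypothetical example of an index created after clustering.
		OtherIndexes: []string{
			"CREATE INDEX osm_buildings_geom ON osm_buildings USING gist (geometry)",
		},
	}
	for _, s := range plan.Statements() {
		fmt.Println(s + ";")
	}
}
```

The point of the ordering is that every index existing at CLUSTER time is rebuilt and copied, so deferring the other indexes keeps that work to a minimum.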
I did a PR https://github.com/omniscale/imposm3/pull/211 to reorder CLUSTER/CREATE INDEX...
My extract import time is down from 3'45 to 2'40
@cquest This is a great catch and a large improvement in time spent! Unfortunately I no longer have access to the mentioned server to try the improvement on a full planet import (mainly the osm_buildings table), but it will probably be huge too. We shall see if this gets integrated into master soon.
I have a full planet import on my workstation, and just launched a comparison on the index effect on CLUSTER. Regarding reclustering, pg_repack is an option to consider. It does the same thing as CLUSTER (or VACUUM FULL) but online, with no lock. I'll give it a try ;)
The benefit of having data geographically clustered is real... it was the subject of one of my presentations at SOTM 2014 in Buenos Aires!
On my planet import, here is the difference when doing a CLUSTER on the osm_buildings table with and without index:
I think we can expect the same improvement on a full import.
More timings on a full planet osm_buildings table...
One more possible improvement is parallel indexing of tables... this takes advantage of having the data already in cache. Creating indexes sequentially on each (large) table causes all the table data to be read from disk again for each index (we always have at least 2, the UNIQUE one and the geometry gist one). I'll check how to implement that... but I'm really not familiar with Go (and its multithreading).
ANALYZE could be done after clustering and indexing (I've implemented it, but not tested yet).
Solved by #211 - commit 3b6e6b38f4529ac468580040dff25a0b47a037c9
Context
This is not a bug but a suggestion for an improvement. If imposm -optimize is used, a geohash index is created and the table is clustered on this index. I think it would be worth running ANALYZE between the index creation and the clustering as well. ANALYZE is a short operation, so running it twice should be justified for large tables like osm_buildings.
When I look at the results of index usage after the clustering is done and after a deploy (rotation of tables) using a query:
SELECT indexrelname, idx_scan, idx_tup_read, idx_tup_fetch FROM pg_stat_user_indexes WHERE schemaname = 'public';
In my case, the geohash index of every smaller table was used once, and the number of tuples read and fetched equals the number of rows in the table:
indexrelname | idx_scan | idx_tup_read | idx_tup_fetch
osm_waterways_geom_geohash | 1 | 13836389 | 13836389
osm_places_geom_geohash | 1 | 3728712 | 3728712
osm_landusages_geom_geohash | 1 | 22426817 | 22426817
But the indexes of the two largest tables were not used:
osm_buildings_geom_geohash | 0 | 0 | 0
osm_minorroads_geom_geohash | 0 | 0 | 0
The log at /var/log/postgresql does not show any errors during the optimize run. So a sequential scan was used for clustering instead of an index scan, even though I have random_page_cost = 1.1 in postgresql.conf.
So I tried to manually DROP the index, ANALYZE, CREATE the index again, ANALYZE, and CLUSTER the table on that index:
CREATE INDEX osm_minorroads_geom_geohash ON osm_minorroads (ST_GeoHash(ST_Transform(ST_SetSRID(Box2D(geometry), 3857), 4326)));
ANALYZE osm_minorroads;
CLUSTER VERBOSE osm_minorroads_geom_geohash on osm_minorroads;
Then the index_scan was used (it was 4 times faster than sequential scan and sort on osm_minorroads - 50 minutes vs 4 hours).
Possible Fix
In database/postgis/postgis.go, also run the code that analyzes the table between the index creation and the CLUSTER on that index.
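As a rough sketch of the proposed order, the Go snippet below builds the statements for one table with ANALYZE placed between CREATE INDEX and CLUSTER. The `ClusterStatements` helper is hypothetical and is not taken from postgis.go. The index expression and the `_geom_geohash` suffix are copied from the statements in this issue.

```go
package main

import "fmt"

// geohashExpr is the index expression quoted in this issue.
const geohashExpr = "ST_GeoHash(ST_Transform(ST_SetSRID(Box2D(geometry), 3857), 4326))"

// ClusterStatements returns the statements for one table, with ANALYZE
// between the index creation and CLUSTER so that planner statistics
// exist before CLUSTER picks its strategy. Hypothetical helper; table
// names are assumed to need no quoting.
func ClusterStatements(table string) []string {
	idx := table + "_geom_geohash"
	return []string{
		fmt.Sprintf("CREATE INDEX %s ON %s (%s)", idx, table, geohashExpr),
		fmt.Sprintf("ANALYZE %s", table),
		fmt.Sprintf("CLUSTER %s USING %s", table, idx),
	}
}

func main() {
	for _, s := range ClusterStatements("osm_minorroads") {
		fmt.Println(s + ";")
	}
}
```

Whether the extra ANALYZE actually switches CLUSTER to an index scan on a given table still has to be checked with CLUSTER VERBOSE or pg_stat_user_indexes, as described in this issue.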
Steps to Reproduce
SELECT indexrelname, idx_scan, idx_tup_read, idx_tup_fetch FROM pg_stat_user_indexes WHERE schemaname = 'public';
DROP INDEX osm_minorroads_geom_geohash;
ANALYZE osm_minorroads;
CREATE INDEX osm_minorroads_geom_geohash ON osm_minorroads (ST_GeoHash(ST_Transform(ST_SetSRID(Box2D(geometry), 3857), 4326)));
ANALYZE osm_minorroads;
CLUSTER VERBOSE osm_minorroads_geom_geohash on osm_minorroads;
Does it use index scan or not?
Context
This could speed up the optimize part of imposm on large tables.
Your Environment
I could prepare a PR for this if desired.