osm2pgsql-dev / osm2pgsql

OpenStreetMap data to PostgreSQL converter
https://osm2pgsql.org
GNU General Public License v2.0
1.48k stars 473 forks source link

Stage 2 flex processing is slow (possible issue with DELETE statement?) #1640

Closed mboeringa closed 2 years ago

mboeringa commented 2 years ago

What version of osm2pgsql are you using?

osm2pgsql version 1.6.0 (1.3.0-580-g061d4013), compiled from source.

What operating system and PostgreSQL/PostGIS version are you using?

Ubuntu 20.04.1 LTS, PostgreSQL 14.2, PostGIS 3.2.0.

Tell us something about your system

Bare metal Windows Hyper-V VM with 190 GB RAM assigned to VM, 2x Xeon E5-2680 v4, 14C/28T each, 4x2TB NVMe PCIe 3.0 Windows Storage Space on HP Z840 workstation.

What did you do exactly?

Imported Planet using a slightly modified version of the in-development flex version of 'openstreetmap-carto' Lua style file as available here: https://github.com/gravitystorm/openstreetmap-carto/blob/e7af9bd90799103955f1d3996201ce0904be1665/openstreetmap-carto.lua

What did you expect to happen?

The style mentioned above has special handling for administrative boundaries, and creates a custom 'planet-osm-admin' line geometry table with de-duplicated lines representing administrative boundaries of the highest level using stage 2 flex processing.

The amount of data to be processed here is significant, as being administrative boundaries, but the total number of records to re-process limited, osm2pgsql reports:

"There are 2335433 ways to reprocess"

I would expect this to execute relatively fast considering the limited number of records of just over 2M.

What did happen instead?

The re-processing "appears" slow. Note that this is a completely arbitrary statement, as I don't know exactly what is going on in this stage (except for what it needs to create as output). I just notice it takes many hours of processing for a fairly limited amount of records (albeit being administrative boundaries, which can be huge).

What did you do to try analyzing the problem?

One thing I noticed, and this is actually why I opened this issue, is that at this stage the only thing I see happening on the database - at least from the limited perspective of what the pgAdmin interface shows - is that pgAdmin shows a cycle of DELETE statements, where the WHERE clause appears to show batches of records being deleted:

afbeelding

This makes me wonder if the actual issue is not so much the processing for stage 2 flex processing, or that the slowness might be caused by this DELETE operation going in batches over the records to delete using large WHERE clauses with lots of IN record references.

Wouldn't it possibly be much more efficient and faster to create a secondary table with the osm_id of the records to delete, and then use a

DELETE FROM <table1> USING <table2> WHERE <table1>.osm_id = <table2>.osm_id

type DELETE record statement with USING as also shown and documented on the PostgreSQL Help pages:

https://www.postgresql.org/docs/14/sql-delete.html

I know from my own experience in another application I wrote, that this seems quite efficient at deleting large numbers of records from an existing table. The PostgreSQL Help also seems to recommend it, although in reference to a comparison with a sub-select there:

In some cases the join style is easier to write or faster to execute than the sub-select style.

lonvia commented 2 years ago

Please provide the actual style used.

mboeringa commented 2 years ago

I attached the style as ZIP.

However, I don't think it is actually that relevant. This is more of a generic question: is it efficient to delete large numbers of records in batches using IN statements, or is it potentially faster to process using the suggested SQL statement? I have never tested this in PostgreSQL, and don't know if there are potential differences, although the PostgreSQL Help suggests there may be differences depending on the used SQL (in relation to sub-select versus USING clause there though).

openstreetmap-carto.zip

lonvia commented 2 years ago

Ah, so you are not reporting an issue but want to open a discussion on use of SQL. Let me move the issue into the discussion section then.