osm2pgsql-dev / osm2pgsql

OpenStreetMap data to PostgreSQL converter
https://osm2pgsql.org
GNU General Public License v2.0

Run update in transaction #2205

Open joto opened 2 months ago

joto commented 2 months ago

Osm2pgsql currently doesn't use transactions. It opens and closes several connections to the database, reading and writing data as needed.

This is okay for the initial import, because in typical use you do the import first and when that's done, you start using the data. If something breaks during import, you start from scratch. Using a transaction (that would possibly be open for many hours) doesn't gain us anything.

But for updates the situation is different. Here the use of the database happens in parallel with updates, at least in many cases. It would be easier for users to understand what state their database is in if we were using transactions. If something fails during the update, we could be sure not to have half of the data in the database. Note that the situation is not as bad as this might look: if an update fails, you will usually fix whatever led to the failure and restart the update from the beginning, and if it runs through this time it will get you to a defined point again. The half-updated data from the first try will get overwritten by the final data in most cases. (If you change the config file used, this might not always be the case, though.)
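To illustrate the all-or-nothing behaviour we would gain, here is a minimal sketch. It uses Python's built-in `sqlite3` as a stand-in for PostgreSQL so it runs without a server; the `places` table and the `make_demo_db` and `apply_update` helpers are made up for the example and have nothing to do with the osm2pgsql code.

```python
import sqlite3


class UpdateFailed(Exception):
    """Simulated error in the middle of an update run (hypothetical)."""


def make_demo_db():
    # In-memory stand-in for the database; table name and columns are invented.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE places (osm_id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("INSERT INTO places (osm_id, name) VALUES (1, 'old')")
    conn.commit()
    return conn


def apply_update(conn, rows, fail_after=None):
    """Apply all rows in one transaction; on error nothing is kept."""
    try:
        # The connection context manager commits on success and
        # rolls back if an exception leaves the block.
        with conn:
            for i, (osm_id, name) in enumerate(rows):
                if fail_after is not None and i == fail_after:
                    raise UpdateFailed("simulated failure mid-update")
                conn.execute(
                    "INSERT OR REPLACE INTO places (osm_id, name) VALUES (?, ?)",
                    (osm_id, name),
                )
    except UpdateFailed:
        return False
    return True
```

If `apply_update` fails partway through, readers of the table still see the data as it was before the update started; a later successful run replaces it in one step.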

One problem with transactions is that they are usually tied to a single database connection, and we use several connections in parallel for performance. But PostgreSQL has a mechanism that allows several connections to work on the same transaction snapshot: the snapshot synchronization functions. This is certainly something we could try.
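A rough sketch of how the snapshot synchronization functions are used, assuming DB-API style cursors (for example from psycopg2) on connections in autocommit mode so that the explicit `BEGIN` is honoured. `pg_export_snapshot()` and `SET TRANSACTION SNAPSHOT` are PostgreSQL features; the two Python function names are hypothetical. Note that this gives all connections the same view of the data, but each connection still commits its own writes, so it would only be one building block.

```python
def export_snapshot(coordinator_cursor):
    """Start a transaction on the coordinating connection and export its snapshot.

    The coordinating transaction must stay open for as long as other
    connections still need to import the snapshot.
    """
    coordinator_cursor.execute("BEGIN ISOLATION LEVEL REPEATABLE READ")
    coordinator_cursor.execute("SELECT pg_export_snapshot()")
    return coordinator_cursor.fetchone()[0]


def import_snapshot(worker_cursor, snapshot_id):
    """Make a worker connection see the same data as the coordinator.

    SET TRANSACTION SNAPSHOT has to be the first statement in a
    REPEATABLE READ (or SERIALIZABLE) transaction.
    """
    worker_cursor.execute("BEGIN ISOLATION LEVEL REPEATABLE READ")
    # The command takes a string literal, so quote the identifier by hand.
    quoted = snapshot_id.replace("'", "''")
    worker_cursor.execute(f"SET TRANSACTION SNAPSHOT '{quoted}'")
```

The coordinator would call `export_snapshot` once and hand the returned identifier to every worker, which calls `import_snapshot` before running its first query.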

Then there is the question of the performance impact this would have. It could go either way, so we'd have to test this carefully.

The last issue I see is when osm2pgsql-gen is used. This is currently a separate program which cannot share the transaction. But it doesn't have to be a separate program; it was just easier to do it this way as long as it is experimental. We can change that later.

See also #2190, #2110