Closed elpaso closed 2 years ago
CC @robsavoye assigned to you for review, I will take care of the implementation when the implementation plan is accepted.
This script (along with some pyosmium import) supports all of those use cases out of the box already: https://github.com/openstreetmap/osm2pgsql/blob/master/scripts/osm2pgsql-replication
Also you should probably read https://github.com/openstreetmap/osm2pgsql/discussions/1481 on the raw data / exporting data topic.
This script (along with some pyosmium import) supports all of those use cases out of the box already: https://github.com/openstreetmap/osm2pgsql/blob/master/scripts/osm2pgsql-replication
Thank you, I'll have a look to the implementation and see if we can get some inspiration from it.
I'm not the product owner so I can only reply about a few technical advantages in having this functionality available in underpass:
Also you should probably read openstreetmap/osm2pgsql#1481 on the raw data / exporting data topic.
This doesn't seem related (I'm not planning any export), wrong link?
Well, the project’s readme file states: The same database can also be used for data exports, and Underpass can be used to keep that data up to date.
The link was meant as a hint; osm2pgsql has different design goals which may not be in line with requirements to export data later on, or even manage osm data in its original raw format.
Well, the project’s readme file states: The same database can also be used for data exports, and Underpass can be used to keep that data up to date.
Oh, I see, that's the general overpass scope but this issue is only about a small enhancement, thanks for pointing this out anyway.
The link was meant as a hint; osm2pgsql has different design goals which may not be in line with requirements to export data later on, or even manage osm data in its original raw format.
I believe we are not planning to export OSM raw data but some processed extracts for map display purposes.
Correct. Not exporting the raw data in any form, but there for data extracts and conflation. It's very common for humanitarian purpose to use small data extracts as overlays when out in the field.
Goal
The goal is to update a raw OSM postgis database from the change data as it flows through Underpass.
User stories (kind of)
The raw OSM database will be used to support displaying maps on websites, extracting data (our export-tool), conflation of duplicates, etc... Currently most people just download from geofabrik the daily snapshot, but we want to do it in near real-time, processing the minutely change files.
DB Schema Selection
The key part is which OSM database schema to update, as here are 3. There's the ogr2ogr one, the osm2pgsql one, and the pgsnapshot one.
Schema Comparison
The table below compares some features of the three tools/DB schemas.
osm2pgsql
seems the most suitable because:.osc
files out of the box [1]Proposed workflow
osm2pgsql
binary tool, datapbf
for instance obtained from geofabrikosc
format and downloads themosm2pgsq
process takes care of the DB updateFuture enhancements (out of scope for this task)
In the above .3 it will it be possible to speed up operations bypassing the
osm2pgsql
process and using a direct implementation of the DB CRUD methods that update the DB reconstructing geometries when necessary. This is a significant effort which is not recommended in the current development phase (because it would be a form of premature optimization).Another advantage of a direct implementation of the updates would be to reduce external dependency.
Open Issues
The main problem is how to identify the https address of the first
OSC
file that needs to applied in order to keep the DB in sync, the current implementation in underpass is not working or just a stub but there is a configuration option to pass the address directly to the application.Implementation
Update Strategy
1. Find the current DB timestamp
The current timestamp of the DB is required in order to identify the first change file to process, a running Underpass application can keep this information in its internal memory obtaining it from the last update but the initial value must be calculated from the DB itself, the following query can be used to get the initial value:
2. find the correct OSC file
The OSC file that needs to be processed first is the closest (in time) OSC file that has a timestamp greater than the current DB timestamp.
The timestamp of an OSC file (for instance
049.osc.gz
[3] is contained in the corresponding*.state.txt
file (for instance:049.state.txt
[4]) but there is no way to obtain the path given a timestamp.Some research needs to be done in order to determine what is the best strategy to find the address of the first OSC file that needs to be applied, also looking at how other applications solved this problem. For the time being a direct configuration option to the first OSC file is already implemented and works just fine.
One possible strategy could consists in recursively downloading the index page of the change tree (i.e.
https://planet.maps.mail.ru/replication/minute/
), analyze thestate.txt
files until the change file closest (and greater) to the DB timestamp is found.The subsequent addresses can be calculated, knowing that each subfolder contains at most
999
OSC files and then the parent is incremented by one.3. download the OSC file and update the DB
The OSC download part given an address is already implemented in Underpass. The update part will run in a separate thread and will launch a separate
osm2pgsql
process, depending on the performances of the update operation we may choose to join the thread or let it run in the background, in the latter case some sort of locking mechanism will be required in order to prevent out-of-order concurrent updates to the DB. The simplest joining option will be attempted first, other more complex schemes will be implemented in case they are required.[1] https://osm2pgsql.org/doc/manual.html#import-and-update [2] https://ircama.github.io/osm-carto-tutorials/updating-data/ [3] https://planet.maps.mail.ru/replication/minute/004/679/049.osc.gz [4] https://planet.maps.mail.ru/replication/minute/004/679/049.state.txt