hotosm / underpass

A customizable data engine for processing mapping data
https://underpass.hotosm.org
GNU General Public License v3.0

OSM raw data replication into PostGIS DB #45

Closed elpaso closed 2 years ago

elpaso commented 3 years ago

Goal

The goal is to update a raw OSM PostGIS database from the change data as it flows through Underpass.

User stories (kind of)

The raw OSM database will be used to support displaying maps on websites, extracting data (our export-tool), conflation of duplicates, etc. Currently most people just download the daily snapshot from Geofabrik, but we want to do it in near real-time, processing the minutely change files.

DB Schema Selection

The key part is which OSM database schema to update, as there are three candidates: the ogr2ogr one, the osm2pgsql one, and the pgsnapshot one.

Schema Comparison

The table below compares some features of the three tools/DB schemas.

| Name | Tool Language | OSC Support | Geometry in the main table |
| --- | --- | --- | --- |
| osm2pgsql | C++ | yes | yes |
| ogr2ogr | C++ | no | yes |
| pgsnapshot | -- | yes | no |

osm2pgsql seems the most suitable because it is the only option that both supports OSC change files and stores geometries in the main tables.

Proposed workflow

  1. initial data import through the osm2pgsql binary tool, using a data PBF obtained for instance from Geofabrik
  2. the Underpass monitoring thread looks for change files in OSC format and downloads them
  3. Underpass pipes the downloaded change file content into a separate thread, where a spawned osm2pgsql process takes care of the DB update
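Step 3 above could be sketched as follows. This is a minimal illustration, not the Underpass implementation: the helper names are made up, while the `--append`/`--slim`/`--database` flags are the documented osm2pgsql options for applying changes on top of an existing slim-mode import.

```python
import subprocess

def osm2pgsql_append_cmd(dbname, osc_file):
    """Build the command line that applies one OSC change file.

    --append applies changes on top of an existing import; --slim is
    required because updates need the intermediate ("slim") tables.
    """
    return ["osm2pgsql", "--append", "--slim", "--database", dbname, osc_file]

def apply_change(dbname, osc_file):
    # Spawned as a separate process, as described in the workflow above.
    subprocess.run(osm2pgsql_append_cmd(dbname, osc_file), check=True)
```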

Future enhancements (out of scope for this task)

Regarding step 3 above, it may be possible to speed up operations by bypassing the osm2pgsql process and using a direct implementation of the DB CRUD methods, updating the DB and reconstructing geometries when necessary. This is a significant effort and is not recommended in the current development phase, because it would be a form of premature optimization.

Another advantage of a direct implementation of the updates would be the removal of an external dependency.

Open Issues

The main problem is how to identify the HTTPS address of the first OSC file that needs to be applied in order to keep the DB in sync. The current implementation in Underpass is not working (or is just a stub), but there is a configuration option to pass the address directly to the application.

Implementation

Update Strategy

  1. find the current DB timestamp
  2. find the correct OSC file
  3. download the OSC file and update the DB

1. Find the current DB timestamp

The current timestamp of the DB is required in order to identify the first change file to process. A running Underpass application can keep this information in its internal memory, obtaining it from the last update, but the initial value must be calculated from the DB itself. The following query can be used to get the initial value:

select max(foo.ts) from (
    select max(tags -> 'osm_timestamp') as ts from planet_osm_point
    union
    select max(tags -> 'osm_timestamp') as ts from planet_osm_line
    union
    select max(tags -> 'osm_timestamp') as ts from planet_osm_polygon
    union
    select max(tags -> 'osm_timestamp') as ts from planet_osm_roads
) as foo;

2. Find the correct OSC file

The OSC file that needs to be processed first is the closest (in time) OSC file that has a timestamp greater than the current DB timestamp.
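The selection rule above can be expressed in code. This is a sketch under the assumption that `candidates` is a list of `(sequence number, timestamp)` pairs already gathered from state.txt files; the function name is made up for illustration.

```python
from datetime import datetime

def first_osc_after(db_ts, candidates):
    """Return the sequence number of the earliest change file whose
    timestamp is strictly greater than the DB timestamp, or None
    if the DB is already up to date."""
    newer = [(ts, seq) for seq, ts in candidates if ts > db_ts]
    return min(newer)[1] if newer else None
```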

The timestamp of an OSC file (for instance 049.osc.gz [3]) is contained in the corresponding *.state.txt file (for instance: 049.state.txt [4]), but there is no way to obtain the path given a timestamp.
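The *.state.txt files are Java-properties-style text in which the colons inside the timestamp are backslash-escaped, so a small parser suffices to extract the fields. A sketch (the sample content below is made up, following the known state.txt layout):

```python
def parse_state(text):
    """Parse an osmosis-style state.txt into a dict of fields."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip comments such as the generation date
        key, _, value = line.partition("=")
        fields[key] = value.replace("\\:", ":")  # unescape '\:' in timestamps
    return fields

# Hypothetical sample following the state.txt layout
sample = ("#Thu Mar 18 09:23:02 UTC 2021\n"
          "sequenceNumber=4679049\n"
          "timestamp=2021-03-18T09\\:22\\:58Z\n")
```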

Some research needs to be done to determine the best strategy for finding the address of the first OSC file that needs to be applied, also looking at how other applications have solved this problem. For the time being, a direct configuration option pointing to the first OSC file is already implemented and works just fine.

One possible strategy consists in recursively downloading the index pages of the change tree (i.e. https://planet.maps.mail.ru/replication/minute/) and analyzing the state.txt files until the change file closest to (and greater than) the DB timestamp is found.

The subsequent addresses can be calculated, knowing that each subfolder contains at most 1000 OSC files (numbered 000 through 999), after which the parent folder number is incremented by one.
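Concretely, the replication sequence number maps to the AAA/BBB/CCC path by base-1000 arithmetic: for example, [3] and [4] (sequence 4679049) live under 004/679/049. A sketch:

```python
def replication_path(seq):
    """Map a replication sequence number to its AAA/BBB/CCC path,
    e.g. 4679049 -> "004/679/049"."""
    return "{:03d}/{:03d}/{:03d}".format(
        seq // 1000000, (seq // 1000) % 1000, seq % 1000)
```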

3. Download the OSC file and update the DB

The OSC download part, given an address, is already implemented in Underpass. The update part will run in a separate thread and will launch a separate osm2pgsql process. Depending on the performance of the update operation, we may choose to join the thread or let it run in the background; in the latter case some sort of locking mechanism will be required to prevent out-of-order concurrent updates to the DB. The simplest joining option will be attempted first; more complex schemes will be implemented if they turn out to be required.
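One simple way to guarantee in-order updates in the background case is a single consumer thread fed by a queue: because only one worker drains the queue, change files are applied strictly in enqueue order. A sketch, with `apply` standing in for whatever function actually spawns osm2pgsql:

```python
import queue
import threading

def run_updates_in_order(apply, change_files):
    """Serialize DB updates on one worker thread.

    A single consumer drains the queue, so change files are applied
    strictly in the order they were enqueued, preventing the
    out-of-order concurrent updates mentioned above.
    """
    q = queue.Queue()

    def worker():
        while True:
            item = q.get()
            if item is None:  # sentinel: stop the worker
                break
            apply(item)

    t = threading.Thread(target=worker)
    t.start()
    for cf in change_files:
        q.put(cf)
    q.put(None)
    t.join()
```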

[1] https://osm2pgsql.org/doc/manual.html#import-and-update
[2] https://ircama.github.io/osm-carto-tutorials/updating-data/
[3] https://planet.maps.mail.ru/replication/minute/004/679/049.osc.gz
[4] https://planet.maps.mail.ru/replication/minute/004/679/049.state.txt

elpaso commented 3 years ago

CC @robsavoye, assigned to you for review; I will take care of the implementation once the implementation plan is accepted.

mmd-osm commented 3 years ago

This script (along with some pyosmium import) supports all of those use cases out of the box already: https://github.com/openstreetmap/osm2pgsql/blob/master/scripts/osm2pgsql-replication

Also you should probably read https://github.com/openstreetmap/osm2pgsql/discussions/1481 on the raw data / exporting data topic.

elpaso commented 3 years ago

> This script (along with some pyosmium import) supports all of those use cases out of the box already: https://github.com/openstreetmap/osm2pgsql/blob/master/scripts/osm2pgsql-replication

Thank you, I'll have a look at the implementation and see if we can get some inspiration from it.

I'm not the product owner, so I can only point out a few technical advantages of having this functionality available in Underpass:

> Also you should probably read openstreetmap/osm2pgsql#1481 on the raw data / exporting data topic.

This doesn't seem related (I'm not planning any export), wrong link?

mmd-osm commented 3 years ago

Well, the project’s readme file states: "The same database can also be used for data exports, and Underpass can be used to keep that data up to date."

The link was meant as a hint; osm2pgsql has different design goals which may not be in line with requirements to export data later on, or even manage osm data in its original raw format.

elpaso commented 3 years ago

> Well, the project’s readme file states: "The same database can also be used for data exports, and Underpass can be used to keep that data up to date."

Oh, I see, that's the general Underpass scope, but this issue is only about a small enhancement; thanks for pointing this out anyway.

> The link was meant as a hint; osm2pgsql has different design goals which may not be in line with requirements to export data later on, or even manage osm data in its original raw format.

I believe we are not planning to export OSM raw data, but rather some processed extracts for map display purposes.

robsavoye commented 3 years ago

Correct. We're not exporting the raw data in any form; it's there for data extracts and conflation. It's very common for humanitarian purposes to use small data extracts as overlays when out in the field.