rustprooflabs / pgosm-flex

PgOSM Flex provides high quality OpenStreetMap datasets in PostGIS (Postgres) using the osm2pgsql Flex output.

support update mode #275

Closed SailorMax closed 1 year ago

SailorMax commented 1 year ago

What version of PgOSM Flex are you using?

0.6.2

What did you do exactly?

  1. > docker exec -it -e POSTGRES_HOST=127.0.0.1 -e POSTGRES_DB=postgres pgosm python3 docker/pgosm_flex.py --ram=8 --language=en --data-only --append --schema-name=osm --srid=4326 --skip-nested --skip-dump --input-file=argentina-latest.osm.pbf
  2. > docker exec -it -e POSTGRES_HOST=127.0.0.1 -e POSTGRES_DB=postgres pgosm python3 docker/pgosm_flex.py --ram=8 --language=en --data-only --append --schema-name=osm --srid=4326 --skip-nested --skip-dump --input-file=bolivia-latest.osm.pbf

What did you expect to happen?

Database with data of both countries.

What did happen instead?

Database with only Argentina's data.

What did you do to try analyzing the problem?

I investigated the results. It looks like your append mode requires an internet connection and additional tables in the public schema to update the primary data. Can you add an update mode that just adds new data (by osm_id?) from source files to the already created db/schema/tables, without dropping or modifying the database, schemas, or tables?

Thank you.

rustprooflabs commented 1 year ago

@SailorMax I want to make sure I understand what you're trying to accomplish. There are a few similar, overlapping concepts, and I want to make sure we're talking about the same things.

Append mode

The --append mode uses osm2pgsql's append mode and osm2pgsql-replication to enable updating the database over time using diff files. Currently, multiple input files are not supported with append mode. The functionality in this project is dependent on the functionality those tools expose. There's an open pyosmium ticket (https://github.com/osmcode/pyosmium/issues/214, originally suggested in osm2pgsql-replication https://github.com/openstreetmap/osm2pgsql/pull/1769) to support this type of feature. When that feature is implemented upstream I would like to support that feature as well. Updates using --append do require an internet connection and the extra tables to track the replication in the public schema. I suspect there is a way to do this w/out requiring the internet connection, though I have not attempted that at all.

Multiple input files, single import

Importing multiple PBF files as one data set is supported by osm2pgsql but not yet by the Docker image. I show an example of that functionality under the Multiple PBF Inputs section of this post. For an immediate workaround to this you should be able to use osmium merge to give a single --input-file. Or, you can use PgOSM Flex without Docker by installing everything and running osm2pgsql manually. That will let you use the command shown in that blog post with multiple inputs.
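As a rough sketch of the osmium merge workaround mentioned above: the helper names (`run`, `merge_extracts`), the `DRY_RUN` switch, and the output file name are made up for illustration and are not part of PgOSM Flex. Only `osmium merge` and its `-o` option come from osmium-tool.

```shell
# Hypothetical helpers for illustration; only `osmium merge ... -o FILE`
# is an actual osmium-tool command.

# Run a command, or just print it when DRY_RUN=1 is set.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Merge one or more regional extracts into a single PBF file.
# Usage: merge_extracts OUTPUT.osm.pbf INPUT.osm.pbf [INPUT.osm.pbf ...]
merge_extracts() {
    out="$1"
    shift
    run osmium merge "$@" -o "$out"
}
```

For the two countries in the original report, this would be called as `merge_extracts merged.osm.pbf argentina-latest.osm.pbf bolivia-latest.osm.pbf`, and the resulting merged.osm.pbf (an assumed name) would then be passed as the single --input-file.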

I have an idea in mind on how to make this work well in the Docker image. It won't be in place immediately, though I think it should be able to support:

Multiple input files at different times

The exact commands you shared show running these files at separate times. I'm not sure if that's a side effect of trying to make it work, or desired functionality. E.g. do you want to be able to import Region A on Monday then add Region B on Tuesday? If so, does osm2pgsql by itself allow this in create mode? I sort of expect it does, I just can't recall testing this approach before. If osm2pgsql supports this, I'm sure we can work out a way to make this work in the Docker image too.

SailorMax commented 1 year ago

I mean "Multiple input files at different times": something like osm2pgsql --append that works without osm2pgsql-replication. Why I need this:

  1. My network is offline, so I use PBF files.
  2. As a result, I periodically need to refresh the data.
  3. Sometimes I need to add one more region to the database.

=> Recreating the whole database is not a good solution for me. It would be better to give the script a fresh PBF file for any region, already loaded or not, and have it append all new osm_id records to the existing database and remove deleted ones. I'm not sure about updating polygons and their data.

rustprooflabs commented 1 year ago

@SailorMax I've been digging into this, there's definitely room for improvement! Here are my thoughts so far.

First -- I had no idea osm2pgsql --append could add entirely new regions to an existing database! I had rarely used --append mode before the Flex import was added. I did too much post-processing to make that seem worthwhile.

I had misconceptions about how --append mode had been implemented in this Docker image. I struck through the incorrect details in my previous comment (https://github.com/rustprooflabs/pgosm-flex/issues/275#issuecomment-1336495225). I now see that osm2pgsql --append is not directly available through PgOSM Flex's Docker image. To properly support osm2pgsql's --append mode, the osm2pgsql-tuner project needs to support it first. That is another place where I started adding code to handle append mode but didn't fully support it.


:warning: Breaking change :warning:

I plan to move the current --append functionality to --replication. That more accurately describes what's happening in the current implementation.


I'd like to avoid re-using --append, even though that's what it's called in osm2pgsql. In reality, using osm2pgsql manually results in two specific types of commands: the first one, and all the others. I'm thinking about:

The --for-update option would be used for the first run into a clean database. This will let it know to use --slim without --drop, required (right?) for a subsequent --append to work. The --update command will use osm2pgsql --append for all subsequent imports. I think this should work for both additional regions (.osm.pbf) and diff files (.osc).

Loading additional files with osm2pgsql --append seems pretty slow. Loading an initial D.C.-sized region (18 MB) took 40 seconds, with nodes processing at 450k/s. Adding New Hampshire (52 MB) via osm2pgsql --append took 23.5 minutes, with nodes processing at only 5k/s. This slow timing makes it hard to work into the automated testing via make.

osm2pgsql -d $PGOSM_CONN  --cache=1157 --slim --output=flex --style=./run.lua /app/output/district-of-columbia-2022-11-08.osm.pbf 

2022-12-06 19:57:39  osm2pgsql version 1.7.2
2022-12-06 19:57:39  Database version: 15.1 (Debian 15.1-1.pgdg110+1)
2022-12-06 19:57:39  PostGIS version: 3.3
...
2022-12-06 19:58:14  Reading input files done in 35s.                                     
2022-12-06 19:58:14    Processed 1800659 nodes in 4s - 450k/s
2022-12-06 19:58:14    Processed 228856 ways in 26s - 9k/s
2022-12-06 19:58:14    Processed 4524 relations in 5s - 905/s
...
2022-12-06 19:58:19  osm2pgsql took 40s overall.

osm2pgsql -d $PGOSM_CONN --append --cache=1157 --slim --output=flex --style=./run.lua /app/output/new-hampshire-latest.osm.pbf
...
2022-12-06 20:05:36  osm2pgsql version 1.7.2
2022-12-06 20:05:36  Database version: 15.1 (Debian 15.1-1.pgdg110+1)
2022-12-06 20:05:36  PostGIS version: 3.3
...
2022-12-06 20:28:36  Reading input files done in 1380s (23m 0s).                          
2022-12-06 20:28:36    Processed 6836823 nodes in 1260s (21m 0s) - 5k/s
2022-12-06 20:28:36    Processed 587272 ways in 111s (1m 51s) - 5k/s
2022-12-06 20:28:36    Processed 4729 relations in 9s - 525/s
...
2022-12-06 20:29:02  osm2pgsql took 1405s (23m 25s) overall.

rustprooflabs commented 1 year ago

@SailorMax This feature seems to be working as expected on the dev branch. The current documentation is in the dev branch in this MD file: https://github.com/rustprooflabs/pgosm-flex/blob/dev/docs/UPDATE-MODE.md Those instructions will be cleaned up and likely moved as I work this into the main branch, hopefully in the next few days.

The :dev Docker image (docker pull rustprooflabs/pgosm-flex:dev) has the changes included. If you have a chance to test this out it would be great to know if it works as expected for you or not. Thanks!

rustprooflabs commented 1 year ago

Closing as completed, will be in 0.7.0 soon via #290