osmcode / pyosmium

Python bindings for libosmium
https://osmcode.org/pyosmium
BSD 2-Clause "Simplified" License

Should pyosmium-up-to-date respect an .osm.pbf's bounds #256

Open daniel-j-h opened 4 months ago

daniel-j-h commented 4 months ago

Hi there! Suppose I have used osmium extract to generate a small (< 10 MB) .osm.pbf file of an area from a snapshot, and I have used the --set-bounds option so that the bounds get written into the file header.
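For concreteness, the initial cut looks roughly like this (the bounding box coordinates and file names are placeholders, not my actual setup):

  # Cut a small area out of a snapshot; --set-bounds writes the bbox into the header.
  osmium extract --bbox 14.90,51.10,15.00,51.20 --set-bounds \
      -o town.osm.pbf germany-latest.osm.pbf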

I want to keep this small file up to date, e.g. on a daily basis, by running pyosmium-up-to-date, but when I do so it looks like:

  1. It takes multiple minutes (not that big of a deal)
  2. After it finishes I end up with a vastly bigger file (~85 MB)
  3. The updated file no longer seems to include a bounding box in its header

Here is the osmium fileinfo output on the .osm.pbf pyosmium-up-to-date generates:

File:
  Name: latest.osm.pbf
  Format: PBF
  Compression: none
  Size: 88664952
Header:
  Bounding boxes:
  With history: no
  Options:
    generator=pyosmium-up-to-date/3.6.0
    osmosis_replication_base_url=https://planet.osm.org/replication/hour/
    osmosis_replication_sequence_number=103469
    osmosis_replication_timestamp=2024-07-02T12:00:00Z
    pbf_dense_nodes=true
    timestamp=2024-07-02T12:00:00Z

I wanted to flag this behavior because it was unexpected to me and I'm not sure if this is by design.


My workaround for now is the following (a shell sketch follows below):

  1. Download a snapshot .osm.pbf once, e.g. from the Geofabrik download service (> 370 MB)
  2. Use osmium extract to cut a small .osm.pbf out of it (< 10 MB)
  3. Every day:
     a. run pyosmium-up-to-date (~ 100 MB)
     b. re-run osmium extract as in step 2 to re-cut for the specific bounds (< 10 MB)
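In shell terms, step 3 looks roughly like this (again with placeholder bounds and file names):

  # 3a. Apply replication diffs; the result grows well past the original bounds.
  pyosmium-up-to-date -o updated.osm.pbf town.osm.pbf
  # 3b. Re-cut to the original bounding box to get back to a small file.
  osmium extract --bbox 14.90,51.10,15.00,51.20 --set-bounds \
      --overwrite -o town.osm.pbf updated.osm.pbf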

Thank you! Also happy for any pointers on how other folks keep their small extracts up to date!

joto commented 4 months ago

I don't know of a good solution for this problem. Your workaround is what some people have tried; I remember seeing scripts to that effect floating around. The problem is that the OSM data model basically makes this impossible to do cleanly; you end up implementing some heuristic.

Protomaps can do minutely updates of extracts, but I think you need a complete database for that, too. And download.openstreetmap.fr offers minutely updated extracts, so apparently they have solved this somehow, but I don't know how they do it.

daniel-j-h commented 4 months ago

> I don't know of a good solution for this problem. Your workaround is what some people have tried; I remember seeing scripts to that effect floating around. The problem is that the OSM data model basically makes this impossible to do cleanly; you end up implementing some heuristic.

Got it, but this means that running pyosmium-up-to-date and then osmium extract should work and do the trick? The only downside is some wasted downloaded data, but in the end I'll get an .osm.pbf that adheres to the bounds and is up to date, yes?

> Protomaps can do minutely updates of extracts, but I think you need a complete database for that, too. And download.openstreetmap.fr offers minutely updated extracts, so apparently they have solved this somehow, but I don't know how they do it.

I do use Protomaps' .pmtiles format in my pipeline, generated by tilemaker. My pipeline looks something like this:

  1. Download an .osm.pbf from Geofabrik once
  2. Use osmium extract to cut out a very small area .osm.pbf
  3. Run tilemaker to generate .pmtiles from that small area .osm.pbf
  4. Every day (sketched below):
     a. Run pyosmium-up-to-date on that small area .osm.pbf
     b. Run osmium extract to cut out the small area again after the update
     c. Run tilemaker to generate .pmtiles from that small area .osm.pbf
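A rough sketch of the daily job, assuming a tilemaker version with .pmtiles output; the config.json and process.lua paths stand in for whatever your tilemaker setup uses:

  # 4a. Pull in replication diffs; the result overflows the original bounds.
  pyosmium-up-to-date -o updated.osm.pbf town.osm.pbf
  # 4b. Re-cut to the original bounding box (placeholder coordinates).
  osmium extract --bbox 14.90,51.10,15.00,51.20 --set-bounds \
      --overwrite -o town.osm.pbf updated.osm.pbf
  # 4c. Rebuild the tiles from the refreshed extract.
  tilemaker --input town.osm.pbf --output town.pmtiles \
      --config config.json --process process.lua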

The great thing about these small .pmtiles files is that I can host them e.g. on a GitHub page, and if the update pipeline above is fast enough (a few minutes) I could even have a GitHub Action generate a new .pmtiles file and check it in.

lonvia commented 4 months ago

Just one word of warning here: due to the way the OSM data is structured, you should always use a bounding box that is some 50-100 km larger than what you need. OSM objects around the fringes of your extract may move in and out of the bounding box, and that is not always captured correctly during the updates.
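In practice that just means padding the box when you cut the extract. A sketch with illustrative numbers (one degree is very roughly 70-111 km at mid latitudes):

  # Box of interest is 14.90,51.10,15.00,51.20; pad by about a degree per side.
  osmium extract --bbox 13.90,50.10,16.00,52.20 --set-bounds \
      -o town-buffered.osm.pbf germany-latest.osm.pbf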

lonvia commented 4 months ago

I'll consider adding bbox cutting to pyosmium-up-to-date. It might have to wait for the next rewrite of the tool, though.

joto commented 4 months ago

> Got it, but this means that running pyosmium-up-to-date and then osmium extract should work and do the trick?

No. That's the problem: there is no way to make sure this will always work except by having a complete OSM database. It will usually work, but as lonvia said, if you have objects near the boundary moving in and out, it can break. Or weird relations and the like.

daniel-j-h commented 4 months ago

Oh, okay, thank you folks, I didn't know about the need for a 50-100 km buffer. That will make things quite a bit heavier for the very small extracts I'm working with.

Do I still need the 50-100 km buffer even if I only care about objects that are always within the bounds and never cross them? For example, let's say I have a < 10 MB extract of a very small area (e.g. a small remote town) where I only care about buildings. Do I still need the 50-100 km buffer?

Where can I learn more about this? Does it boil down to understanding changesets and how they're generated?

What would be a good way, then, to update these < 10 MB extracts? Simply re-downloading a .osm.pbf, e.g. from Geofabrik, and re-running osmium extract, or using a 50-100 km buffer in the first place?

joto commented 4 months ago

There is no simple answer here. It all depends on what you are doing with the data and what kinds of glitches in the data you are prepared to work with or ignore. The buffer is simply a way to reduce the number of glitches you might get when something happens near the border of your extract; it is not foolproof. You have to understand the OSM data model and what data is in the changes and what isn't.

All that being said, if you don't do anything fancy with relations, this is not going to be a big problem in practice. Use a buffer big enough that all objects you care about are well inside the extract. And do a clean re-import every half year or so, so that if something was messed up you start from a clean setup every once in a while.
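A sketch of that periodic clean re-import, with a placeholder Geofabrik region and an illustrative buffered bounding box:

  # Every half year or so: discard the patched extract and start fresh.
  curl -fLO https://download.geofabrik.de/europe/germany-latest.osm.pbf
  osmium extract --bbox 13.90,50.10,16.00,52.20 --set-bounds \
      --overwrite -o town-buffered.osm.pbf germany-latest.osm.pbf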

lonvia commented 4 months ago

Maybe an example helps to illustrate the kind of glitches we are talking about:

Say you want to make an extract of Görlitz. Given its situation right at the Polish border, you cut the extract along the river Neisse. That works well when you create the extract: you now have all buildings on the western side of the river.

You happily apply diffs to the extract until one day a mapper realizes that one of the buildings was in fact put on the wrong side of the river. They move the building from the eastern bank to the western bank. That means the building should now appear in your extract.

However, there is a small problem. Because of the topological nature of the OSM data model, you move a building by changing the coordinates of the nodes that make up the building. You do not touch the OSM way that carries the actual building information. So when you get the diff with the change, it contains the new positions of the nodes, but no information about the OSM way describing the building. The way was not changed, so it is not in the diff. And because you are working with an extract, you don't have the information about the way either, because when you cut the extract it was outside the area of interest. The moved building will not appear on your map.

So the 50-100 km figure is a very conservative estimate of how much mappers move things around on the map in a way that creates these kinds of glitches. If you worked only with node data, you wouldn't need any buffer at all. If you are interested in only the buildings, a 2-5 km buffer is probably already sufficient.
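For scale: one degree of latitude is about 111 km, so a 2-5 km buffer corresponds to roughly 0.02-0.05 degrees, and the conservative 50-100 km to roughly 0.5-1 degree (you need correspondingly more degrees of longitude the further you are from the equator).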

daniel-j-h commented 4 months ago

Aah! Thank you so much, folks, now I understand the constraints a bit better. I didn't know about this! :raised_hands:

I will add a buffer then and make sure to re-import from scratch every now and then :+1: