protomaps / OSMExpress

Fast database file format for OpenStreetMap
BSD 2-Clause "Simplified" License
augmented diff example program #17

Closed bdon closed 3 years ago

bdon commented 4 years ago

(This is a summary of discussion in the OSMUS #dev slack channel)

Motivation

Replication diffs as OsmChange (.OSC) files are the standard way of consuming OSM updates. The OSC format is not reference-complete. Clients that want to see the before/after for a changed object's tags, geometry or metadata need to source this information from elsewhere.

The most popular "enhanced" diff format is the Augmented Diff described on the OSM wiki (https://wiki.openstreetmap.org/wiki/Overpass_API/Augmented_Diffs). It is implemented by Overpass API; I'm not aware of other implementations.

In theory, one can generate augmented diffs with two inputs: 1. an OsmChange file and 2. an osmx database that's the complete state of OSM immediately before that OsmChange is applied. The Augmented Diff can then be hosted as a static file or put on S3. The benefit of this strategy is that it has very few moving parts.
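
A minimal sketch of the two-input idea above, using only the Python standard library. The `prior` dict is a stand-in for lookups against the osmx database (it is not the osmx API), and the `<action>` / `<old>` / `<new>` layout follows my reading of the wiki page, so treat the element structure as an assumption rather than a reference implementation:

```python
import xml.etree.ElementTree as ET


def augment(osc_xml, prior):
    """Build an augmented-diff-like tree from an OsmChange document.

    prior: hypothetical dict mapping (element type, id string) to the
    ET.Element holding the object's state before the change. In a real
    system this lookup would be answered by the osmx database.
    """
    out = ET.Element("osm", version="0.6", generator="sketch")
    for block in ET.fromstring(osc_xml):  # <create>, <modify>, <delete>
        for new_obj in block:
            action = ET.SubElement(out, "action", type=block.tag)
            if block.tag == "create":
                # Created objects have no previous state to attach.
                action.append(new_obj)
                continue
            old_el = ET.SubElement(action, "old")
            old = prior.get((new_obj.tag, new_obj.get("id")))
            if old is not None:
                old_el.append(old)  # stays empty if the extract lacks it
            ET.SubElement(action, "new").append(new_obj)
    return out
```

The result can be serialized with `ET.tostring(augment(osc_xml, prior))` and written out as a static file.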

In Development

@CloudNiner at Azavea is developing this idea at https://github.com/azavea/onramp, which is a C++ implementation. This is likely the way to go for a production-ready system. It may be worth writing a Python one as well, if only to validate the correctness of outputs across different implementations.

Augmented Diff format

  1. The bounding box for relations in the augmented diff format does not recursively include sub-relations.
  2. Overpass generates augmented diffs dynamically based on time ranges. There is not a 1:1 correspondence with replication OsmChange files. More detail is in the onramp docs (https://github.com/azavea/onramp/blob/master/docs/index.md) and elsewhere (https://github.com/drolbr/Overpass-API/issues/346, https://github.com/azavea/osmesa/issues/52). An open question is if the 1:1 strategy described above is good enough.
  3. osmx does not include all metadata for nodes (see https://github.com/protomaps/OSMExpress/issues/12), so clients that depend on that metadata may not be able to use this.
  4. It's unclear to me what version number should go in the <new> element for deleted objects. It might depend on how the OsmChange was generated. Again, if clients don't depend on this information it might not matter.

CloudNiner commented 4 years ago

🎉 great summary!

> An open question is if the 1:1 strategy described above is good enough.

As far as we've gotten, the answer to this depends on your use case and the tradeoffs you can make with regard to when specific changes show up based on time or transaction boundaries. We're not sure there's one correct answer for everyone; we attempted to summarize a bit here: https://github.com/azavea/onramp/blob/master/docs/index.md#use-cases.

> osmx do not include all metadata for nodes

Just double checking and clarifying -- all metadata is included for tagged nodes. It's the untagged nodes that only have (or will soon) version info?

bdon commented 4 years ago

> Just double checking and clarifying -- all metadata is included for tagged nodes. It's the untagged nodes that only have (or will soon) version info?

Yes, correct. The precise way to put it is "Nodes with 0 tags do not have timestamp, changeset, username or UID." All other information in the current state of OSM should be present.
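
For clients, one way to cope with this rule is to treat the four metadata fields as optional. The sketch below is illustrative only: `NodeRecord` and `xml_attributes` are hypothetical names, not part of osmx or its Python bindings, and it assumes version is always available as described above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class NodeRecord:
    """Hypothetical client-side view of a node read from an osmx file."""
    id: int
    version: int
    lat: float
    lon: float
    tags: dict = field(default_factory=dict)
    # Absent (None) for nodes with 0 tags, per the rule quoted above.
    timestamp: Optional[str] = None
    changeset: Optional[int] = None
    user: Optional[str] = None
    uid: Optional[int] = None


def xml_attributes(node):
    """Return XML attributes for a node, omitting metadata that is missing."""
    attrs = {
        "id": str(node.id),
        "version": str(node.version),
        "lat": f"{node.lat:.7f}",
        "lon": f"{node.lon:.7f}",
    }
    for name in ("timestamp", "changeset", "user", "uid"):
        value = getattr(node, name)
        if value is not None:
            attrs[name] = str(value)
    return attrs
```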

bdon commented 4 years ago

https://github.com/protomaps/OSMExpress/blob/master/python/examples/augmented_diff.py is now usable to generate a partial augmented diff with create, modify and delete elements. I have tested the output with achavi (https://overpass-api.de/achavi/); you can drag the file onto the map to visualize the diff.

Two things I have not implemented yet:

In addition I made some notes while implementing:

bdon commented 4 years ago

I've implemented propagating changes as well as bounds.

I haven't tested this in many situations but it should be a complete proof of concept. I did need to add some functions to the Python bindings in order to retrieve node->way, node->relation and way->relation references.
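
To illustrate what propagation over those reference lookups could look like, here is a small sketch. The three mappings are plain dicts standing in for the node->way, node->relation and way->relation lookups; they are not the actual binding functions, and whether relations should also pick up ways that were only indirectly affected is an assumption of this sketch.

```python
def propagate(changed_nodes, changed_ways,
              node_to_ways, node_to_relations, way_to_relations):
    """Return (ways, relations) affected by a set of changed nodes and ways.

    Each *_to_* argument is a hypothetical dict mapping an element id to
    the ids of the parent elements that reference it.
    """
    ways = set(changed_ways)
    relations = set()
    for node_id in changed_nodes:
        # A moved node changes the geometry of every way that uses it...
        ways.update(node_to_ways.get(node_id, ()))
        # ...and of every relation that has it as a direct member.
        relations.update(node_to_relations.get(node_id, ()))
    for way_id in ways:
        # Relations are affected through both changed and propagated ways.
        relations.update(way_to_relations.get(way_id, ()))
    return ways, relations
```

Every id returned here becomes an extra element in the augmented diff, which is why the output grows quickly once propagation is enabled.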

Once propagation is added in, the number of elements in the augmented diff increases a lot; this code is slow in general.

For regional extracts that are not reference-complete for all relations (e.g. admin boundaries) I skip propagation when a referenced member is missing from the osmx.
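
As a rough illustration of how bounds and the missing-member case can interact, the sketch below computes a non-recursive relation bounding box and gives up when a node or way member cannot be found. The dict inputs and the function name are hypothetical, and this is not necessarily the logic used in augmented_diff.py.

```python
def relation_bounds(members, node_locations, way_nodes):
    """Return (min_lon, min_lat, max_lon, max_lat), or None if incomplete.

    members: list of (type, ref) pairs for the relation.
    node_locations: hypothetical dict of node id -> (lon, lat).
    way_nodes: hypothetical dict of way id -> list of node ids.
    """
    points = []
    for member_type, ref in members:
        if member_type == "node":
            if ref not in node_locations:
                return None  # member missing from the extract
            points.append(node_locations[ref])
        elif member_type == "way":
            if ref not in way_nodes:
                return None
            for node_id in way_nodes[ref]:
                if node_id not in node_locations:
                    return None
                points.append(node_locations[node_id])
        # Relation members are ignored: the bounds are not recursive.
    if not points:
        return None
    lons = [p[0] for p in points]
    lats = [p[1] for p in points]
    return (min(lons), min(lats), max(lons), max(lats))
```

A caller can treat a `None` result as the signal to skip that relation.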

@CloudNiner the output of python augmented_diff.py may be useful to compare to Onramp.

CloudNiner commented 4 years ago

🎉!

> this code is slow in general

How slow is that, taking a ballpark guess at an average running time per OSM OSC file?

bdon commented 4 years ago

> 🎉!
>
> > this code is slow in general
>
> How slow is that, taking a ballpark guess at an average running time per OSM OSC file?

I just tested it on planet.osmx with a minutely diff. It seemed to take only a few seconds, but OSCs vary widely in how complex they are.

bdon commented 3 years ago

https://github.com/azavea/onramp seems to be using this successfully, so marking this as closed.