Closed by bdon 3 years ago
🎉 great summary!
An open question is if the 1:1 strategy described above is good enough.
As far as we've gotten, the answer depends on your use case and on the tradeoffs you can make about when specific changes show up, based on time or transaction boundaries. We're not sure there's one correct answer for everyone; we attempted to summarize a bit here: https://github.com/azavea/onramp/blob/master/docs/index.md#use-cases.
osmx do not include all metadata for nodes
Just double checking and clarifying -- all metadata is included for tagged nodes. It's the untagged nodes that only have (or will soon) version info?
Yes, correct. The precise way to put it is "Nodes with 0 tags do not have timestamp, changeset, username or UID." All other information in the current state of OSM should be present.
https://github.com/protomaps/OSMExpress/blob/master/python/examples/augmented_diff.py is now usable to generate a partial augmented diff with create, modify, delete elements. I have tested the output with achavi: you can drag the file onto the map to visualize the diff: https://overpass-api.de/achavi/
Two things I have not implemented yet:
In addition I made some notes while implementing:
If a modify in the OSC happens to an element the osmx does not have, this modify is swizzled into a create action instead.
old elements may be missing username, uid, changeset and timestamp, of course by design.
I've implemented propagating changes as well as bounds.
I haven't tested this in many situations but it should be a complete proof of concept. I did need to add some functions to the Python bindings in order to retrieve node->way, node->relation and way->relation references.
Once propagation is added in, the # of elements in the augmented diff increases a lot; this code is slow in general.
For regional extracts that are not reference-complete for all relations (e.g. admin boundaries) I skip propagation when a referenced member is missing from the osmx.
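For anyone following along, the propagation step described above can be sketched roughly like this. This is not the code in augmented_diff.py: the plain dicts stand in for the node->way, node->relation and way->relation lookups, every name here is hypothetical, and reading "skip propagation" as "drop any relation with a member missing from the extract" is my assumption.

```python
from typing import Dict, List, Set, Tuple

Ref = Tuple[str, int]  # ("node" | "way" | "relation", id)


def propagate(
    changed_nodes: Set[int],
    node_to_ways: Dict[int, List[int]],
    node_to_relations: Dict[int, List[int]],
    way_to_relations: Dict[int, List[int]],
    relation_members: Dict[int, List[Ref]],
    present: Set[Ref],
) -> Tuple[Set[int], Set[int]]:
    """Return (ways, relations) affected by a set of changed nodes."""
    # node -> way: every way that references a changed node is affected.
    ways: Set[int] = set()
    for node_id in changed_nodes:
        ways.update(node_to_ways.get(node_id, []))

    # node -> relation and way -> relation: collect candidate relations.
    candidates: Set[int] = set()
    for node_id in changed_nodes:
        candidates.update(node_to_relations.get(node_id, []))
    for way_id in ways:
        candidates.update(way_to_relations.get(way_id, []))

    # Assumption: a relation is skipped when any member is absent from the
    # extract (e.g. an admin boundary that crosses the extract's edge).
    relations = {
        rel_id
        for rel_id in candidates
        if all(member in present for member in relation_members.get(rel_id, []))
    }
    return ways, relations
```

Only one level of propagation is shown (no relation->relation), which matches the three reference types mentioned above.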
@CloudNiner the output of python augmented_diff.py may be useful to compare to Onramp.
🎉!
this code is slow in general
How slow is that, taking a ballpark guess at an average running time per OSM OSC file?
I just tested it on planet.osmx with a minutely diff, it seemed to only take a few seconds, but OSCs vary widely in how complex they are.
https://github.com/azavea/onramp seems to be using this successfully, so marking this as closed.
(This is a summary of discussion in the OSMUS #dev slack channel)
Motivation
Replication diffs as OsmChange (.OSC) files are the standard way of consuming OSM updates. The OSC format is not reference-complete. Clients that want to see the before/after for a changed object's tags, geometry or metadata need to source this information from elsewhere.
The most popular "enhanced" diff format is the Augmented Diff described on the OSM wiki: https://wiki.openstreetmap.org/wiki/Overpass_API/Augmented_Diffs This is implemented by Overpass API. I'm not aware of other implementations.
In theory, one can generate augmented diffs with two inputs: 1. an OsmChange file and 2. an osmx database that's the complete state of OSM immediately before that OsmChange is applied. The Augmented Diff can then be hosted as a static file or put on S3. The benefit of this strategy is that it has very few moving parts.
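A minimal sketch of that two-input idea, using only the Python standard library. The lookup_old callable is a hypothetical stand-in for querying the osmx database, and the output layout (action elements wrapping old and new) only loosely follows the wiki description, so treat this as an illustration rather than a spec-compliant generator. It also shows a modify for an element missing from the previous state being turned into a create.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Optional

# Hypothetical lookup signature: (element type, id) -> previous element or None.
Lookup = Callable[[str, str], Optional[ET.Element]]


def build_augmented_diff(osc_xml: str, lookup_old: Lookup) -> ET.Element:
    osc = ET.fromstring(osc_xml)
    out = ET.Element("osm", version="0.6")
    for block in osc:  # <create>, <modify> or <delete>
        for new_elem in block:  # <node>, <way> or <relation>
            action_type = block.tag
            old_elem = lookup_old(new_elem.tag, new_elem.get("id"))
            # A modify without a known previous state becomes a create.
            if action_type == "modify" and old_elem is None:
                action_type = "create"
            action = ET.SubElement(out, "action", type=action_type)
            if action_type == "create":
                action.append(new_elem)
                continue
            old = ET.SubElement(action, "old")
            if old_elem is not None:
                old.append(old_elem)
            new = ET.SubElement(action, "new")
            new.append(new_elem)
    return out


if __name__ == "__main__":
    osc = """<osmChange version="0.6">
      <modify><node id="1" version="2" lat="1.0" lon="2.0"/></modify>
      <modify><node id="2" version="3" lat="0.0" lon="0.0"/></modify>
    </osmChange>"""
    old_state = {
        ("node", "1"): ET.fromstring('<node id="1" version="1" lat="0.5" lon="2.0"/>')
    }
    diff = build_augmented_diff(osc, lambda kind, ref: old_state.get((kind, ref)))
    print(ET.tostring(diff, encoding="unicode"))
```

The previous state is treated as fixed here; a real implementation would also have to handle the same element appearing more than once in a single OsmChange, plus geometry, bounds and propagation to parent ways and relations.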
In Development
@CloudNiner at Azavea is developing this idea here: https://github.com/azavea/onramp which is a C++ implementation. This is likely the way to go for a production-ready system. It may be worth writing a Python one as well, if only to validate the correctness of outputs across different implementations.
Augmented Diff format
<new> element for deleted objects. It might depend on how the OsmChange was generated. Again, if clients don't depend on this information it might not matter.