Ideas for improved geometry processing

joto commented 2 years ago

At the core of osm2pgsql's mission is the processing of geometries from OSM data into some useful format in a PostgreSQL/PostGIS database. Osm2pgsql is one step in a larger toolchain transforming the geometries into pixels on your screen showing a map (or helping to find places in the world, or whatever the final use of the data is).

Conceptually, processing in osm2pgsql has these steps:

The OSM objects (nodes, ways, and relations) are assembled into geometries.
The geometries are optionally transformed in some way to make them more useful or easier or quicker to access.
The geometries are loaded into the database.

You might think that osm2pgsql is not doing much in the second step, but there are several (optional) operations which fall into that category:

Transform the geometry into the target projection (usually web mercator)
Split up long linestrings
Split up multipolygons into polygons
Generate expire lists (which, if you think about it, are just another type of geometry) (see #1662 for ideas about expire lists)
Generate the bounding box for use in the flex config file
Calculate the area of (multi)polygons
Check validity of geometries (with the help of PostGIS, this really happens partially in step 1 and partially after step 3)
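To make two of the operations above concrete, here is a minimal Python sketch of projecting to web mercator and splitting a long linestring. It is an illustration under assumptions, not osm2pgsql's actual code: the function names are made up for this example, and the splitting works on planar coordinates only.

```python
import math

EARTH_RADIUS_M = 6378137.0  # WGS 84 semi-major axis, as used by EPSG:3857


def to_web_mercator(lon, lat):
    """Project a lon/lat pair (degrees) to web mercator coordinates (metres)."""
    x = EARTH_RADIUS_M * math.radians(lon)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat) / 2))
    return x, y


def split_linestring(points, max_length):
    """Cut a linestring into pieces no longer than max_length (planar units)."""
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    pieces = []
    current = [points[0]]
    used = 0.0  # length already consumed by the current piece
    for end in points[1:]:
        start = current[-1]
        seg = math.dist(start, end)
        while used + seg > max_length:
            # Cut the segment at the point where the piece reaches max_length.
            t = (max_length - used) / seg
            cut = (start[0] + t * (end[0] - start[0]),
                   start[1] + t * (end[1] - start[1]))
            current.append(cut)
            pieces.append(current)
            current = [cut]
            start = cut
            seg = math.dist(start, end)
            used = 0.0
        if seg > 0:
            current.append(end)
            used += seg
    if len(current) > 1:
        pieces.append(current)
    return pieces
```

Both functions only ever look at a single geometry, which is why this kind of work fits into step 2.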

Geoprocessing in the database

All other geometry processing is currently delegated to the database. After all, one of the major reasons we are using the PostgreSQL/PostGIS database system is its powerful geometry processing capabilities. Users of osm2pgsql use the database to calculate labelling points, simplify linestrings and polygons, merge multiple smaller objects into larger ones, and many more things.

But there are some costs involved with doing all those things in the database:

Sometimes you do the geometry processing when accessing the data (for instance when rendering a tile), which means you might repeat a lot of work that could have been done once.
When you want to do the work whenever the original OSM data changes, you have to use the rather coarse expire mechanism to trigger geometry processing or you have to set something up with database triggers etc.
If all you need is some piece of smaller data (like the center of a polygon, not the whole polygon), sending the big geometry to the database first (and maybe even committing it to disk) and reducing it there is wasteful compared to processing in osm2pgsql.
The Lua config file doesn't have access to any data that's only created in the database.
Writing code in the database can be quite complex, especially if triggers and materialized views etc. are involved.

So it makes sense to add some more geometry processing capabilities to osm2pgsql. The PR #1636 is where we are testing some of these.

But adding these kinds of capabilities only gets you so far. The way osm2pgsql operates you can usually only operate on a single feature at a time. So we can calculate the centroid of a polygon or simplify the geometry of a single way. But whenever we want to operate on multiple features, we really need to go to the database. Again, that's why we have the database, because it can easily find a bunch of objects and do some geometry processing on them. So that kind of processing is not going to go away.
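As an illustration of what "a single feature at a time" covers, the following Python sketch computes the area and centroid of one polygon ring with the shoelace formula. The function name and the ring representation (a closed list of planar `(x, y)` tuples) are assumptions for this example; it is not how osm2pgsql implements its centroid.

```python
def ring_area_and_centroid(ring):
    """Signed area and centroid of a closed ring [(x, y), ...] (first == last)."""
    twice_area = 0.0
    cx = cy = 0.0
    for (x0, y0), (x1, y1) in zip(ring, ring[1:]):
        cross = x0 * y1 - x1 * y0
        twice_area += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if twice_area == 0:
        raise ValueError("degenerate ring has no area")
    # Centroid = sum / (6 * area) = sum / (3 * twice_area).
    centroid = (cx / (3.0 * twice_area), cy / (3.0 * twice_area))
    return twice_area / 2.0, centroid
```

Nothing here needs to know about any other object, which is exactly what separates it from, say, merging the linestrings of a street.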

Working with updates

If we only do a one-off import of OSM data, we can easily run a SQL script afterwards that does any kind of processing we can imagine. Many people do that already. But if we want to be able to update the database this becomes more tricky. We need mechanisms to track what changes need to be done and trigger those changes. There are several options how such a tracking and triggering could be done:

Split the world into pieces and keep track of which pieces need re-processing. That's what an expire list really is. See #1662 for ideas about expire lists that would make this much more flexible and useful.
Store the OSM id with each object in the database and re-process them when a change for that id comes in. This is what osm2pgsql does to decide which features to re-create in the database. You can piggyback on that with database triggers. This is not easy to do because you need to handle new objects and deletions and changed objects (which osm2pgsql simply treats as a deletion plus a new object). It is possible, but not that easy to, say, join all linestrings of a longer street this way and keep track of all the constituent geometries and keep the result up-to-date this way. It would be great if we can make this kind of thing easier to do.
Somehow keep track of attributes and use them to trigger re-processing. If we join all the linestrings of motorway M17, we can find the one geometry with ref=M17 again and update it if needed. This isn't easy because you need to take any old and new tags on OSM objects into account, but it is something we could add some support for in osm2pgsql.
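The first option can be sketched in a few lines of Python. The tile arithmetic is the usual slippy-map formula; the function names are made up for this example, and the sketch only marks tiles that contain a vertex, so a long segment crossing a tile without a vertex in it would be missed.

```python
import math


def tile_for(lon, lat, zoom):
    """Return the (x, y) slippy-map tile containing a lon/lat point (degrees)."""
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n)
    # Clamp so that points on the eastern or polar edge stay inside the grid.
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)


def dirty_tiles(points, zoom):
    """Set of tiles touched by the vertices of a changed geometry."""
    return {tile_for(lon, lat, zoom) for lon, lat in points}
```

Anything derived from the data in a dirty tile then has to be re-processed, which is why this mechanism is coarse: one changed node marks a whole tile.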

So where do we go from here

There is already some work underway to allow for more geometry processing inside osm2pgsql (see PR #1636). The goal here is to make some processing much easier to do and more accessible to the casual user.
There are ideas for improving expire handling (see #1662)
We need to rethink the way we do updates when OSM objects change. Currently, a change is basically handled as a DELETE followed by an INSERT. To use this in the database you need to add a trigger on the DELETE to basically ignore it, and then a trigger on the INSERT that does any updates needed. Maybe we can add some support to osm2pgsql to make this easier and more straightforward. At a minimum we should document the way this is supposed to work so non-power-users can set up something like this.
We should think about ways of doing attribute-based "expire" and re-processing. One option would be to add some kind of hash to each database entry. That hash must be calculated in the Lua config based on the tags used in an object. And osm2pgsql would keep track of those hashes and make sure objects with the same hash are re-processed whenever an object with that hash changes.
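A rough sketch of the hash idea, written in Python rather than Lua only to keep it self-contained. Everything here is hypothetical (`tag_hash`, `HashTracker`, the choice of SHA-1); nothing like it exists in osm2pgsql today. The point is that a change has to mark both the old and the new hash as dirty.

```python
import hashlib
from collections import defaultdict


def tag_hash(tags, keys):
    """Stable hash over the selected tags of an object, or None if none are set."""
    relevant = sorted((k, tags[k]) for k in keys if k in tags)
    if not relevant:
        return None
    payload = "\x1f".join(f"{k}={v}" for k, v in relevant)
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


class HashTracker:
    """Remembers which objects share a hash so groups can be re-processed."""

    def __init__(self, keys):
        self.keys = keys
        self.by_object = {}              # OSM id -> hash
        self.by_hash = defaultdict(set)  # hash -> set of OSM ids

    def update(self, osm_id, new_tags):
        """Record a change (new_tags=None means deletion); return dirty hashes."""
        dirty = set()
        old = self.by_object.pop(osm_id, None)
        if old is not None:
            self.by_hash[old].discard(osm_id)
            dirty.add(old)
        new = tag_hash(new_tags, self.keys) if new_tags is not None else None
        if new is not None:
            self.by_object[osm_id] = new
            self.by_hash[new].add(osm_id)
            dirty.add(new)
        return dirty
```

If a way changes from ref=M17 to ref=M18, `update()` returns both hashes, so the merged M17 geometry loses a piece and the M18 geometry gains one.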

I am sure there are more ideas.

mboeringa commented 2 years ago

It is possible, but not that easy to, say, join all linestrings of a longer street this way and keep track of all the constituent geometries and keep the result up-to-date this way. It would be great if we can make this kind of thing easier to do.

Hi @joto ,

Nice comprehensive overview of the problems!

To keep things manageable though, I would first focus the efforts on "per-object" geometry processing for now (node, way, relation), and leave the onus for handling random collections of objects not part of OpenStreetMap relations to the end user of the produced database.

E.g. I think the one major enhancement that would be welcomed by many, is enhanced processing of relations, like the ability to filter relation members based on their relation "role" and construct some geometry from the filtered members, merge relations members into larger structures etc.
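A minimal Python sketch of what such role filtering could look like. The member layout (`type`, `ref`, `role`) loosely mirrors what OSM relations carry; the function names and the `ways` lookup table are assumptions made for this example, not an existing osm2pgsql API.

```python
def members_with_role(relation, roles, member_type="w"):
    """Return the refs of relation members of one type whose role is in roles."""
    return [m["ref"] for m in relation["members"]
            if m["type"] == member_type and m["role"] in roles]


def lines_for_roles(relation, ways, roles):
    """Collect the coordinate lists of the filtered way members.

    ways is a hypothetical lookup: way id -> list of (x, y) coordinates.
    Members whose geometry is not available are skipped.
    """
    return [ways[ref] for ref in members_with_role(relation, roles) if ref in ways]


relation = {"members": [
    {"type": "w", "ref": 1, "role": "main_stream"},
    {"type": "w", "ref": 2, "role": "side_stream"},
    {"type": "n", "ref": 3, "role": "spring"},
]}
ways = {1: [(0, 0), (1, 1)], 2: [(1, 1), (2, 0)]}
main = lines_for_roles(relation, ways, {"main_stream"})
```

The resulting list of linestrings could then be merged into one geometry per relation.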

Currently, relations are a kind of "second-rate" citizen in the realm of OpenStreetMap software. Except for multipolygons and stuff like boundaries, there isn't a whole lot of support in the ecosystem to handle them.

If osm2pgsql had targeted functions to do clever things with relations, like the filtering on role, OpenStreetMap relations could in fact become much more useful and "first-rate" citizens, as they deserve to be!

It is especially the handling of relations that is difficult if not impossible to do at the database level (not even taking into account that you currently can't access the full relation structure after import using osm2pgsql or most other import tools, let alone in the context of a continuously updated database).

mmd-osm commented 2 years ago

Just for clarification: is the tool support for country defaults @lonvia mentioned in https://lists.openstreetmap.org/pipermail/talk/2022-April/087426.html somehow in scope for improved geometry processing?

lonvia commented 2 years ago

It's in scope but we are still discussing to what extent.

Tool support for country defaults basically means an easy mechanism to determine for each OSM object which country it is in. With the recently merged get_bbox() function for relations, it's already possible to implement something like that in the flex lua scripts. You'd have to do the lookup from bbox to country through external libraries and then use the country to compute your country defaults.

What we are discussing is if osm2pgsql should offer some native function to determine the 'region' directly to make all this easier. The lookup from region to attribute defaults will likely remain in the responsibility of the script writer because it is fairly easy and efficient to do in Lua.
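To illustrate the two lookups (bbox to region, region to defaults), here is a small Python sketch. The region boxes, the region codes "AA" and "BB", the default values, and the bbox order (min_lon, min_lat, max_lon, max_lat) are all made up for this example; real country outlines are polygons and need a proper point-in-polygon lookup.

```python
# Hypothetical region boxes: (min_lon, min_lat, max_lon, max_lat).
REGIONS = {
    "AA": (0.0, 0.0, 10.0, 10.0),
    "BB": (10.0, 0.0, 20.0, 10.0),
}

# Hypothetical per-region tag defaults.
DEFAULTS = {
    "AA": {"maxspeed": "50"},
    "BB": {"maxspeed": "30"},
}


def region_for_bbox(bbox):
    """Return the region containing the centre of bbox, or None."""
    min_lon, min_lat, max_lon, max_lat = bbox
    lon = (min_lon + max_lon) / 2
    lat = (min_lat + max_lat) / 2
    for code, (west, south, east, north) in REGIONS.items():
        if west <= lon < east and south <= lat < north:
            return code
    return None


def with_defaults(tags, bbox):
    """Fill in region defaults; tags set on the object always win."""
    merged = dict(DEFAULTS.get(region_for_bbox(bbox), {}))
    merged.update(tags)
    return merged
```

The second step is a plain table lookup, which is why it can stay with the script writer; the first step is the part that would benefit from native support.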

zdila commented 2 years ago

I am missing simplify for (multi)polygons, which is mentioned in the issue description. The same goes for GeometryCollections (it would simplify all (multi)linestrings in there). I think it deserves a separate issue. May I create it? Also, in my case simplify was failing for multilinestrings, but AFAIK they should be supported.

tordans commented 2 years ago

Topic: Remove road stubs – I am processing OSM road data to create a road class network for evaluating bicycle infrastructure. For this, only those roads that are part of a network – as in "they connect to another road" – are interesting. This cleanup becomes more important, the more driveways and footways that lead to houses are mapped.

Eg: https://www.openstreetmap.org/#map=18/52.38108/13.59252 shows a lot of those stubs.

The naive approach I am taking is removing all ways <15m of specific highway types. However, that removes small segments in the middle of a road and leaves larger segments that do not lead anywhere, like a general highway=service with no road network connection.

I started looking into improving the checks with PostGIS: Take the start point, end point > create a buffer > check if roads are part of the buffer (except self) > Only keep those where this is true for both start and end. However, I did not get this query right, yet (code not yet open source).

So ideally, there is a way to specify which stubs I want to remove based on length and road class. Maybe even the class of road that needs to be connected to (consider (or not) a service road a stub if it connects to a footway).

Another thing I noticed is that this process might need to run multiple times: once I remove the first stub, there is a new one that fits the definition of a stub just as well.
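The repeated passes can be sketched in Python as pruning dangling short ways until nothing changes. This is only an illustration under assumptions: ways are taken to be already split at junctions and are given as hypothetical `(start_node, end_node, length_m)` tuples, and a stub is defined here as a way shorter than the threshold with at least one end that no other way touches.

```python
from collections import Counter


def remove_stubs(ways, max_length):
    """Repeatedly drop short ways that have a dangling end.

    ways: dict of way id -> (start_node, end_node, length_m).
    Returns the ways that remain once no more stubs are found.
    """
    kept = dict(ways)
    while True:
        # Count how many way ends meet at each node.
        degree = Counter()
        for start, end, _ in kept.values():
            degree[start] += 1
            degree[end] += 1
        stubs = [way_id for way_id, (start, end, length) in kept.items()
                 if length < max_length
                 and (degree[start] == 1 or degree[end] == 1)]
        if not stubs:
            return kept
        for way_id in stubs:
            del kept[way_id]


ways = {
    1: ("a", "b", 100.0),
    2: ("b", "c", 100.0),
    3: ("b", "d", 10.0),   # becomes a stub only after way 4 is gone
    4: ("d", "e", 5.0),    # dangling at node e
}
remaining = remove_stubs(ways, max_length=15.0)
```

In this example way 4 is removed in the first pass and way 3 in the second, while a short segment with both ends connected would be kept.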

The OSM area at https://www.openstreetmap.org/#map=15/52.3723/13.6252 is a good test ground since it has a lot of stubs and smaller road segments.

mboeringa commented 2 years ago

Topic: Remove road stubs – I am processing OSM road data to create a road class network for evaluating bicycle infrastructure. For this, only those roads that are part of a network – as in "they connect to another road" – are interesting. This cleanup becomes more important, the more driveways and footways that lead to houses are mapped.

@tordans ,

To be honest, I think this request is way beyond what osm2pgsql is currently capable of, and what osm2pgsql should become.

I think it is unrealistic - and probably even undesirable in the light of code complexity and maintainability of osm2pgsql - to have osm2pgsql become some sort of "full network topology" suite. There are other excellent open source (PostGIS based) solutions for that based on OpenStreetMap, and PostGIS itself has the "Topology" data type and options.

osm2pgsql should IMO stick to what it is currently doing excellently (and the new flex option enhances): process single OSM objects (nodes, ways, relations) / geometries as fast and efficiently as possible without requiring "network" type knowledge of the data.

The only exception to this, is the already partially enhanced capabilities to deal with "OSM relations" (whether multipolygon or any other type of valid OSM relation is irrelevant here, as they are essentially the same from a technical point of view), as osm2pgsql already has knowledge about relations in order to be able to process them, and relations are just a "single" object from the point of view of OSM.

So any geometries (nodes, ways) contained / referenced in such a relation can be - and already must be - processed as a whole, and specific relation processing is already partially possible in osm2pgsql v1.7.0, e.g., see my attempts to use the new capabilities of v1.7.0 to extract "main_stream" role relation members from OSM "waterway" relations:

https://github.com/openstreetmap/osm2pgsql/discussions/1752

mboeringa commented 2 years ago

@tordans ,

The other problem with this request might be that, as far as I understand the whole processing flow of osm2pgsql up to now, the required information to perform such operations simply isn't there at the stage of import itself:

Tables for nodes and ways are not yet fully built during the import process, so you cannot perform geometry-to-geometry type geometric processing
No spatial indexes have been built yet, because that happens as the final stage, so there is no knowledge of spatial relations at all (except those contained in OSM relations, e.g. 'route' or 'waterway'), and no mechanism to quickly look up a geometry based on its spatial location.

So the type of operations you are suggesting can essentially only be performed after the import.

joto commented 2 years ago

@mboeringa You are right, this issue is more about geometry processing in osm2pgsql and what @tordans wants to do is much more ambitious and currently not possible in the osm2pgsql framework. But I have been thinking about similar issues for other use cases I want to support somewhere down the line, for instance generalization of roads for small zoom levels. They have in common that they need some kind of network analysis of the road network that goes beyond simple geometry processing. That's definitely something we can't currently do in osm2pgsql, because osm2pgsql only looks at one object at a time, but it is something that we might be able to do with osm2pgsql and the database together, if we can find the right model of approaching this. It is definitely in scope for the generalization project, so I do want to hear about those use cases, although it would have fit better in a new discussion thread.

mboeringa commented 2 years ago

@joto

Interesting to read the German project description! I would caution you though, to carefully consider what is achievable in the project's funded limited time frame.

Adding "per-object" generalization capability, like the already implemented 'object.simplify()' function, is a kind of no-brainer, but performing high quality generalization (like the one I showed on my SOTM 2022 poster for woodland features of OpenStreetMap), involving potentially thousands of OSM objects being merged into one "representative" object, is far more complicated, especially in the light of the project's aim to support updates as well. E.g. how to keep track of such numbers of IDs of source objects, how to automatically update the features in a reliable and performant manner, and how to deal with a potential need to subdivide generalized features for performance reasons after generalization, seems a potentially daunting task... It is definitely an ambitious target.

My own developed "post-osm2pgsql-processing" routines are definitely not suitable for such "continuous update" type scenarios. In my case, I optimized them for high performance full re-imports, not continuously updatable databases.

By the way, I don't know if you already know it, but you may find Tomas Straupis' work and blog posts for the Lithuanian OSM community and public OSM based Lithuanian topographic maps an interesting read in the light of your project: https://www.openstreetmap.org/user/Tomas%20Straupis/diary Also see the Github repository of his project: https://github.com/openmaplt/vector-map/tree/master/db

kennykb commented 2 years ago

I have been thinking about similar issues for other use cases I want to support somewhere down the line, for instance generalization of roads for small zoom levels. They have in common that they need some kind of network analysis of the road network that goes beyond simple geometry processing. That's definitely something we can't currently do in osm2pgsql, because osm2pgsql only looks at one object at a time, but it is something that we might be able to do with osm2pgsql and the database together, if we can find the right model of approaching this. It is definitely in scope for the generalization project, so I do want to hear about those use cases, although it would have fit better in a new discussion thread.

(Agreed that we probably need a new thread for this... but... sorry, I'm here now.)

One use case for me has been the placement of highway shields on route concurrencies. This involves quite a bit of postprocessing at query time (surprisingly, it's not all that slow!) to construct the topology of strings of ways with the same cluster of routes.

You can see the code that I use at https://github.com/kennykb/osm-shields. shieldtables.lua populates the auxiliary tables, and then the heavy lifting happens in the analyze_merkers SQL function in queryprocs.sql.in. The key to how it works is that it extracts members of the same set of routes (not just the same routes), and then groups them as much as possible with the PostGIS ST_LineMerge function. I use this code regularly for my own work; it's almost certainly NOT ready for any kind of production, but it works well enough for me at the moment.
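The "same set of routes" grouping can be sketched in Python as follows. This is not the code from the repository above, only an illustration of the idea; the function names and the route refs "R1" and "R2" are made up, and the merging of each group into long lines (what ST_LineMerge does) is left out.

```python
from collections import defaultdict


def routes_per_way(routes):
    """Invert route membership: route ref -> way ids becomes way id -> refs."""
    result = defaultdict(set)
    for ref, way_ids in routes.items():
        for way_id in way_ids:
            result[way_id].add(ref)
    return result


def group_by_route_set(routes):
    """Group ways that carry exactly the same set of routes."""
    groups = defaultdict(list)
    for way_id, refs in sorted(routes_per_way(routes).items()):
        groups[frozenset(refs)].append(way_id)
    return dict(groups)


# Ways 2 and 3 carry both routes, so they form their own group.
groups = group_by_route_set({"R1": [1, 2, 3], "R2": [2, 3, 4]})
```

Each group would then get one shield cluster, placed along the merged geometry of its ways.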

mboeringa commented 1 year ago

Well, probably not directly relevant to the project and the new generalization options of osm2pgsql, but via the JTS GitHub repository, I stumbled on this project:

https://github.com/micycle1/PGS#morphology

If not for just the beautiful animations of an astounding array of geometric transformations, it might at least inspire some philosophical thinking about generalization options... ;-) Takes some time to load the page, but well worth the wait.
