osm2pgsql-dev / osm2pgsql

OpenStreetMap data to PostgreSQL converter
https://osm2pgsql.org
GNU General Public License v2.0
1.5k stars 474 forks source link

New "flex" backend #1036

Closed joto closed 4 years ago

joto commented 4 years ago

In the past weeks I have started working on a new output backend. I am calling this the "flex" backend, because it is more flexible than any of the other backends and should eventually be able to support all use cases currently supported by the other backends and many more. This flexibility is mostly due to an increase in the use of Lua scripts which are used for configuration as well as more powerfull callback functions.

This work has mostly been triggered by #230, but it also touches many other issues like #901 .

Here are some design thoughts:

Osm2pgsql design for a more flexible backend and Lua integration

We write a new backend according to the ideas presented in this document. The new backend should be able to do everything the current backends can do and would replace them in the long-term. (This also means that the C++-only transforms without Lua will not be supported forever.)

Table configuration

The new setup can work with any number of database tables. We provide Lua functions define tables and their columns. Different tables can have different column setups, for instance some tables might have lots of columns for tags ("old-style"), some have only an HSTORE or JSONB column for tags.

Lua callbacks

For every OSM object (node, way, or relation) that is processed by osm2pgsql, a Lua callback is called with the data from this object as parameter. The Lua callback can process the data of the OSM object in any way it likes and emits zero, one, or more table entries (for any table) that should be added by osm2pgsql. This is done by calling an add_row() function on those tables.

Processing in a Lua callback includes but is not limited to:

Part of this functionality would be supported by C++ functions callable from Lua (something like geom = MakeCentroid(geom)).

Getting data of related objects

For some use cases we need data from related objects that we want to add to the currently processed object. We might, for instance, want the roles and tags from all relations a way is a member of to be added to the way data in some form.

This is solved by allowing Lua functions to

So in the mentioned use case the following would happen:

  1. When processing a way with, say a highway tag, the Lua script tells osm2pgsql: Please find all relations having this way as member and call a Lua callback function later to re-process this way.
  2. In a later step osm2pgsql processes relations (either because they are in the input data and/or it gets relations out of the existing database) and calls the Lua callback function with the data of the way from step 1 and the relation data.
  3. This Lua callback can now use all of the data it gets and put them into any table. In the mentioned use case it could add a PostgreSQL ARRAY with relation tags to the way, or merge specific tags in a comma-separated string, or whatever is needed.

Note that this design means that finding related OSM objects is restricted to one level of relatedness. So you can bring together ways with their nodes and relations with their member nodes, ways, or relations. But further "nesting" of relations can't be resolved. This simplifies the design here considerably and should support enough uses cases. More complicated processing has to be done somewhere else.

Drawbacks of the design

The design has its problems:

Advanced Issues

Extra data

For some uses cases it might be useful to allow extra data to be added to the processing step. For instance we might want to have height data from somewhere outside OSM and add it to geometries. Or we want to add to each highway the information whether it is in a country with left- or right-hand-traffic.

The design of the Lua API should keep this in mind and allow this extra data to be made available somehow, but this is for a future step.

Prefiltering in C++

It might be useful for better performance to allow some kind of optional pre-filtering in C++ before Lua functions are called. So we might want to be able to define that a Lua function is only called for all OSM objects with a highway tag. This way we can save us the expensive Lua calls for everything that's not a highway.

We can see how the performance looks before deciding whether we need this, but it would change the configuration needed pretty fundamentally, so we have to do this before finalizing the implementation.

Import vs. Update Mode and Separation of Passes

The Lua scripts should have the information whether we are in "create" or in "append" mode. Or we have separate scripts for the modes.

We might also want to tell the Lua scripts whether they are run in pass 1 (when reading the data) or in pass 2 (when working on "marked" data).

It is currently unclear how the two passes needed for append mode and the two passed needed for the more complex relationship processing will interact.

mmd-osm commented 4 years ago

The complex interaction between Lua and C++ code (multiple phases, storing/retrieving data, marking objects for later processing) made me wonder if you had planned any kind of debugging/tracing/troubleshooting option for people writing Lua scripts?

joto commented 4 years ago

@mmd-osm You can use the full power of Lua to do that, in the simplest case just print out intermediate values etc. But you can also open a log file and write stuff there etc. If anything this becomes easier, because more of the magic isn't hidden in C++ code but plain to see in Lua code.

openstreetmap-tiles commented 4 years ago

On Thu, Dec 19, 2019 at 5:45 AM Jochen Topf notifications@github.com wrote:

In the past weeks I have started working on a new output backend. I am calling this the "flex" backend, because it is more flexible than any of the other backends and should eventually be able to support all use cases currently supported by the other backends and many more. This flexibility is mostly due to an increase in the use of Lua scripts which are used for configuration as well as more powerfull callback functions.

This work has mostly been triggered by #230 https://github.com/openstreetmap/osm2pgsql/issues/230, but it also touches many other issues like #901 https://github.com/openstreetmap/osm2pgsql/issues/901

This is really, really welcome news! Thanks for tackling it.

Given the discussion of #230 https://github.com/openstreetmap/osm2pgsql/issues/230, could I beg you to contemplate the possibility of allowing the designer to represent relations as database rows without geometry, but with auxiliary tables to represent the relation membership? I suppose the members could instead be tracked as arrays in the main row. I'm an old man, and an old-school DBA, and have rather a preference for breaking fields of variable cardinality off into their own tables, perhaps clustering with the main table.

https://github.com/kennykb/osm-shields/wiki/Proposal:-add-route-tables-to-osm2pgsql describes the particular use case that I have in mind. My current implementation of the rendering works by dredging the appropriate information out of the slim tables - which I recognize is deprecated, and I'd like to move away from. If I have enough hooks in the code that I can populate tables like these two and keep them up to date when osmosis runs, I can return to working on making osm-shields capable of coping with minutely updates and scaling to a larger map area than I currently attempt to render.

I'm not asking to have tables like this on the main OSM server. I do not intend to have the main render chain depend on them. I'm simply trying to find a way to support my experimental renderings with osm2pgsql, in a manner that isn't deprecated and doesn't require me to fork the project. The project https://github.com/kennykb/osm-shields for rendering US shaped highway shields and route concurrencies has been on hold for a little over a year, largely because I haven't had the time and energy to pursue retooling all the rest of my rendering chain to work with imposm3 instead of osm2pgsql. I had started an effort on adding support for this kind of table structure to osm2pgsql, but was informed that if I did develop it and submit a PR, the PR would be rejected.That response convinced me that I needed to switch the osm-shields project over to imposm3, but I've not been able to muster the time and energy to work through the impact that would have on the rest of my rendering chain and my other data consumers. You offer me a ray of hope that I won't have to tackle that rework.

Anticipating similar objections to the ones I've heard in the past:

'It can't perform well.' I'm looking to deploy this on my own server, not in the OSM main render chain. If I'm willing to accept the performance penalty (I am) and the burden of maintaining the code that supports it (I am, if the developers will have me), and the feature will not degrade the performance of code that does not use it (I treat that as a requirement), why is this such an issue.

'Implementing it will require a database reload.' It's my database. I'm not proposing to deploy this on the main servers, or anywhere else that a user isn't accepting it willingly.

'You can do what you want to with highway shields without this sort of support.' Please show me how. I've heard this assertion from several members of the osm2pgsql team. Alas, I haven't been smart enough to put the hints that they've given about how to implement such a thing into anything that resembles a coherent design that I can implement. Note that the messy North American highway network needs to worry about route concurrencies (very common!) and such things as having the state highway of one state briefly cross over into another state without changing its numbering or signage (About half of New York Route 120A https://www.openstreetmap.org/relation/407958 is in Connecticut, but it is signed and maintained by New York.) Moreover, the 'ref' tag does not contain sufficient information to choose the sign to use - a ref such as 'CR 104' identifies only that it is a county road, and the choice of sign needs to know which county. That information, as we just saw, cannot be reliably inferred by intersection with administrative boundaries.

joto commented 4 years ago

@openstreetmap-tiles The whole point of this effort is that you can do things like proper highway shields efficiently without having to have extra tables in your database. So I think if this all pans out the way I envision it now, this should be able to solve your problem. Allowing tables without any geometry was a small change, so I have added this to the code.

I have added an example highway-shields.lua that demonstrates how to a) get refs from the route relations into the ways (in a comma-separated list) and b) put route relations into their own table and add their ids to the member ways (in an array). Here is a rendering of that data. The "shields" you are seeing are rendered from the way geometries, but have the refs from the relations. Note the E22 and A1 refs on the bottom. I'll leave it to you to come up with better rendering. :-)

route-refs-as-shields

joto commented 4 years ago

@ImreSamu Please don't put everything and the kitchen sink into this issue. This is all not really related to the issue at hand and not useful to have everything in one huge issue.

ImreSamu commented 4 years ago

@joto : sorry ; I have removed the "noise" ..

nextstopsun commented 4 years ago

@joto Could you please add an example with an additional geometry column and calculating a centroid (i.e. for boundary=administrative features with admin_level=6)?

joto commented 4 years ago

@nextstopsun Multiple geometry columns and centroids don't work yet. Supporting multiple geometries is difficult, especially because a geometry might be split into several geometries (multi-geometries or splitting long lines), and I don't know yet how exactly to do this.

nextstopsun commented 4 years ago

@joto So in theory with the new flex backend one can set up a table with two or more columns of type geometry, right?

Which lua library could be used to perform operations on geometry (i.e. calculation of a centroid)?

joto commented 4 years ago

So far this is all a rough design and we don't know yet what the code will look like. I would like to support several geometry columns, because this is basically a generalization of a "normal" geometry plus the way_area we currently have and I see other use cases where this could be useful (polygon and centroid for instance), but as mentioned it isn't clear how exactly this would look like in the code and it might turn out to be to complex to do. osm2pgsql still needs to be reasonably simple to use and understand, so there is a limit on how complex things should get.

I am pretty sure we are not going to give Lua access to the actual geometries and are not using any Lua libraries for this, because it would be quite a lot of overhead "shipping" geometries from C++ to Lua code and back etc. All this should be done in C++ code for performance, but this would, of course, put some limits on what can be done.

I'd be interested in hearing about use cases and ideas people might have, this might influence some design decisions.

joto commented 4 years ago

The current version of the PR is now in pretty good shape. It doesn't do everything we might ever want (no multiple geometry columns yet, for instance), but it is getting towards something that might me mergeable. There is even some documentation and lots of examples in the same directory.

There are still things to do and to decide though. The main points are:

If you are interested in this issue, now would be a good time to test #1037 and give me feedback.

pnorman commented 4 years ago

I am pretty sure we are not going to give Lua access to the actual geometries and are not using any Lua libraries for this, because it would be quite a lot of overhead "shipping" geometries from C++ to Lua code and back etc. All this should be done in C++ code for performance, but this would, of course, put some limits on what can be done.

👍

  • Reliance on Lua: osm2pgsql would long-term not work without Lua any more. If this is seen as a problem, we could always keep the C++ transforms and users of that just don't have all possibilities then.

I don't see this as an issue. With osm-carto and styles based on it using Lua, the C transforms are not essential as long as we have reasonable default options.


I recall a mention of geometryless tables but can't find it again, had you discussed this?

nextstopsun commented 4 years ago

@joto could you please add an example where all tags are sent to jsonb column?

Is it possible to define postgresql db schema name for a table in lua config? (i.e. when you want to put tables to different schemas and tablespaces)

joto commented 4 years ago

@pnorman You can do geometryless tables simply by not defining a geometry. The flex-config/route-relations.lua has an example where a table simply stores relation tags.

@nextstopsun Filling a json(b) columns requires some Lua json library to create that json. You can then define a column with type json or jsonb and simply fill it with that data. If this is something a lot of people need, we might want to build it into osm2pgsql, but for the time being any json library in Lua should do.

There are some provisions in the flex backend code to set a schema for a table. But it doesn't work yet, because there are other changes that need to be done to the rest of the osm2pgsql code to make this work. I didn't want to do these changes before the flex backend is actually merged. So this is something for a bit later. Different tablespaces work as before, ie you can set a tablespace for all tables, but not individually.

joto commented 4 years ago

@nextstopsun I have added an example for how to deal with json here: https://github.com/joto/osm2pgsql/blob/flex-backend/flex-config/places.lua

mboeringa commented 4 years ago

Hi @joto

A couple of thoughts about a more "flexible" backend that are not directly related to having multiple tables, but other options that might benefit usage:

, e.g. "highway" or "aaa_highway"

Why would I want this? There are OpenStreetMap keys that are worth maintaining both as direct input by users, and as a tag converted column in the output table. By allowing to set prefixes for both original data columns and tag converted columns, you could maintain two sets of columns based on a single key, with one containing the original data and the other the tag converted value.

A classic example of a key that is worth maintaining both as an original value and as tag converted one, is the OSM direction key. It may contain both angles specified in degrees (0-359) and cardinal directions (NWSE).

You might then have two columns in the output table, one with the original values: aaa_direction '11' 'N' '340' 'NE'

And the other a tag converted one: bbb_direction 11 0 340 45

The tag converted column "bbb_direction" could then be used directly in applications to show e.g. the direction of viewpoints, while the original column "aaa_direction" still shows the original user input.

E.g. "addrhousenumber".

Of course, you could default to a single underscore, but allowing a custom string potentially allows setting a special marker and converting back to the original key. E.g., you might insert double underscore to signify the position of the colon versus the single underscore also used in openstreetmap keys.

E.g. "addr__housenumber" with double underscore could then be easily recognized as being "addr:housenumber" as key. This allows all kind of nice wizardry and automation of conversion between openstreetmap key names and database column names.

joto commented 4 years ago

@mboeringa With the flex backend, all names of tables and columns are set in your Lua code, so this is all just a question of writing a little bit of Lua code that does whatever naming you want. There is no need for any special prefixes or colon replacement characters. I suggest you give it a try and if there is something you can't do with Lua code, open an issue for that.

mboeringa commented 4 years ago

@mboeringa With the flex backend, all names of tables and columns are set in your Lua code, so this is all just a question of writing a little bit of Lua code that does whatever naming you want. There is no need for any special prefixes or colon replacement characters. I suggest you give it a try and if there is something you can't do with Lua code, open an issue for that.

Interesting. To be honest I have never messed with the Lua code up to now, so am unfamiliar with it yet. I do see references to what I presume are example table definitions in the pull's "files" view as linked below, e.g. in "simple.lua", that also include OpenStreetMap key names, but I don't see an example of a column being "renamed".

That is of course understandable, because the current system also simply uses the key names as column names since PostgreSQL doesn't have a an issue with things like colon in the field's name, but it leaves me wondering where of how to specify an "alias" or alternative name for the output column name.

https://github.com/openstreetmap/osm2pgsql/pull/1037/files

nextstopsun commented 4 years ago

Is it possible to setup "planarized" version of administrative borders table? I.e. I need a table with LINESTRING geometry loaded from ways with boundary=administrative plus all ways that form outer rings of bounday=administrative relations. Those ways must be distinct and have an array of admin_level values of rels to which they belong (similar to road shields example).

@joto Does flex backend have the power to solve this case?

joto commented 4 years ago

@mboeringa You simply name the columns whatever you want and later fill them:

buildings = osm2pgsql.define_way_table('buildings', {
    ....
    { column = 'addr_street', type = 'text' },
    ....
})
...
function osm2pgsql.process_way(object)
   ...
   buildings:add_row({ addr_street = object.tags['addr:street'] })
   )
   ...
end
joto commented 4 years ago

@nextstopsun Yes, in theory that should be possible. You'd have to store the admin_level information from the relations in memory and then, in stage 2, write all ways that either have those tags themselves or are members of relations that have those tags to the database. You will have to hold a lot of data in memory though, so I don't know how feasible this is. Also you have to handle all ways in stage 2 if you can't restrict this to ways with some specific tags. This will likely be too slow.

mboeringa commented 4 years ago

You simply name the columns whatever you want and later fill them:

@joto ,thanks. That explains it well and would work.

joto commented 4 years ago

The PR has now been merged and the flex backend is available. Please try it out and report problems or post ideas in new issues. Closing here.