gravitystorm / openstreetmap-carto

A general-purpose OpenStreetMap mapnik style, in CartoCSS
Other
1.53k stars 821 forks source link

Plan database re-import #1504

Closed matthijsmelissen closed 7 years ago

matthijsmelissen commented 9 years ago

This is a management issue for the upcoming re-import of the database. A re-import of the database gives us the opportunity to change the keys that are available to the stylesheet.

We will use the lua tag transformation option from osm2pgsql, and enable the hstore option.

Questions to be answered

The following changes have been requested:

Keys to be added

Adding hstore will avoid the need to manually add these keys.

Adding hstore will avoid the need to manually add these keys.

Pull request written, see https://github.com/openstreetmap/osm2pgsql/pull/346.

Pull request written, see https://github.com/openstreetmap/osm2pgsql/pull/346.

This has already been fixed in the current default lua style file.

matthijsmelissen commented 9 years ago

@pnorman @gravitystorm Could you have a look what else should be added to this issue?

matthijsmelissen commented 9 years ago

@pnorman wrote in #1243:

At the same time as introducing hstore, the number of fixed columns should be reduced, and this should lead to a noticeable performance gain, but this requires benchmarking work and query rewrites.

This is already quite a big project, and it lingers already for quite some time. Would it be an option to only enable hstore in the first database reload, and drop the columns that have become spurious due to hstore only in the second database reload (in 6-12 months)? I think that would also make things operationally much simpler, because we wouldn't need to coordinate the database reload and the stylesheet rollout that tightly.

As I said, it's quite a big project given the amount of developer time, and I'm afraid that if we want to do too much as the same time, we will keep pushing this issue forward forever. On the other hand, the performance benefit would be very welcome.

imagico commented 9 years ago

Just for clarification - hstore will allow later use of keys that were not originally on the list?

Otherwise a larger scale brainstorming on what important keys are currently missing would be in order.

And it's leaf_type, not leaftype by the way.

matthijsmelissen commented 9 years ago

Just for clarification - hstore will allow later use of keys that were not originally on the list?

Yes, but only for objects that have another key that is in the list. For example, if we have an object with man_made=pipeline and location=underground, we would be able to access underground even though underground is not in the list. However, we wouldn't be able to access an object only tagged emergency=ambulance_station if emergency is not in the list.

Otherwise a larger scale brainstorming on what important keys are currently missing would be in order.

Yes, I think we should have a larger scale brainstorming anyway (but hstore should reduce the number of tags that we need).

And it's leaf_type, not leaftype by the way.

Thanks, corrected, things seem to have changed since the issue was created.

imagico commented 9 years ago

Ok - so the most important thing is to have all primary keys on the list while secondary keys can also be used ad hoc later if the object has a known primary key.

What comes to mind is geological - not that widely used but widespread possible applications.

Other primary keys that are fairly widespread in use and could be considered for rendering are:

For secondary keys i would have glacier:type and seasonal

pnorman commented 9 years ago

This is already quite a big project, and it lingers already for quite some time. Would it be an option to only enable hstore in the first database reload, and drop the columns that have become spurious due to hstore only in the second database reload (in 6-12 months)? I think that would also make things operationally much simpler, because we wouldn't need to coordinate the database reload and the stylesheet rollout that tightly.

I've gotten most of the way through the style file converting to hstore, so I should soon be able to start seeing what columns shouldn't be in hstore.

matthijsmelissen commented 9 years ago

I've gotten most of the way through the style file converting to hstore, so I should soon be able to start seeing what columns shouldn't be in hstore.

That's good news! Any work in progress you can show?

pnorman commented 9 years ago

Pushed my dev branch up to https://github.com/pnorman/openstreetmap-carto/tree/hstore

matthijsmelissen commented 9 years ago

Great work! It looks indeed like most work on rewriting the queries has been done.

Would it be easy to convert the data from the .style file to the .lua file?

I saw that you dropped for example aerialway from the .style file. Just to check my understanding, I suppose you'll need to add it back at a later point?

pnorman commented 9 years ago

I saw that you dropped for example aerialway from the .style file. Just to check my understanding, I suppose you'll need to add it back at a later point?

That's the unknown. There is no need to have any columns for specific tags, as anything not in a column is in with hstore.

That doesn't mean it's a good idea. For some, for example amenity, landuse, highway, natural, and building, it is almost certainly a good idea to have them as a column. To my knowledge, no one has done any rigorous testing of what should and should not be a column. That's my intent.

matthijsmelissen commented 9 years ago

That's the unknown. There is no need to have any columns for specific tags, as anything not in a column is in with hstore.

I'm still a bit confused about it all. How will osm2pgsql know whether to add an object with an aerialway as a row to the database? Currently we filter out objects that do not have any of the keys in the list of keys. Will this remain the case? If so, how does osm2pgsql know not to filter out aerialways if there is no aerialway in the list of keys?

However, we wouldn't be able to access an object only tagged emergency=ambulance_station if emergency is not in the list.

Do I understand it correctly that this is incorrect?

pnorman commented 9 years ago

However, we wouldn't be able to access an object only tagged emergency=ambulance_station if emergency is not in the list. Do I understand it correctly that this is incorrect?

Yes, the statement you quoted is incorrect.

For this part of SQL rewrites I am importing with --hstore-all but this would be --hstore.

https://github.com/openstreetmap/osm2pgsql/blob/master/docs/usage.md#hstore might help explain the options a bit better. We would not be using --hstore-match-only, as it would require us to reload the database for every new tag key we wanted to render, and will always offer minimal gains for the standard style

Previous research showed

It is immediately obvious that --match-only has virtually no impact when used with the standard style because most objects have at least one tag used by default.style.

Other hstore options like -j/--hstore-all and --match-only are of minimal use except for specialized cases.

.style development is one of the limited situations where --hstore-all is worthwhile, as it reduces the number of DB reloads I need to do while changing the SQL.

matthijsmelissen commented 9 years ago

Thanks for the clarification, I updated the first post.

We should still decide if there are keys / values for which we want to change whether closed lines are interpreted as polygon or linestring, right?

brycenesbitt commented 9 years ago

Tag toilets:disposal would be nice, so that the option exists to render flush toilets differently from chemical and pit toilets. https://github.com/gravitystorm/openstreetmap-carto/issues/1508

drinking_water or drinking_water=no would also be good, to render campgrounds and huts with no water with a modified symbol indicating the lack of water.

denotation may be there already. If not, it lets you render prominent trees more, um, prominently.

matkoniecz commented 9 years ago

it lets you render prominent trees more, um, prominently

I think that more prominent rendering for named trees would be enough. I also see a problem with that tag (see http://wiki.openstreetmap.org/wiki/Talk:Key:denotation ).

gravitystorm commented 9 years ago

As @pnorman has explained, the discussion of which tags to "add" is unnecessary, so lets not spend time on it. All features will appear when we use --hstore and all keys will be available (either in hstore or in normal columns).

brycenesbitt commented 9 years ago

@gravitystorm even on review, the above discussion does not clearly reflect the magnitude of this shift, especially given the long history of the database limitations. Could you confirm: 1) All keys will now be available to the rendering, even keys invented in the future. The list of new keys above does not represent a limited set of new rendering options. 2) There is performance advantage for certain key/values to end up with dedicated columns. 3) @pnorman is trying to work out a more optimal set of keys for dedicated columns.

matthijsmelissen commented 9 years ago

@brycenesbitt the best person to ask these questions to is @pnorman.

mboeringa commented 9 years ago

As @pnorman has explained, the discussion of which tags to "add" is unnecessary, so lets not spend time on it. All features will appear when we use --hstore and all keys will be available (either in hstore or in normal columns).

+1, it's great to have this, it will help you all a lot. This is what I already have available to me for my personal renderer in development (thanks to ESRI implementing it's own key/value storage for OSM based data).

@gravitystorm even on review, the above discussion does not clearly reflect the magnitude of this shift, especially given the long history of the database limitations. Could you confirm: 1) All keys will now be available to the rendering, even keys invented in the future. The list of new keys above does not represent a limited set of new rendering options. 2) There is performance advantage for certain key/values to end up with dedicated columns. 3) @pnorman is trying to work out a more optimal set of keys for dedicated columns.

Ultimately, @pnorman indeed may answer best, but from what I know and read, I think these are all sound conclusions...

gravitystorm commented 9 years ago

Could you confirm:

1) Yes 2) Yes 3) Yes

:-)

mboeringa commented 9 years ago

As a mild reminder, this of course does not solve the true "cartographic" problems of how to cope with, and display, all of this data at various scales... that will still require thorough human evaluation and consideration (not even taking into account possible technical challenges when more features are going to be rendered).

But that should certainly not withhold on implementing this in any way as @gravitystorm already pointed out ;-)

pnorman commented 9 years ago

9 tags left in my conversion, then I can start benchmarking.

There's a bunch of SQL cleanups I want to do to master once we've worked out the .style file and before a more precise conversion.

kocio-pl commented 9 years ago

Are there any news of progress or outstanding problems regarding database re-import?

pnorman commented 9 years ago

Some progress with hstore testing. I have the benchmark setup all working, and have tested name and am running building tests next. There was quite a delay between the two since I needed to do some osm2pgsql testing, which used the same machine.

matthijsmelissen commented 9 years ago

Thanks for the update! What other steps are necessary before we can roll this out?

HolgerJeromin commented 9 years ago

As far as i understood https://github.com/mapnik/mapnik/issues/2875 one main point for osm-carto is:

For consistent behavior between 2.2 and 3.0 use vertical-alignment: middle . This may require adjustments to text-dy .

matthijsmelissen commented 9 years ago

As far as I know, the database re-import and the Mapnik upgrade are independent issues.

HolgerJeromin commented 9 years ago

You mean hstore availability and mapnik 3. Sure, you are right. Got confused :-)

rrzefox commented 8 years ago

Has there been any recent progress on determining which tags should be put into extra columns and which just go into hstore? Can't this mostly be determined algorithmically? Since most time will be spent in I/O, shouldn't the fastest database layout in almost all cases be the one that consumes the least space? In that case, we should be able to calculate that. I'm unsure about the postgresql internals, my experience here is "5 minutes of google", my formula may be wrong and reality is probably a bit more complex than that, but I'll try anyways: If pg_column_size() is correct, then storing a key-value pair into the hstore column uses length(key) + 1 (for terminating \0) + length(value) + 1 + 6 bytes, but ONLY for rows where the tag appears. In contrast, an extra column always uses 1/8th of a byte for the null bitmap per row, plus length(value) + 1 when the tag is actually used. Lets put that together for two examples. Since the values use up the same amount of space in both cases, we can simply leave them out.

osm=# select count(*) from planet_osm_polygon where building is not null;
   count
-----------
 164180837
(1 row)

osm=# select count(*) from planet_osm_polygon where lock is not null;
 count
-------
   232
(1 row)

osm=# select count(*) from planet_osm_polygon;
   count
-----------
 197455663
(1 row)

So a huge percentage of polygons have the tag building=... Space usage for the hstore case: 164180837 * ((len("building")+1+6) = 15) = 2462712555 Bytes or 2463 M For the extra column: 197455663 * (1 / 8) = 24681958 Bytes or 25 M. I'd say we have a clear winner here, but that was expected.

If we try the same for some rarely used tag: lock=... Extra column uses same space as above, 25 M. hstore on the other hand: 232 * ((len("lock")+1+6) = 11) = 2552 Bytes or 0 M. I'd say it's a safe bet to drop that column...

matthijsmelissen commented 8 years ago

Pinging @pnorman - what do you think?

pnorman commented 8 years ago

Size is not a good metric. It would indicate that tags like source should have their own column. Where clause usage is more significant, and it shows up in query execution times with filtering and in query plan quality with poor hstore statistics.

Given that right now it's more important to improve the cartography and cut back on rendered features, I've only gotten partway through looking at hstore performance.

rrzefox commented 8 years ago

That's not a realistic example though, because source is not currently a column; on the contrary, it is explicitly blacklisted, so it doesn't even get added to hstore-enabled databases. Also, I wasn't suggesting to completely turn off ones brain while evaluating that.

I tried to do some actual benchmarking for what I suggested, but didn't get very far. Due to space constraints, I was only able to use an extract of Germany and not a full planet. The few results I got make it clear for me though that my suggestion was not good - not for the reasons pnorman suggested, but because space savings were absolutely negligible. What I compared was the default stylesheet and the modified one at https://gist.github.com/rrzefox/b6ad215d7e834ad123a1 - the modified one was created as suggested above, by removing those columns that would provide a space saving. Initial Import: 17337 (default) vs. 17124 (modified) seconds (-1%) Space usage: 40516668 vs. 40477956 KB (-0%) Rendering from that DB was slower (as expected, because the machine had enough memory to fit the whole DB into cache), by up to 4 percent. Part of that may have come from the fact that I did not spend time on converting the whole stylesheet to hstore, instead I just did it like with the german style: added a view that maps the database fields (https://gist.github.com/rrzefox/ac64742a7e3a7406c9f5) and only searched+replaced planetosm* by viewosm*

The hardly measurable space savings IMO suggest that it's really not worth the effort to move columns to hstore. pgsql is apparently extremely efficient at storing large amounts of null columns. Or are there cases where hstore-columns actually IMPROVE query execution times?

pnorman commented 8 years ago

Or are there cases where hstore-columns actually IMPROVE query execution times?

Yes. The gains I got from dropping some seldom-used columns was greater then the loss for adding hstore.

matthijsmelissen commented 8 years ago

the loss for adding hstore.

@pnorman Do you have any idea what the performance loss from adding hstore is, if we keep using the old columns, don't drop any columns from the database, and don't refer to hstore in the queries? That would be a good starting point for comparing the effects on performance.

matthijsmelissen commented 8 years ago

I just found out that this was answered here - using hstore gives a 0.24% speed decrease, and a 10% size increase.

matthijsmelissen commented 8 years ago

I plan to continue working on this again.

Decide for which key/value combinations we want to change the polygon vs linestring behaviour for closed ways.

I'm thinking about this question in particular now. I asked a question here historic=citywalls, waterway=breakwater, and waterway=groyne. I'd also like to invite comments from contributors of this repository on that discussion.

matthijsmelissen commented 8 years ago

I also opened https://github.com/openstreetmap/osm2pgsql/pull/532, which should also be useful for this project.

matthijsmelissen commented 8 years ago

I have created a branch called hstore against which PRs can be made that prepare for the database reload.

matthijsmelissen commented 8 years ago

@pnorman I still don't fully understand the relation between style.lua and default.style.

For example, which of the two processes is run first?

Some functionality seems to be present in both style.lua and default.style, for example the line/polygon decision. If both files have contradictory rules on this aspect, what has priority?

Same question for deleting keys: if a key should be deleted (for example a key added by an import), is it sufficient to either include a line with delete flag in default.style, or to include it in delete_tags in style.lua?

I hope you can help with these questions I have!

pnorman commented 8 years ago

style.lua is an example file, and not included in an osm2pgsql install. default.style is installed, and is the style file used by default.

When doing a normal osm2pgsql import, no lua file is used, columns come from the style file, and the C tag transforms are used for determining what column values are in the database, if something goes in the roads table, and if something is a polygon.

With lua tag transforms, they determine what the C tag transforms would, and the columns still come from the style file.

pnorman commented 8 years ago

We now have a lua branch to work on. The next steps are

matthijsmelissen commented 8 years ago

Add tests Profile more of the Lua code

IMO these would be nice-to-haves, not must-haves that are release-blocking.

pnorman commented 8 years ago

Once we've merged Lua, we should tag v3.0.0, but continue to backport to v2.x because it will take time for our users to reimport.

sommerluk commented 8 years ago

As far as I understand, the code for database reload is now in the lua branch. Currently, this branch shall contain only the code changes for the reload itself, and no rendering-related bug fixes (at least until deployed to the openstreetmap.org servers).

What are the plans/timeframe for merging the lua branch into master? What are the next steps?

pnorman commented 8 years ago

I added a new label lua so we can tag issues if they're with that branch

pnorman commented 8 years ago

So, the big issue is testing the Lua transforms. At some point I'll probably set up a rendering server with the new branch, but I'd like other contributors who are not maintainers to try it locally first. If there is not sufficient interest for people to do this, it will likely take longer.

sommerluk commented 8 years ago

I'd like other contributors who are not maintainers to try it locally first

@pnorman I’m not sure if this is the kind of things you want to hear about, but I can load Ivory Coast in the database without errors. Output here:

osm2pgsql version 0.91.0-dev (64 bit id space)

Using lua based tag processing pipeline with script /home/sommerluk/Documents/MapBox/project/openstreetmap-carto/openstreetmap-carto.lua
Using projection SRS 3857 (Spherical Mercator)
Setting up table: planet_osm_point
Setting up table: planet_osm_line
Setting up table: planet_osm_polygon
Setting up table: planet_osm_roads
Allocating memory for sparse node cache
Node-cache: cache=800MB, maxblocks=12800*65536, allocation method=9
Mid: pgsql, scale=100 cache=800
Setting up table: planet_osm_nodes
Setting up table: planet_osm_ways
Setting up table: planet_osm_rels

Reading in file: OSM-Daten/ivory-coast-latest.osm.pbf
Using PBF parser.
Processing: Node(2146k 102.2k/s) Way(198k 10.45k/s) Relation(770 154.00/s)  parse time: 45s
Node stats: total(2146499), max(4251283644) in 21s
Way stats: total(198607), max(425678475) in 19s
Relation stats: total(776), max(6250829) in 5s
Committing transaction for planet_osm_point
Committing transaction for planet_osm_line
Committing transaction for planet_osm_polygon
Committing transaction for planet_osm_roads
Setting up table: planet_osm_nodes
Setting up table: planet_osm_ways
Setting up table: planet_osm_rels
Using lua based tag processing pipeline with script /home/sommerluk/Documents/MapBox/project/openstreetmap-carto/openstreetmap-carto.lua
Setting up table: planet_osm_nodes
Setting up table: planet_osm_ways
Setting up table: planet_osm_rels
Using lua based tag processing pipeline with script /home/sommerluk/Documents/MapBox/project/openstreetmap-carto/openstreetmap-carto.lua

Going over pending ways...
        127849 ways are pending

Using 2 helper-processes
Finished processing 127849 ways in 27 s

127849 Pending ways took 27s at a rate of 4735.15/s
Committing transaction for planet_osm_point
Committing transaction for planet_osm_line
Committing transaction for planet_osm_polygon
Committing transaction for planet_osm_roads
Committing transaction for planet_osm_point
Committing transaction for planet_osm_line
Committing transaction for planet_osm_polygon
Committing transaction for planet_osm_roads

Going over pending relations...
        0 relations are pending

Using 2 helper-processes
Finished processing 0 relations in 0 s

Committing transaction for planet_osm_point
WARNING:  there is no transaction in progress
Committing transaction for planet_osm_line
WARNING:  there is no transaction in progress
Committing transaction for planet_osm_polygon
WARNING:  there is no transaction in progress
Committing transaction for planet_osm_roads
WARNING:  there is no transaction in progress
Committing transaction for planet_osm_point
WARNING:  there is no transaction in progress
Committing transaction for planet_osm_line
WARNING:  there is no transaction in progress
Committing transaction for planet_osm_polygon
WARNING:  there is no transaction in progress
Committing transaction for planet_osm_roads
WARNING:  there is no transaction in progress
Sorting data and creating indexes for planet_osm_point
Sorting data and creating indexes for planet_osm_line
Sorting data and creating indexes for planet_osm_polygon
Sorting data and creating indexes for planet_osm_roads
Copying planet_osm_point to cluster by geometry finished
Creating geometry index on planet_osm_point
Copying planet_osm_roads to cluster by geometry finished
Creating geometry index on planet_osm_roads
Creating osm_id index on planet_osm_roads
Creating indexes on planet_osm_roads finished
Creating osm_id index on planet_osm_point
All indexes on planet_osm_roads created in 5s
Completed planet_osm_roads
Creating indexes on planet_osm_point finished
All indexes on planet_osm_point created in 7s
Completed planet_osm_point
Copying planet_osm_line to cluster by geometry finished
Creating geometry index on planet_osm_line
Copying planet_osm_polygon to cluster by geometry finished
Creating geometry index on planet_osm_polygon
Creating osm_id index on planet_osm_line
Creating indexes on planet_osm_line finished
All indexes on planet_osm_line created in 14s
Completed planet_osm_line
Creating osm_id index on planet_osm_polygon
Creating indexes on planet_osm_polygon finished
All indexes on planet_osm_polygon created in 15s
Completed planet_osm_polygon
Stopping table: planet_osm_nodes
Stopped table: planet_osm_nodes in 0s
Stopping table: planet_osm_ways
Building index on table: planet_osm_ways
Stopped table: planet_osm_ways in 31s
Stopping table: planet_osm_rels
Building index on table: planet_osm_rels
Stopped table: planet_osm_rels in 0s
node cache: stored: 2146499(100.00%), storage efficiency: 50.00% (dense blocks: 0, sparse nodes: 2146499), hit rate: 100.03%

Osm2pgsql took 119s overall

I can also open it in Tilemill. I’ve startet with zoom level 1 and zoomed in (in various places that I know) up to level 20. I did not get a rendering error. I also did not remark changes in rendering at a first glace. But it’s of course not a systematic test.

pnorman commented 8 years ago

I can also open it in Tilemill. I’ve startet with zoom level 1 and zoomed in (in various places that I know) up to level 20. I did not get a rendering error. I also did not remark changes in rendering at a first glace

Good

kocio-pl commented 8 years ago

I can try it too, but I'm not sure how should the test look like, roughly the same as the one above?

Is the import log worth pasting or it's not interesting until the rendering seems to be working?

How many countries should we check (and which ones)?

pnorman commented 8 years ago

Is the import log worth pasting

Only if there are errors