gravitystorm / openstreetmap-carto

A general-purpose OpenStreetMap mapnik style, in CartoCSS

Release policy and/or infrastructure #1458

Closed: systemed closed this issue 9 years ago

systemed commented 9 years ago

It's good to see so much activity on osm-carto.

An unintended side effect, however, is that the rendering queue is now frequently bogged down after a release, making it difficult to view maps on osm.org - particularly at high zoom levels. The mapper is unable to view his/her changes, and navigating the map to get to the area he/she wants to edit is made difficult or impossible.

In some cases it's easy to accept this as a trade-off for a frequently updated style, but when the changes are comparatively minor, we need to consider whether the impact is worth it. This issue is placed here to brainstorm possible solutions.

A few to start:

(Any solution could be an osm-carto one, an OSM Operations one, or an openstreetmap-website one. Rather than open three separate issues I thought it more helpful to start with one repository and see where the discussion leads.)

kocio-pl commented 9 years ago

In fact it's stranger than this - I often see my changes rendered almost instantly, but then the tiles are "reverted" to the older version (the data are not). Not all of them are reverted, the timing of the "revert" isn't fixed either, and even if I use /dirty I can't be sure when the newer version will be back.

Maybe some caching/load balancing problems? I thought it was a Firefox issue with its cache, but Chromium was no different.

matthijsmelissen commented 9 years ago

I agree that the full rendering queues on style changes are an issue.

I have no idea how much of a pain this would be to implement on the server side, but what about rolling out style changes gradually? For example, rolling them out by zoom level (using a moving cut-off zoom level below which the new style is used and above which the old style is used), or by longitude (using the new style west of a certain longitude and the old style east of it).

By the way, I think this is primarily an https://github.com/gravitystorm/openstreetmap-website issue rather than a https://github.com/gravitystorm/openstreetmap-carto issue.

tomhughes commented 9 years ago

No it bloody well is not an https://github.com/gravitystorm/openstreetmap-website issue. It might be an operations issue but it certainly isn't an issue with the web site code.

There seems to be a huge amount of misinformation going about anyway. I see people on IRC thinking, for example, that not re-rendering the low zooms would somehow help. It wouldn't, as that happens in the background before the style is released more generally, and in any case user-requested tiles take priority.

What would help is if you guys could get it right first time so we don't wind up pushing an emergency change on a Sunday evening leaving us without the quiet time of the weekend to get started on catching up...

Rovastar commented 9 years ago

Easy, Tom. There is no need for that. Firstly, it is not strictly a carto issue either. This project is about a style. The processes around if and how it is rendered are not defined by this GitHub project, which was in existence before it was included at osm.org.

This project is constantly changing. I am not sure much discussion has ever taken place about the processes around it, and I'm not entirely sure they belong here. In many ways the website GitHub would make more sense.

Maybe a more structured release process is needed, with slots for releasing stuff and a process for doing emergency changes. This would apply to both website and carto changes. Ideally everything would go through a QA/staging server and be approved and signed off.

Bitching about how everyone else has to be perfect is not helpful. I am sure your glass house wouldn't like that.

matthijsmelissen commented 9 years ago

No it bloody well is not an https://github.com/gravitystorm/openstreetmap-website issue. It might be an operations issue but it certainly isn't an issue with the web site code.

You're right, it seems I was misinformed about the scope of openstreetmap-website. The operations side does not have an issue tracker, does it? But my point was more that, except for the 'less frequent releases' solution, all the other solutions are things we have no control over. That doesn't mean we don't want to help discuss a solution, of course.

What would help is if you guys could get it right first time so we don't wind up pushing an emergency change on a Sunday evening leaving us without the quiet time of the weekend to get started on catching up...

Yes, this definitely is an openstreetmap-carto issue. We really need a system for unit testing. In any case thanks for the quick response!

imagico commented 9 years ago

It seems to me the problem is not so much the high frequency of planned releases but the need for another bugfix release soon after (which has happened with nearly all releases since the start of this year). The obvious solution here would be more thorough testing. With the amount of development activity currently occurring, this seems crucial to me for maintaining quality. Also keep in mind that, for example, in 2.28.1 there were a lot of unfixed bugs that waited until 2.29.0 to be fixed - which was one of the longest gaps between releases in recent history.

Of course the ideal thing would be to have a global test environment, as has been discussed several times already as something nice to have. But IMO a smaller-scale test deployment with just a few sample areas would already be sufficient to detect most of the bugs that occurred recently.

Another suggestion: Maybe those making the pull requests should be a bit more thorough in testing their own changes before asking to have them merged and the style maintainers should wait a bit longer before merging things (there have been quite a few cases recently when a PR was merged just a few hours after it was made leaving no time for a community review).

And another thing: maybe announce releases a few days before they are rolled out and call for everyone interested to test it in their local test environments.

tomhughes commented 9 years ago

To explain a little bit how we do the rollout:

  1. We update the style on the server
  2. We kick off a background render of z0-z12 but user requests should take priority
  3. We touch the "planet file" which is really just a timestamp

We normally aim to start in the early evening so that the low zoom update can run overnight. Ideally a Friday evening is chosen so the high zooms can start updating over the weekend when load is lower.

Once the "planet file" is touched tiles will consider themselves out of date if they are older than that file and the server will try to rerender them if possible. If it can't it will serve the old one still.

The big problem is unrendered tiles, as they are competing with out-of-date tiles for resources, but we don't have the option of serving old data if the render takes too long. I think there is some prioritisation of unrendered over out of date but I'm not sure how well it works.

Another problem is people who want to see the changes they've made rendered, but it doesn't happen because their tile is only considered old, and hence non-existent tiles take priority.

If step 3 wasn't done tiles would only rerender when the data on them changed or they hit their natural expiry (if they have one - not sure about the details there).
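
A minimal sketch of the staleness test described above, assuming the file name shown (illustrative Python, not mod_tile's actual code):

```python
import os

PLANET_FILE = "planet-import-complete"  # the timestamp file touched on a style rollout

def tile_state(tile_path, planet_file=PLANET_FILE):
    """Classify a tile the way steps 1-3 above imply: missing tiles must be
    rendered before they can be served, while stale tiles can still be served
    from disk while a re-render is attempted."""
    if not os.path.exists(tile_path):
        return "missing"
    if os.path.getmtime(tile_path) < os.path.getmtime(planet_file):
        return "stale"   # older than the planet file, so the style (or data) is newer
    return "fresh"

# After `touch planet-import-complete`, every existing tile reports "stale".
```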

One thing that might be good is to make tiles with changed data take priority over tiles where only the style has changed, so there would be four priority levels:

  1. No tile exists yet
  2. Tile exists but data has changed
  3. Tile exists with same data but style has changed
  4. Background jobs (like low-zoom render)

That may already be what happens, or may need changes to mod_tile/renderd though. I can't really say I understand the details of all the queues there. Maybe @apmon can help?
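
To make that concrete, here is a hedged sketch of such a four-level classification (purely illustrative; the real decision lives in mod_tile/renderd, and `data_changed` would come from whatever expiry information the diff imports produce):

```python
import os

def render_priority(tile_path, data_changed, planet_file="planet-import-complete"):
    """Map a render request onto the four levels listed above (sketch only)."""
    if not os.path.exists(tile_path):
        return 1  # no tile exists yet
    if data_changed:
        return 2  # tile exists but its data has changed
    if os.path.getmtime(tile_path) < os.path.getmtime(planet_file):
        return 3  # same data, but the style changed since the tile was rendered
    return 4  # nothing urgent; leave it to background jobs like the low-zoom render
```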

matthijsmelissen commented 9 years ago

By the way, there were 2.5 hours between the introduction of one of the bugs and my bugfix. The new version was tagged in this 2.5-hour interval. I knew the new version was expected to be rolled out that week, but I didn't know in advance it would be in this 2.5-hour interval.

So this process we could certainly improve (cc @gravitystorm). A couple of days feature stop, or at least an announcement in advance when the new version would be tagged, would be helpful.

tomhughes commented 9 years ago

@math1985 There is an operations tracker at https://github.com/openstreetmap/operations/ but we do already have https://github.com/openstreetmap/operations/issues/5 open for fixing the issues with yevaud which is where I think the worst of the problems are.

Actual changes to how the deployment is done are really about the chef scripts, which are managed through https://github.com/openstreetmap/chef

tomhughes commented 9 years ago

@math1985 Nobody told me about it though... I was aware a point release had been proposed, and I checked at that point but it hadn't been tagged yet. It was apparently tagged shortly thereafter but nobody told me until last night or I would have rolled it out earlier.

I wasn't told about the 2.29.0 release either BTW but I learnt about that from the mailing list post and rolled it out.

matthijsmelissen commented 9 years ago

rolling them out by zoom level (using a moving cut-off zoom level below which the new style is used and above which the old style is used), or by longitude (using the new style west of a certain longitude and the old style east of it).

@tomhughes What would it take to implement this? Does this require mod_tile development, or would it be possible on an operations level to divert different tiles to different osm-carto versions?

matthijsmelissen commented 9 years ago

@tomhughes As far as I'm aware, the process is currently that only @gravitystorm as lead developer tags releases, and contacts you. Perhaps also a process that needs reconsidering in case of emergency releases.

@imagico I try to merge code rewrites with few consequences quickly, and leave more substantial PRs open. I doubt someone would have voluntarily reviewed the commit that introduced the bug.

tomhughes commented 9 years ago

It's not really the osm-carto version that matters, it's how it decides the tile is out of date.

Currently there is a file (defaults to planet-import-complete in the top level of the tile directory) whose timestamp is compared to the tile. As you can tell from the name it was originally used to indicate when a new planet had been imported in the days before rolling updates. Today we use it to indicate when the style has changed.

If that existed per-zoom then obviously we could stage things, but that would need support in mod_tile/renderd.
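
A hedged sketch of what a per-zoom variant might look like (the per-zoom marker files and the tile directory path are hypothetical; as noted, mod_tile/renderd would need to learn about them first):

```python
import os

def stale_for_zoom(tile_path, zoom, tile_dir="/srv/tiles"):
    """Hypothetical per-zoom staging: each zoom level gets its own marker file,
    so a style rollout could be staged zoom by zoom instead of all at once."""
    marker = os.path.join(tile_dir, f"planet-import-complete.z{zoom}")
    if not os.path.exists(marker):
        return False  # this zoom level hasn't been switched to the new style yet
    return os.path.getmtime(tile_path) < os.path.getmtime(marker)
```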

tomhughes commented 9 years ago

@math1985 I believe both releases were done by @pnorman this time, and he just didn't realise he needed to let me know about them. He does now ;-)

gravitystorm commented 9 years ago

OK, let's start with the easiest one - release notification. From now on, when a release is made, whoever has done so will create a ticket on https://github.com/openstreetmap/chef indicating the new version number. Then any of the OSMF sysadmins can choose to update the deployed version at whatever time they see fit. This scales beyond just me and @tomhughes notifying each other, in case either of us is unavailable.

gravitystorm commented 9 years ago

Bigger releases, less often

We're going to stick to the same release frequency here, which is currently targeting fortnightly but is actually more like 3-4 weeks. If OSMF decide that, for operational reasons, they want to do less frequent roll-outs to their servers or delay a particular deployment to a convenient time of the week, that's their choice.

I come under a lot of pressure for not releasing fast enough as it is, and I don't want to release less frequently than we currently are, nor do I want releases to be based on OSMF tileserver rendering queue status - that would be the tail wagging the dog!

matthijsmelissen commented 9 years ago

I come under a lot of pressure for not releasing fast enough as it is

If you're referring to me, my main request was to accept pull requests faster (because of merge conflicts). Releasing fast is in comparison less important. Personally I'm fine with, for example, monthly releases.

gravitystorm commented 9 years ago

the need for another bugfix release soon after (happened with nearly all releases since the start of this year)

Yes, this is a problem that we need to get on top of, while remembering that we're dealing with a large number of changes (132 commits in the last release).

We could go back to the previous method, where I hand-reviewed every change, but that wasn't scalable. We can certainly be more careful with our PR merging. And as I've said before, we need unit-tests (and not a dev server or large test areas or whatnot).

A couple of days feature stop, or at least an announcement in advance when the new version would be tagged, would be helpful.

It would be nice, but I don't think either is workable. I find it very hard (i.e. impossible) to know 2 days in advance when I'll definitely have time to test the release before tagging it, so we'd face extended periods where the feature stop ends up being 5-7 days, and I think that would infuriate everyone. We need to keep master releasable 24x7, which means cutting down on errors in PR merging (i.e. by having unit tests). As a last resort, I'll move to release branches, i.e. branch, wait a few days and tag at some point after that, but I'd prefer not to do so since I think we'll still have the same problems.

But what we need is unit tests. And that means I need to go and fix ruby-mapnik as a priority.
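
As an illustration of the kind of automated check being discussed, a minimal render smoke test could look something like the sketch below. It uses python-mapnik rather than ruby-mapnik, and assumes a `project.xml` already compiled by carto plus a populated rendering database; none of this is the project's actual test setup.

```python
import os
import mapnik

def test_style_loads_and_renders():
    """Smoke test: the compiled style loads and a small extent renders to a PNG."""
    m = mapnik.Map(512, 512)
    mapnik.load_map(m, "project.xml")  # XML produced by running carto on project.mml
    # Extent in web-mercator metres, roughly central London.
    m.zoom_to_box(mapnik.Box2d(-30000, 6690000, 10000, 6730000))
    mapnik.render_to_file(m, "/tmp/osm-carto-smoke.png", "png")
    assert os.path.getsize("/tmp/osm-carto-smoke.png") > 0
```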

gravitystorm commented 9 years ago

Finally, there's the issue of how the stylesheet is optimally deployed to balance the competing needs of refreshed tiles, tiles with new data, missing tiles, deployment at weekends, balancing rendering load between servers and upgrading servers. All of these are out of scope for this repo, and are up to OWG to sort out on https://github.com/openstreetmap/operations

imagico commented 9 years ago

@math1985 - I did not mean to point fingers at anyone with my suggestions; I am fully aware that these are no guarantees against bugs, and just because a bug got introduced in someone's change does not mean it was due to his/her lack of thoroughness. But IMO it would not hurt if everyone involved kept in mind, when working on or merging changes, the responsibility that comes with developing the style in active use on osm.org and elsewhere.

BTW someone should fix https://github.com/gravitystorm/openstreetmap-carto/releases - 2.28.1 is still indicated as latest release there.

matthijsmelissen commented 9 years ago

BTW someone should fix https://github.com/gravitystorm/openstreetmap-carto/releases - 2.28.1 is still indicated as latest release there.

Done.

systemed commented 9 years ago

Useful and fruitful discussion: thanks all.

Should I therefore open a corresponding issue on https://github.com/openstreetmap/operations?

matthijsmelissen commented 9 years ago

And that means I need to go and fix ruby-mapnik as a priority.

You could also have a look at @matkoniecz's tool.

tomhughes commented 9 years ago

@systemed I don't see anything currently that would be appropriate for an operations ticket.

matkoniecz commented 9 years ago

@matkoniecz's tool.

At least for me it is very useful. It was motivated by some of my early pull requests doing other things than I expected; #1037 convinced me that proper testing is necessary and that doing it manually is asking for failure. And I think this tool is helping as I expected (though I admit I hoped for real-time generation of visual diffs - currently it is orders of magnitude too slow for that).

https://github.com/mkoniecz/CartoCSSHelper resulted in multiple bugs being found before I proposed code for merging; some issues were also found in proposed pull requests that I reviewed. Sometimes I am finding new bugs introduced to master. For example, I noticed the [tourism=attraction, =] regression during testing of an unrelated change.

During testing of #1429 I discovered the [tourism=viewpoint; natural=peak] regression and the missing cave icon. And during testing of the change that was supposed to restore the cave icon I discovered an additional regression (trees blocking labels).

Making it a gem and adding documentation is one thing that is necessary to make it usable for others, but an additional issue is that it results in a peculiar workflow.

Preparing tests is really quick (within a minute for a typical pull request, up to 5 minutes for complex ones changing many features).

But generation of the before/after images takes significant time, so it is necessary to wait hours or even days for results. For me this is not a problem, but it is likely that not everybody would be happy about it. For example, my review of #991 took so long because I was waiting for generation of diff images for multiple tag/element/location variations.

matthijsmelissen commented 9 years ago

What is the reason rendering takes so long?

apmon commented 9 years ago

Just a clarification on rendering priorities (Assuming things haven't changed since I last looked). There are already the 4 priorities @tomhughes suggested.

If you look at e.g. http://munin.openstreetmap.org/openstreetmap/yevaud.openstreetmap/renderd_processed.html you will see there are

  1. Priority request queue - this renders all missing tiles
  2. Request queue - this renders all of the tiles in need of update
  3. Low priority queue - this renders the tiles that are outdated due to the style change (older than planet-import-complete)
  4. Bulk request queue - this renders the background low-zoom tiles

All of these queues were designed to mostly be able to render in real time and thus have a very short maximum queue size of 32 (meta)tiles. If these are full, then the requests overflow into another queue, the "dirty queue", which has a maximum capacity of 1000 (meta)tiles. The dirty queue is in priority between the low-priority queue and the bulk queue. And so once requests overflow from their original queue into the dirty queue they lose their prioritization.

The system is a strict priority queue. So as long as there are requests in a higher priority queue, no tiles from the lower priority queues get rendered, and they can get starved indefinitely. Hopefully the rendering capacity of the servers is high enough for the priority and request queues to be empty most of the time, ensuring that tiles from the dirty queue get rendered. But after a style change, the low-priority queue will be full for a long time, and so any tile that gets bumped into the dirty queue will be delayed for a potentially long time. Furthermore, if a tile is re-requested while already in the dirty queue, it won't ever get bumped up to the higher priority queues, even if there is now room for the tile; there is no other way than to wait for it to get rendered in the dirty queue.
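
A rough model of the queueing behaviour just described, using the queue names and limits given above (an illustrative sketch, not renderd's actual implementation):

```python
from collections import deque

QUEUE_LIMIT = 32     # per-queue cap on (meta)tiles
DIRTY_LIMIT = 1000   # capacity of the overflow ("dirty") queue

# Highest priority first; "dirty" sits between "low" and "bulk".
ORDER = ["priority", "request", "low", "dirty", "bulk"]
queues = {name: deque() for name in ORDER}

def enqueue(tile, queue_name):
    """Put a request on its queue, or let it overflow into the dirty queue,
    where it loses its original prioritization."""
    cap = DIRTY_LIMIT if queue_name == "dirty" else QUEUE_LIMIT
    q = queues[queue_name]
    if len(q) < cap:
        q.append(tile)
    elif queue_name != "dirty" and len(queues["dirty"]) < DIRTY_LIMIT:
        queues["dirty"].append(tile)
    # else: dropped; the client will simply request the tile again later

def next_job():
    """Strict priority: lower queues starve for as long as higher ones have work."""
    for name in ORDER:
        if queues[name]:
            return queues[name].popleft()
    return None
```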

However, looking at the munin graphs of Yevaud, the issue might somewhat be a different one. In addition to all the queueing stuff, there is also a "load protection" builtin. I.e. once the load of the rendering server goes over a certain threshold, all the on the fly stuff gets suspended and all requests no matter what get put into the background (dirty) queue.

The idea behind it was that if the load is that high, things won't get rendered in time anyway, so there is no need to even try on-the-fly rendering; instead, minimize rendering to give the server a chance to recover.

However, with I/O wait time counting towards load on linux, and with Yevaud not only doing the rendering but also serving hundreds of tiles per second, and thus under intense I/O load, the load numbers look "ridiculously" high, suspending priority queueing, even though the system is running in normal conditions.

As this really is out of the scope of the openstreetmap-carto project, is there a better place to discuss these aspects?

tomhughes commented 9 years ago

@apmon Either the operations tracker or the chef one I guess - can the "load protection" be turned off, or adjusted at all?

apmon commented 9 years ago

The "load protection" is controlled by two directives in the apache config of mod_tile

https://github.com/openstreetmap/chef/blob/master/cookbooks/tile/templates/default/tile.conf.erb#L16

Looking at those, the prioritization of outdated tiles is suspended at a load of 36 and the prioritization of missing tiles is suspended at 72. With the load on yevaud currently being >200, those prioritizations should indeed currently be off, probably causing the problems.

As this "load protection" really doesn't work correctly, it is probably best to set those values really high. e.g. 1000. I have always intended to remove that code, as I think now a days it is doing more harm than good. But I never got around to it.

matkoniecz commented 9 years ago

@math1985

What is the reason rendering takes so long?

On average 24 seconds are needed to export a single static image on my hardware ( https://github.com/matkoniecz/CartoCSSHelper/blob/71fa841ed0bed5102211014b89760e043f1cc589/image_generator.rb#L107 )

In the case of the water features request I generated a visual diff for every variation of:

Rendering time is 8 * 3 * 2 * 8 * 19 * 2 * 24 s = 350208 s (more than 4 days).

There are also other factors. For some locations (Mecca in the case of water features) there is no nearby feature, so rendering of real-world data was sometimes skipped. But also - this is purely rendering time, without accounting for Overpass queries and loading data into the database.

matkoniecz commented 9 years ago

Maybe for a start we could make before/after images mandatory for every pull request?

matkoniecz commented 9 years ago

Note - after https://github.com/openstreetmap/chef/commit/9791e8b8d9e5b54ad87e5c2623c2f28a5a6f1260 rendering prioritization should work better. Graphs at http://munin.openstreetmap.org/openstreetmap/yevaud.openstreetmap/renderd_processed.html indicate that the "everything is in the dirty queue" problem is now solved.

systemed commented 9 years ago

Viewing unrendered areas is indeed working much better now. Big thanks to @apmon and @tomhughes for that.

pnorman commented 9 years ago

Discussion seems to be resolved, so closing.