Closed pnorman closed 8 years ago
Prior to the setup of orm a number of options were discussed, but they all had problems with limited bandwidth between sites. Is this still an issue?
With three servers we could have three rendering locations and would need to transfer data from any server to any other server. What are the weakest links in the connections?
I am defining a server as a machine running the tile store, renderd, and mod_tile (this could actually be split up), and a site as a location where there are local connections between the servers.
I think we probably want the following characteristics in a setup
For both simplicity and minimizing inter-site bandwidth, I think it's best to have
If we had multiple servers at one site we could look at more complicated tile replication strategies, but I think with only 3 servers it's not worth it.
Your description there sounds a lot like BitTorrent Sync.
renderd supports tile stores on memcached and rados/ceph. I believe ceph supports what we need, but I'm not 100% sure. I see some potential issues with ceph
Given that renderd has support, this might be the best option.
Renderd also supports distributing rendering requests between multiple renderd instances, but I don't recommend this
Going this direction is also consistent with vector tiles - most of those implementations use some service as a vector tile store and often have it running on different servers.
Yesterday, yevaud rendered 963,135 distinct metatiles and orm rendered 859,036 of which 303,923 were the same. If only one copy of each of those were rendered, that would be an overall saving of 17%, which is nowhere near as large as I'd have hoped.
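For anyone checking the 17% figure, here is a quick sketch of the arithmetic. It assumes the saving is measured against the total number of metatile renders across both servers; the variable names are just for illustration.

```python
# Metatile counts quoted above for one day of rendering.
yevaud = 963_135   # distinct metatiles rendered by yevaud
orm = 859_036      # distinct metatiles rendered by orm
shared = 303_923   # metatiles rendered by both servers

# Assumption: "overall saving" means renders avoided as a fraction of
# all renders actually performed across the two servers.
total_renders = yevaud + orm
saving = shared / total_renders

print(f"total renders: {total_renders:,}")
print(f"saving if each shared metatile were rendered once: {saving:.1%}")
```

The saving comes out just under 17% of the roughly 1.8 million renders performed that day.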
Also surprising (at least to me) is that the duplication is relatively stable across zoom levels. I'd have expected far more duplication at low zoom levels than high. But perhaps people are as likely to look at details in a map of somewhere far away as they are nearby - or perhaps those are just less likely to be in cache.
Interesting data, and means that a 3rd machine will likely be more useful than I'd previously thought.
> Yesterday, yevaud rendered 963,135 distinct metatiles and orm rendered 859,036 of which 303,923 were the same. If only one copy of each of those were rendered, that would be an overall saving of 17%, which is nowhere near as large as I'd have hoped.
If a single server has a capacity of 1, then two have a capacity of 1.66.
If you assume that the statistics remain the same and that, when rendering a tile, there is a 17% chance that a specific other server has the tile, then with three servers there is a 31% chance that one of the two other servers has the tile for any given request. With each server spending 31% of its capacity duplicating work, the total capacity is 2.07, an increase of 25%. If everything were distributed ideally it would be an increase of 50% (2x actual).
My gut tells me that the statistics will not be the same and 3 servers will be slightly better than this model, but it gives us a place to start.
With four servers, it is a 43% chance of duplicating work and a total capacity of 2.29, an increase of 11% instead of 33% (3x actual).
So we could set up three servers and not be too badly off for duplication, but beyond that it gets worse.
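The model behind these numbers can be written down in a few lines. This is only a sketch of the reasoning in this comment, assuming each other server independently has a given tile with probability 0.17; the function names `p_duplicate` and `capacity` are made up for illustration.

```python
def p_duplicate(n_servers: int, p_pair: float) -> float:
    """Chance that at least one of the other servers already has the tile.

    Assumes each other server independently has it with probability p_pair.
    """
    return 1 - (1 - p_pair) ** (n_servers - 1)


def capacity(n_servers: int, p_pair: float) -> float:
    """Useful capacity in units of one server, after subtracting duplicated work."""
    return n_servers * (1 - p_duplicate(n_servers, p_pair))


P_PAIR = 0.17  # duplication measured between yevaud and orm

previous = None
for n in range(1, 5):
    cap = capacity(n, P_PAIR)
    line = f"{n} server(s): duplicate chance {p_duplicate(n, P_PAIR):.0%}, capacity {cap:.2f}"
    if previous is not None:
        line += f", +{cap / previous - 1:.0%} over {n - 1}"
    print(line)
    previous = cap
```

Under the independence assumption, each extra server multiplies the useful fraction of every server by 0.83, which is why the gain per added server shrinks so quickly.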
cc @apmon about renderd
If we cut our duplication in half, to 8.5%, we'd gain 10% capacity for two servers, 21% for three servers, and 34% for four servers.
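As a rough check of those gains, the sketch below compares capacity at 17% and 8.5% pairwise duplication, using the same independence assumption as the capacity model earlier in the thread. The helper names are hypothetical.

```python
def capacity(n_servers: int, p_pair: float) -> float:
    # Each server wastes work whenever any of the other n-1 servers
    # already has the tile (assumed independent, probability p_pair each).
    return n_servers * (1 - p_pair) ** (n_servers - 1)


def gain_from_halving(n_servers: int, p_now: float = 0.17) -> float:
    """Relative capacity gain from halving the pairwise duplication rate."""
    return capacity(n_servers, p_now / 2) / capacity(n_servers, p_now) - 1


for n in (2, 3, 4):
    print(f"{n} servers: about {gain_from_halving(n):.1%} more capacity")
```

Without intermediate rounding the three-server gain lands between 21% and 22%, so it agrees with the figure above to within rounding.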
Are there any tweaks to load balancing that make sense without removing the ability to have a server fail and its load get redistributed?
This is probably a naive suggestion, but what about making the CDN use some sort of hash (even as simple as (x+y)%n) to decide which server will render a given tile? That would reduce duplication of rendering work and caching, since a tile would always be requested from the same server. The hashing would need to take metatile sizes into account; otherwise it would send requests for different tiles in the same metatile to different servers and make things worse than now.
It still has to support failover to another server if the hash-chosen server is down or taking too long, which would add some cache duplication, but it would be minimal compared to now.
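To make the suggestion concrete, here is a minimal sketch of metatile-aware selection with failover. It assumes 8x8 metatiles, and the `metatile_key` and `pick_server` functions, the `is_up` health check, and the third server name are all hypothetical. It is not how the CDN is actually configured.

```python
METATILE = 8  # assumption: metatiles are 8x8 tiles

SERVERS = ["yevaud", "orm", "third-server"]  # third name is a placeholder


def metatile_key(z, x, y, metatile=METATILE):
    """Map a tile to its metatile so all tiles in it hash identically."""
    return z, x // metatile, y // metatile


def pick_server(z, x, y, servers, is_up=lambda server: True):
    """Pick a server with the simple (x+y)%n hash, applied per metatile.

    If the chosen server is down, walk to the next one in the list, so
    only the failed server's share of metatiles moves elsewhere.
    """
    _, mx, my = metatile_key(z, x, y)
    start = (mx + my) % len(servers)
    for offset in range(len(servers)):
        candidate = servers[(start + offset) % len(servers)]
        if is_up(candidate):
            return candidate
    raise RuntimeError("no rendering server available")


# Two tiles in the same metatile go to the same server.
print(pick_server(10, 512, 340, SERVERS))
print(pick_server(10, 519, 343, SERVERS))
# Failover: pretend orm is down.
print(pick_server(10, 512, 340, SERVERS, is_up=lambda s: s != "orm"))
```

Walking to the next server on failure keeps the mapping stable for every metatile whose preferred server is still up, but the fallback server still starts with none of the failed server's tiles in its store.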
Is this ticket in the right place? It's not clear to me that this is a "chef" issue at the moment because it's not something we can solve just by writing a chef recipe...
> making the CDN use some sort of hash (even as simple as (x+y)%n) to decide which server will render a given tile?
The problem is that if you then have a server go down your tile store hit rate goes down because the hit rate is 0% on the new load. You can use something like that to distribute work, but you still need a shared tile store.
> Is this ticket in the right place? It's not clear to me that this is a "chef" issue at the moment because it's not something we can solve just by writing a chef recipe...
What would be a more appropriate place? It's eventually going to result in recipe changes
> What would be a more appropriate place?
Usually the operations tracker is where high-level things like hardware resource allocations, budgeting etc are considered.
Closing in favour of https://github.com/openstreetmap/operations/issues/101 - where it's more appropriate and we can track aspects of this which aren't just Chef-related. Apologies to anyone following the breadcrumbs :disappointed:
From https://github.com/openstreetmap/chef/pull/78#issuecomment-239563299
Currently the two servers are independent, and clients go to one based on geoip. This means that the rendering workload is not fully duplicated between the two servers, as users in the US tend to view tiles in the US and users in Germany tend to view tiles in Germany. This has been tested by swapping locations and seeing an increase in load.
Unfortunately, this doesn't scale well to higher numbers of servers.