openstreetmap / operations

OSMF Operations Working Group issue tracking
https://operations.osmfoundation.org/

Change to tiles acceptable usage policy #113

Closed zerebubuth closed 7 years ago

zerebubuth commented 8 years ago

The usage of OSM tile infrastructure by the OpenStreetMap website was recently measured at approximately 11% of tile server output, so the vast majority of OSM's tiles are rendered to support 3rd party sites and apps. Although we want to support use of OpenStreetMap data, there are some use cases for which OSM and the surrounding ecosystem receive little or no benefit.

It has been suggested that OWG introduce two new rules to the tiles acceptable usage policy:

  1. That only open source apps may use the tile server. (Alternatively, and more strictly, that commercial apps may not use the tile server.)
  2. That only publicly-accessible sites may use the tile server. For example, sites requiring a login or accessible only from behind a firewall or VPN would not be public, and therefore would not be allowed to use the tile server.

The technical implementation of this is a detail, so please keep discussion to the policy of whether or not we want to begin to restrict usage in this way.

iandees commented 8 years ago

A couple thoughts:

  1. The discussion on this policy should be taken in the context of the recently-announced corporate membership program. There will be claims that this policy is an effort to direct money to potential corporate members. This policy should explicitly not recommend any alternatives, or if it does, have very clear reasons for recommending them.
  2. With that context in mind, it might be good to have a discussion about whether the operations team and OSMF want to allow use of OSM.org tile infrastructure by commercial entities or in commercial situations (ad-supported or for-pay apps or sites).
  3. Although it might be considered a technical implementation detail, the policy should describe in general what the repercussions are for not following the policy. This would help people understand why they are blocked and help alleviate the inevitable questions on IRC.
lonvia commented 8 years ago

Given that Nominatim has similar problems, I would suggest including the search acceptable usage policy in the discussion. Whatever the outcome, we should apply the same rules for tiles and search.

gravitystorm commented 8 years ago

Although I have views on this issue, given my conflicts of interest I recuse myself from this topic.

zerebubuth commented 8 years ago

@iandees thanks! Those are very good points.

  1. The motivation for the proposed changes to the policy has nothing to do with corporate membership and, while I'd encourage any organisation to join OSMF, I'd agree that any such membership should not imply that any recommendation would be given. In general, we would point to switch2osm to provide anyone with further information about alternative sources of tiles - either other providers or hosting their own.
  2. Indeed, that's exactly the discussion I was hoping to have! The proposal frames it in terms of "open source apps" and "publicly-accessible sites", but it could potentially be more specific about "ad-supported" or "pay-for" apps or sites.

    I think most people would be happy for as many people as possible to see and use OpenStreetMap tiles, but we aren't able to practically support that. While there are on-going efforts to increase capacity and we welcome donations of tile caches and rendering servers, it's possible that some uses of OSMF tile infrastructure are less beneficial to the community than others.

    For example, one might think that use of OpenStreetMap tiles in a closed-source, pay-for app is less deserving of support than use in open-source, free apps. And therefore, we'd prefer to start restricting use in those sectors first, rather than just an overall restriction on all users as we max out our capacity.

  3. I would love to have that discussion as a follow-up to this one. You are right that the method and repercussions matter, and documenting and explaining them clearly is very important. However, I worried that this discussion would become hard to follow if we were discussing the what at the same time as the how. Since the what would likely inform the how, I felt it was better to have this discussion first.

PS: If anyone has strong feelings about this, please comment - this is an open issue and OWG wants to hear your views.

systemed commented 8 years ago

the vast majority of OSM's tiles are rendered to support 3rd party sites and apps

Do we have any (approximate) breakdown[1] of sites vs apps; and any breakdown of how many sites use what % of capacity?

My rough impression as an interested onlooker is that there are three broad categories of user, other than mapping tools:

  1. popular sites with in-browser maps, e.g. Pokemon Go (problematic levels of use)
  2. apps (very often problematic levels of use)
  3. a long tail of tiny sites with in-browser maps (not problematic levels of use)

I'm not overly bothered about 1 or 2, but to cut off 3 would concern me. Having OSM tiles on (say) a small shop's "Where we are" page is good visibility for OSM yet minimal load on the servers, even in aggregate. OSM also has a wider social role in encouraging people to "use [maps] in creative, productive, or unexpected ways", and to prevent small sites from using our maps as soon as they add their first AdSense embed or affiliate link would damage this role.

On a broader issue, I worry that this might have the effect of making the best-funded service companies into gatekeepers for the OSM ecosystem. If long-tail users are immediately required to sign up for a 'starter plan' with a services company, this may reduce their attachment to OSM per se and to the broader ecosystem.

But if I'm way off on the numbers then this is all moot.

(And thank you for soliciting wider contributions! :) )

[1] I ask this knowing how much @zerebubuth enjoys a bit of numerical analysis

simonpoole commented 8 years ago

I would take issue with the underlying hypothesis that OSM does not benefit from use of the OSMF-provided services in "closed" environments. I can see no logical reason why this should be the case, and would suggest that this be substantiated before changing rules in a way that might affect a larger number of users in the short term.

Note: if we do change something we should take https://github.com/openstreetmap/operations/issues/114 into account

Komzpa commented 8 years ago

I think this is a really shortsighted proposal. It will certainly lower tile usage levels, but it will also move lots of people away from OpenStreetMap altogether.

Sending people to go read switch2osm when they're using osm.org tiles sounds like 'go away'. Switch2osm was started as a guide on how to use OSM - not as a way to say "don't use OSM, go install your own OSM". Google's prices aren't that high compared to self-hosted OSM.

Speaking for myself: I'm lazy nowadays. I know how to set up tiles and did that numerous times. I have two options now:

  1. go, buy a server, set up tiles, utilize 0.001% of it and be a happy self-customer;
  2. go, buy a server, send credentials to operations@, see it used at 90% by all the osmers (including my country, yay!) and be a happy tile.osm.org consumer. I went the second way. This makes me happy - I use a lot of software with hard-coded tile.osm.org without the guilt feeling that is recurrently planted by "tiles are only for osm.org, go and buy mapbox instead" usage policies. How about we just embrace this behavior more?

"Commercial apps can not use the server" - we're using QGIS with QuickMapServices plugin. If we buy commercial support for QGIS, helping its development, we cannot use osm.org tile cluster anymore?

We're using Carto tiles as background for (hundreds of) GPS traces that had issues with routing, and we fix osm.org roads each time we identify an issue with them. Should this be banned, as it is business usage, or should this be embraced, as it is about improving the map, finding rare and obscure issues?

What we lack is transparency, IMHO:

We're trying to have social enforcement where there are technical solutions. There were rumors about a Varnish migration. We're not bandwidth limited now, we're rendering-power limited - how about implementing logic of:

mtmail commented 8 years ago

Thank you komzpa for running a tile server.

I believe identifying top websites using osm.org tiles and asking them to donate will send the wrong message. What if a website refuses for one reason or another? Or donates very little? Or donates once and never again in the following years? Would we then switch off access? That model would be seen as "give us money or else". Same with returning tiles "out of capacity, donate".

https://switch2osm.org/ includes instructions on how to set up a tile server but also lists tile provider companies. Pointing users to the website doesn't mean they have to run their own tile server.

Not being bandwidth limited currently is lucky as bandwidth (at least for the hosting in London) is donated.

We're using Carto tiles as background for (hundreds of) GPS traces that had issues with routing, and we fix osm.org roads each time we identify an issue with them. Should this be banned, as it is business usage [...]

Isn't that up to CartoDB's tile policy? But you raise a good point here: editors (people fixing the map) should ideally never be restricted.

matkoniecz commented 8 years ago

That only publicly-accessible sites may use the tile server. For example, sites requiring a login or accessible only from behind a firewall or VPN would not be public, and therefore would not be allowed to use the tile server.

This would also ban, for example, my local file that makes a map of places with missing OSM tags (I make this map to find places for mapping), and all kinds of things that are in development, even ones intended to be open to the public.

For reference - my file, in a version showing places with bicycle parkings that are missing in OSM but present according to data released by the city of Kraków: https://gist.github.com/matkoniecz/bac244f38693f307b3560e4e71bf8e04

matkoniecz commented 8 years ago

Do we have any (approximate) breakdown[1] of sites vs apps; and any breakdown of how many sites use what % of capacity?

This would be highly useful. I suspect that there are some sites/apps/scrapers like Pokemap with very high usage and a long tail of tiny/optimized sites with acceptable levels of usage.

matkoniecz commented 8 years ago

The usage of OSM tile infrastructure by the OpenStreetMap website was recently measured at approximately 11% of tile server output

Can you give a link to that? (I am curious how it was done from the technical side.)

zerebubuth commented 8 years ago

The usage of OSM tile infrastructure by the OpenStreetMap website was recently measured at approximately 11% of tile server output

Can you give a link to that? (I am curious how it was done from the technical side.)

I'm afraid it was done in the simplest and least repeatable way possible: taking the tile render server Apache logs and running the following over them:

    awk -F '"' '{total += 1; if ($4 ~ /openstreetmap.org/) {osm+=1;}} END {print osm " " total;}'

For example, on yesterday's logs on yevaud, it gives 9112308 78726601, which is 11.6% - slightly higher than last time I ran it.
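
For reference, a rough Python equivalent of the awk one-liner (a sketch assuming the standard Apache combined log format, where the referer is the fourth double-quote-delimited field; not the exact script used):

    # Count what share of requests carried an openstreetmap.org referer.
    import sys

    total = 0
    osm = 0
    for line in sys.stdin:
        fields = line.split('"')
        if len(fields) < 4:
            continue  # skip lines that don't look like combined-format entries
        total += 1
        if "openstreetmap.org" in fields[3]:
            osm += 1

    share = 100.0 * osm / total if total else 0.0
    print(f"{osm} {total} ({share:.1f}% of requests carried an openstreetmap.org referer)")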

dbf256 commented 8 years ago

I've read that this thread is for policy discussion, not for tech details, but just for better understanding I would like to know whether it is possible to deploy a configuration that has 2 separate flows. One would be for core activities (mapping editors, tiles at openstreetmap.org, etc.) with a higher quality of service and not for 3rd party usage; the other flow would be for all others, where service is provided on a best-effort basis (if there is capacity for it).

I agree with Komzpa that it is better to have a technical solution (if possible) rather than a legal one, because then we would be able to block users that are the source of an issue (bandwidth usage etc. - it could be a non-profit website like fastpokemonmap), rather than ones that are harmless but closed-source/commercial, for example.
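
Purely as an illustration of that idea - the pool names and matching rules below are invented, not anything OWG has proposed - such a split might look roughly like:

    # Hypothetical sketch of two flows: a "core" pool for mapping editors and
    # openstreetmap.org itself, and a "best-effort" pool for everything else.
    # The referer/user-agent patterns below are invented examples.
    CORE_REFERER_DOMAINS = ("openstreetmap.org",)
    CORE_USER_AGENT_PREFIXES = ("JOSM", "Vespucci")  # example editor UAs

    def choose_pool(referer: str, user_agent: str) -> str:
        """Return which render pool should serve this request."""
        if any(domain in referer for domain in CORE_REFERER_DOMAINS):
            return "core"
        if any(user_agent.startswith(p) for p in CORE_USER_AGENT_PREFIXES):
            return "core"
        return "best-effort"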

zerebubuth commented 8 years ago

Do we have any (approximate) breakdown[1] of sites vs apps; and any breakdown of how many sites use what % of capacity?

My rough impression as an interested onlooker is that there are three broad categories of user, other than mapping tools:

  1. popular sites with in-browser maps, e.g. Pokemon Go (problematic levels of use)
  2. apps (very often problematic levels of use)
  3. a long tail of tiny sites with in-browser maps (not problematic levels of use)

I'm not overly bothered about 1 or 2, but to cut off 3 would concern me.

Here's the breakdown based on renderer requests on yevaud for the preceding 24h period - this is cache misses, so will undercount any sites whose tiles are served entirely from cache. But those aren't the ones which cause us a problem anyway.

Hits       Referer domain                        %age
34113602   -                                     43.81%
 9141929   *.openstreetmap.org                   11.74%
 1672876   not a domain                           2.15%
 1512934   trackmytaxi.com                        1.94%
  980502   easymap.land.moi.gov.tw                1.26%
30447314   23,995 sites smaller than 1% usage    39.10%

The - entry is for things which didn't send a referer, and I've put a breakdown of those below. The not a domain entry is for referers which are IP addresses or localhost, or things which are not valid URLs.

Unfortunately, the 23,995 sites in the long tail together make up almost 40% of usage, yet are a very long, smooth tail:

[chart: distribution of usage across the long tail of referer domains]

Hits       User agent     %age
 5167262   Android         6.64%
 2586501   osmdroid        3.32%
 2560810   -               3.29%
 2211556   OruxMaps        2.84%
 2136972   CFNetwork       2.74%
 1830649   libwww-perl     2.35%
17619852   others         22.62%

Where - means that no user agent header was sent and others represents the sum of all the remaining requests. The percentage is relative to total requests. Similar to the referers, there's a long tail of smaller user agents. 15% of the usage comes from things which are not individually distinguishable, for example the generic Android user agent, -, "CFNetwork" which is the generic iOS user agent, or "libwww-perl", the generic user agent for scrapers written in Perl.

  3. a long tail of tiny sites with in-browser maps (not problematic levels of use)

I'm not overly bothered about 1 or 2, but to cut off 3 would concern me.

The sum of the "long tails" for user agents and referers is 61.72%, which means that almost two-thirds of our requests are in your 3rd category. When taken individually, they are not problematic levels of use. As a whole, they do cause a problem.

I think everyone would agree that we want to support mapping activity. The requirements for supporting this are clear; we want quick updates and lots of detail.

The problems start to appear when we try to support generic web mapping use-cases, which have a different set of requirements; they are not so bothered about updates, and favour "less cluttered" maps over more detailed ones. In fact, it's better for the generic web mapping use case to update less frequently, as this means better cache hit ratios and faster load times. "Less cluttered" map tiles tend to be smaller, making them faster to load and making better use of edge and device caches.
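
As a back-of-the-envelope illustration of the cache-hit point (a toy model, not a measurement of the real CDN), longer tile lifetimes translate almost directly into higher hit ratios:

    # Toy model: approximate cache hit ratio for one tile requested `per_day`
    # times per day and re-rendered only when its cached copy expires after
    # `ttl_days`. Assumes exactly one miss per expiry period - a simplification.
    def approx_hit_ratio(per_day: float, ttl_days: float) -> float:
        requests_per_ttl = per_day * ttl_days
        if requests_per_ttl <= 1:
            return 0.0
        return 1.0 - 1.0 / requests_per_ttl

    for ttl_days in (1, 7, 30):
        print(ttl_days, f"{approx_hit_ratio(10, ttl_days):.1%}")
    # -> 90.0% at 1 day, ~98.6% at 7 days, ~99.7% at 30 days
    #    for a tile fetched 10 times a day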

It is possible to support all of this, with more hardware and more donations. However, using discretionary spending (and to a lesser extent, donated hardware) to support generic web mapping means we have fewer resources to support other OSMF-run services, such as the API, planet, Nominatim, etc...

matkoniecz commented 8 years ago

Maybe, for a start, blocking things like

15% of the usage comes from things which are not individually distinguishable, for example the generic Android user agent, -, "CFNetwork" which is the generic iOS user agent, or "libwww-perl", the generic user agent for scrapers written in Perl.

? A 15% reduction is a big chunk (though most would probably start using proper identification rather than disappearing) and it would encourage proper identification by users. And maybe there are big, inefficient users hiding behind generic user agents.
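
For illustration, one crude way to bucket the generic user agents named above when tallying log lines (only the examples from this thread are matched; real matching would need more care, e.g. for CFNetwork strings that do carry an app name, as noted later in the thread):

    # Crude sketch: flag user agents that don't identify an application.
    GENERIC_UA_MARKERS = ("Android", "CFNetwork", "libwww-perl")

    def is_generic_user_agent(ua: str) -> bool:
        ua = ua.strip()
        if ua in ("", "-"):
            return True  # no user agent sent at all
        return any(ua.startswith(marker) for marker in GENERIC_UA_MARKERS)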

matkoniecz commented 8 years ago

@zerebubuth

trackmytaxi.com easymap.land.moi.gov.tw OruxMaps

Maybe it would be a good idea to ping them about this discussion?

zerebubuth commented 8 years ago

Maybe, for a start, blocking things like ... ? A 15% reduction is a big chunk (though most would probably start using proper identification rather than disappearing) and it would encourage proper identification by users.

That's a good idea, and what we've been doing for a few years now; tightening access requirements and trying to find "abusive" users. We're at the point where the vast majority (~62%) of tile accesses aren't coming from a small number of abusive users or apps; instead, they come from a huge pool of small sites and apps which aren't directly related to OSM - other than using it as their free map.

There might be some benefit to having all those sites displaying attribution to OSM (assuming they do, of course). On the other hand, there's a clear trade-off between using resources for tiles and for other things more directly related to mapping activities.

Komzpa commented 8 years ago

I think there's an issue of trying to make savings instead of making growth.

Can we translate all these numbers into costs, according to, say, Amazon's pricing?

I've installed Orux Maps - they do show a "Donate" dialog, proposing to donate to them. Given that Google Play reports 10,000+ "donate app" installs at a $2.74 price, that translates into an estimated income of $20,000. Keeping vial.openstreetmap.org up translates for me into 147 EUR / month, which is ~$161 / month. If we kindly ask Orux to provide a similar server, even now they could support it for up to 10 years. This would give us capacity for 10x the amount Orux uses.

Orux has GPS tracking support. It might be hard to trace directly, but osm.org has a GPX upload feature - how many of those uploads are from Orux users?

If we have need for capacity in other fields, then please light those fields up as such.

matkoniecz commented 8 years ago

That's a good idea, and what we've been doing for a few years now; tightening access requirements and trying to find "abusive" users.

But is there any technical problem with blocking remaining uses without proper user agent?

Komzpa commented 8 years ago

But is there any technical problem with blocking remaining uses without proper user agent?

I think it also has a social impact that should be well thought through.

How exactly is blocking performed? Does it show just a generic "Access blocked" message, or "Get your developer to send a correct user-agent, and donate"? Just getting "Access blocked" makes people upset, and gives the impression that OSM is just a greedy project that doesn't want to share tiles even though it could (it all worked, then stopped, because sysadmins!).

oruxman commented 8 years ago

Hello all; from OruxMaps:

I think that the benefits of OruxMaps are a bit overestimated. There are approximately 20,000+ donations, but spread over 7 years. The donation was 2 € until last year. Google takes 30%, VAT is 21%, plus approximately another 25% tax in my country, plus paying for hosting,... The current profit is now easier to calculate. Recently I increased the donation to 3 €, because it was almost costing me money.

I'm sure there are many paid applications that use OSM more intensively and do not identify themselves.

OruxMaps also allows users to upload GPX track files to the OSM servers. In this way I think it helps OSM.

If I have to pay for the map server services, honestly I think I would have to remove them from the app, leaving only support for offline maps. There are other excellent servers that the app cannot use because of their prices.

I can make specific donations, but if I have to pay for all the map servers, it would be impossible. I could offer users the option to donate directly to the map provider, or to pay affordable prices. I really do not know which is the best solution.

molind commented 8 years ago

While OSM tries to provide recent tiles for everyone, this causes a high render workload and a low cache hit rate, yet recent tiles are needed only by a minority of users.

I think the current situation could be solved by splitting the map cache into two parts: one for signed-in OSM users, with high rendering priority as it works now for everyone, and a second for everyone else with a big cache size, where tiles are invalidated after a week or even a month.

It should reduce tile renderer usage and focus it on editors, without blocking hot resources like the Pokemon map or similar.
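
A minimal sketch of that idea, assuming the front end could tell signed-in osm.org sessions apart from everything else; the TTL values below are placeholders, not anything that has been agreed:

    # Hypothetical: choose a Cache-Control max-age per request depending on
    # whether it comes from a signed-in osm.org session.
    FRESH_MAX_AGE = 3 * 60 * 60         # 3 hours for signed-in mappers
    STALE_MAX_AGE = 7 * 24 * 60 * 60    # 1 week for everyone else

    def cache_control(signed_in: bool) -> str:
        max_age = FRESH_MAX_AGE if signed_in else STALE_MAX_AGE
        return f"Cache-Control: public, max-age={max_age}"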

Komzpa commented 8 years ago

I think the current situation could be solved by splitting the map cache into two parts

There's no need to split up the cache itself. It can be achieved with rather simple means:

What do you think?

althio commented 8 years ago

My few thoughts...

I mostly disagree with a differentiated policy for open source / commercial / publicly-accessible / behind a firewall or VPN. I think it is a nightmare just to define such a policy, and I prefer the current one, with rule #1 on heavy use. I believe it is very much in the OpenStreetMap spirit that commercial use is also accepted. See also https://github.com/openstreetmap/operations/issues/113#issuecomment-256975087

For example, one might think that use of OpenStreetMap tiles in a closed-source, pay-for app is less deserving of support than use in open-source, free apps. And therefore, we'd prefer to start restricting use in those sectors first, rather than just an overall restriction on all users as we max out our capacity.

For me the only sector OSMF/OWG ought to support (as more deserving) is anything contribution-related to the core OpenStreetMap project; it is not about any external project or website being open or closed, free or commercial. Just adjust the rate for the overall restriction and keep a whitelist for the website and core projects.

I like the direction of https://github.com/openstreetmap/operations/issues/113#issuecomment-257238658. I would finally challenge the assumption that everyone visiting the OpenStreetMap website needs recent tiles. That is only true for a contributor looking for feedback; it is not true for all contributors and is most likely false for simple visitors looking for a map. So maybe what we need is a bit of UX/UI where tiles are routinely served from an 'old cache' but you can more easily request newly rendered tiles with a user action.

amandasaurus commented 8 years ago

It has been suggested that OWG introduce two new rules to the tiles acceptable usage policy:

  1. That only open source apps may use the tile server. (Alternatively, and more strictly, that commercial apps may not use the tile server.)

Be careful with "commercial", since it's vague. Is the BBC commercial?

systemed commented 8 years ago

@zerebubuth Thank you; that's really interesting, though it makes it clear that there are no easy answers!

Depending on how the stats are presented, you could make a case that we are serving 11.74% to osm.org; 22.62% to generic webmapping uses; and 65.64% to scrapers, apps, super-heavy users and other potential "abusers".

My guess is that a non-commercial restriction would be lucky to reduce that 22.62% by much more than 5%, to (say) 17%. To get even that much of a reduction from the 24,000 sites would require intensive policing, which given the scale of the issue would likely be carried out by the community at large rather than by OWG. Given the (shall we say) alacrity with which some community members have approached attribution issues in the past, I fear that would be counter-productive for the goodwill of OSM.

So for an alternative suggestion which might be more closely aligned to the numbers:

  • No offline downloading or alternatively no offline downloading at z17+
  • No app use or alternatively no app use at z17+
  • Referer must always be sent, tiles blocked otherwise
  • Authentic User-Agent must always be sent (i.e. not a library or generic platform UA), tiles blocked otherwise
  • OWG adopts policy to warn, then block, sites generating >n views per day

where n and 17 are figures decided by OWG.

(Slight side-issue: I wonder how much mapping activity is generated by the 22.62% of webmapping uses. In other words, people or organisations (however defined) who add content to OSM so that they can show it on the map on their own sites. I suspect it's significant in terms of mapping activity, but insignificant in terms of server load; if so, any policy should seek to preserve that linkage.)
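
For illustration, a rough sketch of what the "warn, then block, sites generating >n views per day" suggestion could look like as a daily log report; the threshold value and the combined-log-format assumption are placeholders, not OWG policy:

    # Count hits per referer domain and flag those above a daily threshold.
    import sys
    from collections import Counter
    from urllib.parse import urlparse

    THRESHOLD = 1_000_000  # placeholder for OWG's "n" views per day

    hits = Counter()
    for line in sys.stdin:
        fields = line.split('"')
        if len(fields) < 4:
            continue
        domain = urlparse(fields[3]).netloc or "(no referer)"
        hits[domain] += 1

    for domain, count in hits.most_common():
        if count > THRESHOLD:
            print(f"warn/block candidate: {domain} ({count} requests today)")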

Komzpa commented 8 years ago

Tiles are a great advertising platform for OSM. Seriously - there are millions of 256x256 banners served daily to thousands of web sites, and OSM does not have to pay for displaying them - just serve them.

What do you propose to say about OSM on these banners?

So, while we have this platform, it is essential to send the correct message if we want to.

Scraping will soon go away - if scrapers see their cache poisoned by such tiles, they will stop.

nicolas17 commented 8 years ago

Uhh, the current tile usage policy already says "Valid HTTP User-Agent identifying application". Why aren't "Android", "CFNetwork" and "libwww-perl" already blocked?

lonvia commented 8 years ago

There are two recurring points in the discussion where I strongly disagree:

  1. Keeping up with the increasing demand for free tiles is not just a question of donating money and buying more servers. The servers also need to be maintained. There is already a lot of unpaid time going into this, and more servers just means more time. We are currently cross-financing a free tile service from donations in time and money that were actually meant to go into core infrastructure for mapping. I find that very problematic. It also means that we are offering a service where it is impossible for any provider to compete, and that kills business for potential new data consumers which could function as a counter-weight to the big 'gate keepers'.
  2. I don't buy into the argument that tiles are an important advertisement for OSM. There is limited recognition value in a small map excerpt that appears on a random page. The only people who are likely to read the attribution are map geeks, and they most likely already know about OSM. I would even go as far as claiming that a not insignificant part of the users of our tile servers have no idea where the data comes from. They just found that Javascript snippet that magically made a map appear. So, I don't think we lose much as a mapping project if we lose some of the long tail of users.

I did like the original suggestion because it fits well into the open data theme. From a more practical point of view it might be better to draw a line that is easier to implement. @systemed's suggestions are probably better in that regard and would have my support as well (I would even go as far as supporting restricting the highest zoom levels for everybody, but I'm not sure how much gain there is).

simonpoole commented 8 years ago

@lonvia on (1), I don't believe there has ever been any clarity on who the intended audience for the services is. It's not even clear whether the majority of the 11% of tiles requested via osm.org really go to mappers (likely not), and it might be a good idea, completely independent of this discussion, to think a bit more about that aspect. In the past I suspect there was always the hope that somebody would come along and build a gmaps competitor with OSM data; it is fairly clear now that that is not going to happen.

zerebubuth commented 8 years ago

So for an alternative suggestion which might be more closely aligned to the numbers:

  • No offline downloading or alternatively no offline downloading at z17+

No offline downloading at z17+ is already forbidden. However, it's difficult to detect when a download is for offline use except in the most egregious cases, and the most egregious ones tend to be obvious and have already been blocked.

  • No app use or alternatively no app use at z17+

Many apps are benign, and are simply displaying the user's current location on the map. Some of these apps help with data collection, or are otherwise beneficial for mappers. We do already require app distributors to contact OWG before distributing their app, although few ever do. Detecting app use is initially easy, based on the User-Agent, but I worry that blocking on that basis would encourage faking the User-Agent.

  • Referer must always be sent, tiles blocked otherwise

We already require a valid Referer to be sent, if there is one. I'd be hesitant about suggesting that a referer must always be sent. Browsers do sometimes not send referers (e.g. viewing a single tile on its own), and I wouldn't want to encourage apps to send fake Referers.

  • Authentic User-Agent must always be sent (i.e. not a library or generic platform UA), tiles blocked otherwise

The central issue here is whether one can tell what an "authentic" User-Agent is. At the moment, I think there's not much User-Agent fakery going on, because we don't do much blocking by user agent. If we were more stringent, I think we'd start to see much more fakery.

I think it would end up as a cat-and-mouse game, absorbing a lot of time maintaining the accept/block lists of User-Agents. How can we open up maintenance of these lists so that we can automate more of it and share the burden of the rest?

  • OWG adopts policy to warn, then block, sites generating >n views per day

Warning can often be hard. Many abusers don't make it easy to contact them. The times I've tried, the email just disappeared into a black hole.

Depending on how the stats are presented, you could make a case that we are serving 11.74% to osm.org; 22.62% to generic webmapping uses; and 65.64% to scrapers, apps, super-heavy users and other potential "abusers".

My interpretation is that we're serving something around 6-8% to scrapers and other "abusers", 10-12% to osm.org, 18-22% to a range of "large" sites and apps, and 62% to the "long tail" of sites and apps. My experience of the last few years is that the share to the "long tail" has been increasing as we've acted to block the larger and more obvious abusers, to the point where the "long tail" itself is now the problem.

matkoniecz commented 8 years ago

The central issue here is whether one can tell what an "authentic" User-Agent is. At the moment, I think there's not much User-Agent fakery going on, because we don't do much blocking by user agent

What about "Android", "CFNetwork" and "libwww-perl"? Blocking these will not encourage User-Agent fakery, it will encourage providing valid User-Agent.

Komzpa commented 8 years ago

Problem with "Android" and "CFnetwork" could be that it can possibly be a web-view in an application, that opens when user follows a link instead of system web-browser. Do we have anything like 'twitter', 'telegram', 'vkontakte' in user-agent field? How much of those (and of Android / CFNetwork) are combined with referer?

zerebubuth commented 8 years ago

Do we have anything like 'twitter', 'telegram', 'vkontakte' in user-agent field? How much of those (and of Android / CFNetwork) are combined with referer?

There are a handful of hits from "TwitterBot", otherwise none of those other User-Agents appear in the list.

The breakdown by User-Agent that I gave in https://github.com/openstreetmap/operations/issues/113#issuecomment-257085965 is only for hits without referers. Hits with referers were accounted to the site they referred to.

96% of "Android" hits had no referer. 100% of "libwww-perl" (not very surprising). 99% of "CFNetwork". Although digging a little deeper - the User-Agent parsing library I was using was aggregating a lot of apps together - it seems like CFNetwork includes the name of app which calls it as part of the User-Agent. The largest single app is 35% (of CFNetwork's 2.74%), and there's another long tail from there. The "Android" aggregation includes no such distinguishing information.

grinapo commented 8 years ago

As a sidenote: has anyone ever discussed with the Wikimedia Foundation whether they could offload some requests to their own tile servers, possibly even officially (e.g. offering their tile server as [one of the] "official" tiles)? They (or, as I'm involved in both, "we") have a much larger financial and computing background, and maybe it's possible to get it done as an ongoing grant or such, especially since Wikipedia uses tiles as well - particularly if we are about to think of rejecting so-far-valid requests in the future.

Another thing that occurred to me is: how simple is it now for someone to start a tile server? Are there pre-packaged appliances for the various cloud providers (I'm not familiar with them)? How easy is it for a fair-playing company to bring up their own server and pay for it without having a professional map tech guy doing it? Maybe many more would actually do it, provided it's painless to do.

Also, @lonvia mentioned that it's painful that some people administer these free servers for free, but I believe that's a financial-administrative problem and not a technical one; if the problem is that there is a cost which isn't covered, then this cost should be examined and used as input for decisions. Maybe external entities would donate to that specific case: running public free tile servers. (At the risk that nobody would donate to other costs, I must note.)

I am not completely sure that generally rejecting selected users (be that commercial, closed source, or else) would much help OSM in the long run. I'm a tech guy: if there are tech problems (bandwidth, webserver capacity, cache efficiency, rendering capacity, etc.) they should be identified and examined; if there are financial problems they should be identified and examined. Preferably separately, first.

When we believe that we have a vague problem of "plenty of specific requests are visible, they must cause some general nonspecific problems, let's see how we can get rid of them", maybe that's not the right path to the solution. It is definitely the easiest, I accept.

grinapo commented 8 years ago

@zerebubuth It's not fair to tag these user agents as "crawlers", since every Perl program displaying maps using the standard libs would use that user agent by default, and I guess it's the same for Android and iOS, too. I doubt this is intentional, and I would say education is closer to the solution. (Which would, in the optimal case, happen just the same if we were rejecting the generic UAs and the devs changed them to specific ones: the amount of requests would be the same.) This would basically punish the not-very-experienced developers (and even more their user base). Do we want to do that?

cleder commented 8 years ago

As I see it the main problem is that the tile render servers are heavily utilized by 3rd parties (the long tail).

I do not think that these 3rd parties need to have the 'freshest tiles' on every request; a 6 (12, 24?) hour old version of the tiles would be acceptable for these users. This can be achieved by putting tile caches in front of {s}.tile.openstreetmap.org.

For mapping purposes some websites and apps need the 'freshest tiles', so these sites should be served the same way as today. As there are fewer 'official mapping websites and apps' than there are 3rd party consumers, IMHO the 'fresh tiles' should be served from something like {s}.no-cache-tile.openstreetmap.org. On these domains a stricter deny/throttle policy can be enforced. The clients for these services have to update their URLs.

The suggested no-cache-tile.openstreetmap.org vs tile.openstreetmap.org split makes it possible for the long tail to access the OSM tiles as before; the instructions on thousands of websites on how to configure OpenLayers or Leaflet will stay up to date without having to change anything, and only a few well-known services have to update.

I see that the setup and maintenance of the tile caches could be labour intensive and requires of course some new infrastructure. Ideally a corporate sponsor could donate money for the operation and man hours for the setup of the infrastructure and ongoing maintenance of the cache servers.

I have no idea how much effort would be involved in the above. I certainly think that this route should be explored and the costs and effort estimated.

Firefishy commented 8 years ago

We have 20x globally distributed caches. See: http://dns.openstreetmap.org/tile.openstreetmap.org.html , hardware is listed here: https://hardware.openstreetmap.org/#tile-caches

Firefishy commented 8 years ago

The tile.openstreetmap.org uses GeoDNS to point to a local cache to ensure fast response and regionally hot cache. We peak at just over 1Gb/s traffic outbound from caches: http://munin.openstreetmap.org/openstreetmap/tile.openstreetmap/index.html

Firefishy commented 8 years ago

The 20x caches are monitored and automatically rebalanced if there is an outage. Same for rendering backends.

Firefishy commented 8 years ago

To give a sense of the scale of the traffic tile.openstreetmap.org serves: we currently serve over 6800GB/day in over 528 million requests (an average of around 13.6KB per request).

molind commented 8 years ago

The 20x caches are monitored and automatically rebalanced if there is an outage. Same for rendering backends.

Do I understand right that rendering servers use separate caches, and there is no sync between caches?

Firefishy commented 8 years ago

Do I understand right that rendering servers use separate caches, and there is no sync between caches?

The caches do peer with nearby caches.

The stack is: client <-> tile cache server <-> render backend server.

Each tile cache server (20) has memory and a filesystem cache. Each render backend server (3) has a local disk cache and renders on demand if missing.

molind commented 8 years ago

Is there replication of cache between rendering backend servers?

Firefishy commented 8 years ago

Is there replication of cache between rendering backend servers?

No. The render backend servers cache is "hot" for the regions they serve. eg: Sticky, tile-cache-A normally uses render-backend-A, tile-cache-B normally uses render-backend-B... tile-cache-B might switch to render-backend-A if there is an outage of B, but will switch back when B is available again.

Syncing cache between the backends would be non-trivial and a fairly expensive (CPU + IO + Network) operation. 2 of the 3 render backend servers currently use nearly 100% of the available disk space for cache.
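
Not the actual configuration, but a toy sketch of the sticky-with-failover routing described above, with invented names:

    # Each tile cache prefers its usual render backend and only moves to
    # another one while the preferred backend is down.
    PREFERRED_BACKEND = {
        "tile-cache-A": "render-backend-A",
        "tile-cache-B": "render-backend-B",
    }
    BACKENDS = ["render-backend-A", "render-backend-B", "render-backend-C"]

    def pick_backend(cache, healthy):
        """Return the render backend a tile cache should talk to right now."""
        preferred = PREFERRED_BACKEND.get(cache)
        if preferred in healthy:
            return preferred              # normal, sticky case
        for backend in BACKENDS:          # failover while the preferred one is down
            if backend in healthy:
                return backend
        raise RuntimeError("no render backend available")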

yvecai commented 8 years ago

To follow @cleder's tile. / no-cache. suggestion, what are the actual cache settings on the CDN servers?

ltog commented 8 years ago

My understanding is that developers may not always be able to influence either the Referer or the User-Agent, e.g. when setting up a website which loads tiles using Leaflet/OpenLayers.

Could we update the tile usage policy to ask (not require) developers to make their contact data available through their client's requests, e.g. by appending a URL parameter that will be visible in the log files? Something like https://c.tile.openstreetmap.org/8/133/91.png?contact=admin@coolmap.com ?

We could extend the tile usage policy like this:

As a courtesy, we ask you to provide us with your contact details. This will give you the following advantages: if the usage pattern of your application is problematic, it gives us the chance to contact you and find a solution instead of just blocking your users. Further, the usage of the contact= parameter may become mandatory in the future. By using it today, you will save yourself the need to adjust tomorrow.

If enough developers implement this, we should have a clear picture of who is using how much bandwidth.

The tile usage policy already states

Below are the minimum requirements that users of tile.openstreetmap.org must adhere to. These may change in future, depending on available resources. Should any users or patterns of usage nevertheless cause problems to the service, access may still be blocked without prior notice. We will try to contact relevant parties if possible, but cannot guarantee this.

but obviously it's difficult to contact the responsible parties.
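
If the contact= convention were adopted, pulling it back out of the request logs could be as simple as something like this (a sketch, assuming the Apache combined log format; the parameter itself is only a proposal at this point):

    # Tally the proposed contact= parameter from logged request lines.
    import sys
    from collections import Counter
    from urllib.parse import urlparse, parse_qs

    contacts = Counter()
    for line in sys.stdin:
        fields = line.split('"')
        if len(fields) < 2:
            continue
        request = fields[1]    # e.g. 'GET /8/133/91.png?contact=admin@coolmap.com HTTP/1.1'
        parts = request.split()
        if len(parts) < 2:
            continue
        query = urlparse(parts[1]).query
        for contact in parse_qs(query).get("contact", []):
            contacts[contact] += 1

    for contact, count in contacts.most_common():
        print(count, contact)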

gehel commented 8 years ago

@grinapo WMF tends to have a fairly liberal view on third parties using our services, so yes, in principle it does make sense for WMF to expose its tiles to anyone, with the usual limitations (we reserve the right to block abusive / excessive traffic, we ask that anyone planning to send a statistically significant amount of traffic makes contact first, ...). We still have a few things that need to be sorted out on our side. You can follow the discussion on https://phabricator.wikimedia.org/T141815.

pnorman commented 8 years ago

My understanding is that developers may not always be able to influence either the Referer or the User-Agent, e.g. when setting up a website which loads tiles using Leaflet/OpenLayers.

If they're setting up a website, then browsers will send Referer headers by default, which is fine.

Is there replication of cache between rendering backend servers?

No. #101 discusses this in more detail, I'd say progress on that is in the hands of developers right now, not ops.

mmd-osm commented 8 years ago

Many commercial providers these days require developers to sign up and get some sort of API key for their application / web site. Maybe this could be an approach to better control overall tile usage, or to handle cases where referers and user agents are unavailable, and most importantly, to have a feedback channel rather than a black hole.

I don't know if we should go as far as providing some paid plans (and create direct competition with commercial providers). We could as well just defer power users to commercial offerings once they have used up their quota. Also, paid plans would raise questions about SLAs.

Regarding "legacy" users: maybe we could treat all those clients without APIKEY as some kind of "micro plan" user with very low usage limits, giving developers some incentive for signing up.

On a Mapzen page, I read that those API keys can even be used in a way that is cache-friendly, which was one of my initial concerns before writing this.
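
Purely to illustrate the shape of such a scheme - the plans, limits and the anonymous "micro plan" default below are all invented, not an actual proposal:

    # Hypothetical per-key daily quota check.
    DAILY_LIMITS = {"micro": 10_000, "signed-up": 250_000}
    KEY_PLANS = {"example-key-123": "signed-up"}   # invented key -> plan mapping
    usage_today = {}

    def allow_request(api_key):
        """Return True if this request fits within the key's daily quota."""
        plan = KEY_PLANS.get(api_key, "micro")     # unknown or missing key => micro plan
        key = api_key or "(anonymous)"
        used = usage_today.get(key, 0)
        if used >= DAILY_LIMITS[plan]:
            return False                           # over quota: defer to other providers
        usage_today[key] = used + 1
        return True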