openstreetmap / operations

OSMF Operations Working Group issue tracking
https://operations.osmfoundation.org/

Adding API key support for tile.osm.org? #342

Closed iandees closed 2 years ago

iandees commented 4 years ago

Hi all!

I'm wondering if you'd be open to the idea of gradually requiring an API key to use tile.osm.org tiles.

I've built a couple of simple API key systems (Nextzen tiles being a good public example) and it's not too hard, given access to logs (for counting accesses per API key) and a spot somewhere along the serving chain to block keys. I'd be willing to spend some time working on this, but I wanted to get the general ideas squared away before thinking more about it.

If this seems interesting, please read on. If you think API keys are a non-starter, let me know why.


API Key Website

I'd start by building a service that does simple CRUD of API keys, users, etc. This would be separate from openstreetmap-website, but users would log in with their OSM credentials (probably via OAuth). Each user could generate as many API keys as they wanted, give them names, limit the referrers/origins for each key, and disable or delete keys. This website might also be where metrics on key usage would be displayed.
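To make this concrete, here's a rough sketch (Python, purely illustrative - none of these names exist anywhere yet) of the kind of per-key record this site might store:

```python
# Illustrative only: a per-key record the key-management site might store.
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class ApiKey:
    key: str                       # opaque random token shown to the user
    owner_osm_uid: int             # OSM account that created it (via OAuth)
    name: str                      # human-readable label, e.g. "my blog map"
    allowed_referrers: list[str] = field(default_factory=list)  # empty = any
    daily_tile_limit: int = 100_000
    disabled: bool = False
    created_at: datetime = field(default_factory=datetime.utcnow)
```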

Access Counter

Once the API keys are generated, a separate piece would grep the logs at an interval and count the number of requests (and probably response code) per API key. It would throw the aggregated stats into the database above.
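Roughly like this, assuming combined-log-format access logs and a hypothetical ?key= query parameter carrying the API key:

```python
import re
from collections import Counter

# Assumes combined-log-format lines and a hypothetical "?key=..." parameter.
LINE_RE = re.compile(r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" (?P<status>\d{3}) ')
KEY_RE = re.compile(r'[?&]key=(\w+)')

def count_requests(log_path: str) -> Counter:
    """Aggregate (api_key, status) pairs from one access-log file."""
    counts: Counter = Counter()
    with open(log_path) as f:
        for line in f:
            m = LINE_RE.search(line)
            if not m:
                continue
            k = KEY_RE.search(m.group("path"))
            key = k.group(1) if k else "<anonymous>"
            counts[(key, m.group("status"))] += 1
    return counts
```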

Blocking Cron

A separate periodic cron would run to pick out API keys that are over their limits and add them to a list of blocked API keys. Normally I'd serve this on a static file store like S3, but perhaps it could be a well-cached API endpoint.
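Something like the following, where the table and column names and the S3-style bucket object are all invented for illustration:

```python
import json

# Invented schema: api_keys(key, daily_tile_limit, disabled),
# usage_today(key, requests). "db" is a DB-API connection and
# "bucket" an S3-style object store.
def publish_blocked_keys(db, bucket) -> None:
    rows = db.execute(
        """SELECT k.key FROM api_keys k
           JOIN usage_today u ON u.key = k.key
           WHERE u.requests > k.daily_tile_limit OR k.disabled"""
    ).fetchall()
    blocked = sorted(row[0] for row in rows)
    # One small, heavily-cacheable JSON object for the edges to poll.
    bucket.put_object(Key="blocked-keys.json",
                      Body=json.dumps({"blocked": blocked}))
```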

Access Denied

Finally, the edge caches would need some piece, as close to the beginning of the request cycle as possible, that would block the request if the API key didn't exist, if it was on the list of blocked API keys, or if the request's Referer header wasn't on the allowed list for that key. This piece would keep the list of blocked API keys in memory and update itself every few minutes from the service above.
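As a sketch, the edge-side piece could be as small as this (the endpoint URL, refresh interval, and JSON shape are assumptions matching the blocking-cron sketch above):

```python
import json
import threading
import time
import urllib.request

class KeyBlocklist:
    """In-memory blocked-key set, refreshed from the central service."""

    def __init__(self, url: str, refresh_secs: int = 300):
        self.url = url
        self.blocked: frozenset[str] = frozenset()
        threading.Thread(target=self._refresh_loop,
                         args=(refresh_secs,), daemon=True).start()

    def _refresh_loop(self, refresh_secs: int) -> None:
        while True:
            try:
                with urllib.request.urlopen(self.url) as resp:
                    self.blocked = frozenset(json.load(resp)["blocked"])
            except (OSError, ValueError):
                pass  # keep serving with the last known list
            time.sleep(refresh_secs)

    def allow(self, key: str | None) -> bool:
        # Deny if the key is missing or blocked; the per-key referrer
        # check would slot in here as well.
        return key is not None and key not in self.blocked
```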


We'd start by not requiring an API key. Once this system is up and we're happy with it (maybe by testing privately and then with a key for OSM.org) we could announce the change, work with major (friendly) users to get them transitioned, and then several months later turn off anonymous tile access.

tomhughes commented 4 years ago

The fundamental problem is that it requires a distributed database accessible and updatable by 40+ tile caches.

Then there's the question of how much extra CPU and I/O overhead will be involved in validating each request and in updating access counters and how much that will reduce our total tile serving capacity by.

I know you're going to say that is not what you are proposing, but I don't think that a scheme that relies on log processing is workable - it would take in excess of 24 hours to block and unblock keys as things stand and it's not clear to me that there is an easy path to reducing that.

There is also the problem of dealing with the inevitable support load that it will produce - we already effectively ignore the vast majority of enquiries related to tile serving and API keys will only increase that workload.

After all of that will it even work? I don't have any experience of running these things but I would assume that the API keys leak like crazy... Especially the OSM one which will be the obvious one to "borrow" and which will presumably be unlimited.

pnorman commented 4 years ago

I'd like to see API keys, but we need plans for handling the engineering, operations, and support load from it.

lonvia commented 4 years ago

If we go for API keys, then the Nominatim API should get them as well.

Personally, like @tomhughes, I dread the additional engineering and support work load. Happy to be proven wrong though.

grischard commented 4 years ago

API Keys could be limited to a certain referer or user agent?

freyfogle commented 4 years ago

Speaking as someone who runs a service with key-based authentication, I can tell you there are many people who will gladly register 10, 20, 200 accounts to get more keys and not have to spend money using a commercial service. People will write software to get the keys for them (i.e. register fake accounts), or get large groups of people (an entire university class, etc.) to register.

Requiring keys is certainly more of a barrier than nothing, and it can at least prevent accidental high usage from well-intentioned users, but it alone does not solve the problem of abusive users.

mikelmaron commented 4 years ago

@freyfogle definitely an issue, but I wonder if we can estimate how much of an issue deliberate abuse is for OSM. My vague impression from chatting about it before is that there are several high-volume users who are well-intentioned. Also, if there is abuse, then that can lead to domain-level blocks.

@grischard API keys could be limited like that; it all depends on the implementation.

The fundamental problem is that it requires a distributed database accessible and updatable by 40+ tile caches.

@tomhughes makes a key point. I'm asking around how this is managed.

migurski commented 4 years ago

Free tiles are like banner ads for OSM. I'd be much more interested to see capacity raised and access preserved than limits put in place. Is there any background reading on needs for tile auth that you could link here to give this idea a little more context?

iandees commented 4 years ago

The fundamental problem is that it requires a distributed database accessible and updatable by 40+ tile caches.

Yep, there will need to be some way of aggregating the access counts. If we don't want to do log aggregation and count from access logs, we could add the counting to the blocking service. It could aggregate and periodically write to the database.

Then there's the question of how much extra CPU and I/O overhead will be involved in validating each request and in updating access counters and how much that will reduce our total tile serving capacity by.

This is a valid concern, but it seems that there's plenty of spare CPU on the edge caches right now. The API key check would boil down to parsing the HTTP request text (to get query string + headers) and an in-memory map lookup to check if the API key is blocked.

If we aggregate on the edge cache, we'd have to store several integers per API key seen in the last N minutes by this edge, but unless we have millions of different API keys per edge I doubt it'll be a problem. If it takes up too much memory we can increase the frequency of the aggregate + dump step.
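For illustration, that aggregate + dump step could be a lock-protected in-memory counter that a timer flushes to the central database:

```python
import threading
from collections import Counter

class EdgeCounter:
    """Per-edge request counts with a periodic aggregate + dump step."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counts: Counter = Counter()

    def record(self, api_key: str) -> None:
        # Called on every request; just an in-memory increment.
        with self._lock:
            self._counts[api_key] += 1

    def flush(self) -> Counter:
        # Swap out the counts accumulated since the last flush; a timer
        # would ship the returned batch to the central database.
        with self._lock:
            counts, self._counts = self._counts, Counter()
        return counts
```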

There is also the problem of dealing with the inevitable support load that it will produce - we already effectively ignore the vast majority of enquiries related to tile serving and API keys will only increase that workload.

I think you could continue to ignore the vast majority of support requests. In fact an API key self-service page might reduce support requests because there will be a process for how to use the tiles. Perhaps when you create an account or an API key you have to agree to the tile usage policy, helping clear up some questions? Additionally, we could give more people the ability to adjust limits through the API Key tool to alleviate support request burden from just a few people.

I would assume that the API keys leak like crazy... Especially the OSM one which will be the obvious one to "borrow" and which will presumably be unlimited.

The OSM one would be limited to OSM referrers and could be rotated on a periodic/automatic basis. Tile scrapers and other bad actors will surely find a way around it, but my assumption is that the majority of unwanted traffic comes from people setting up Leaflet and pointing to tile.osm.org for lack of a free alternative.

if we go for API keys, then the Nominatim API should get them as well

Yeah, good point. The API key check could sit in front of any HTTP service and act as a reverse proxy for anything that handles HTTP requests. It could be configured at startup with the name of the service it's limiting, so that aggregation in the central DB happens on a per-service basis.
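To illustrate the shape of it (a toy sketch only: check_key stands in for the blocklist lookup shown earlier, and SERVICE_NAME and UPSTREAM would be startup configuration):

```python
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

SERVICE_NAME = "nominatim"          # reported with usage stats, per service
UPSTREAM = "http://localhost:8080"  # the service being protected

def check_key(key):
    return key is not None          # placeholder for the real blocklist lookup

class KeyCheckingProxy(BaseHTTPRequestHandler):
    def do_GET(self):
        key = parse_qs(urlparse(self.path).query).get("key", [None])[0]
        if not check_key(key):
            self.send_error(403, "missing or blocked API key")
            return
        # Error handling elided: urlopen raises on upstream 4xx/5xx.
        with urllib.request.urlopen(UPSTREAM + self.path) as resp:
            body = resp.read()
            content_type = resp.headers.get("Content-Type", "text/plain")
            status = resp.status
        self.send_response(status)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("", 8000), KeyCheckingProxy).serve_forever()
```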

API Keys could be limited to a certain referer or user agent?

Yep, the Referer, Origin, and User-Agent headers are all sent with the request and could be checked at request time.

people who will gladly register 10, 20, 200 accounts to get more keys and not have to spend money using a commercial service

Yea, this is something to think about. If this becomes a widespread problem, there are various levers we could use to make it a little harder by switching to login with an account that takes a little more work to create (Google, GitHub, etc.), requiring phone number verification, etc.

tomhughes commented 4 years ago

Well I assume in this case it's that @iandees as an American gets poor service because we have very limited capacity in North America especially since we lost one of our caches there a few weeks ago.

iandees commented 4 years ago

an American gets poor service because we have very limited capacity in North America

😄 Actually it's been noticeably faster after Grant switched over to Nginx caching.

Free tiles are like banner ads for OSM. I'd be much more interested to see capacity raised and access preserved than limits put in place. Is there any background reading on needs for tile auth that you could link here to give this idea a little more context?

I agree that free tiles are a great advertisement for OSM. Maybe I'll open another ticket about increasing capacity (especially in the US), but my feeling is that we're at the limit of what we can support in a purely volunteer-time/donated-hardware scenario.

I'm not aware of any prior reading on this topic. I think we're creating it right now! 🎉

tomhughes commented 4 years ago

Then there's the question of how much extra CPU and I/O overhead will be involved in validating each request and in updating access counters and how much that will reduce our total tile serving capacity by.

This is a valid concern, but it seems that there's plenty of spare CPU on the edge caches right now. The API key check would boil down to parsing the HTTP request text (to get query string + headers) and an in-memory map lookup to check if the API key is blocked.

If you think there's lots of spare CPU capacity then you're clearly being selective in which caches you look at - it varies massively. It's also changed recently: we have in the last week gone multithreaded on squid, which has most likely increased CPU usage.

There is also the problem of dealing with the inevitable support load that it will produce - we already effectively ignore the vast majority of enquiries related to tile serving and API keys will only increase that workload.

I think you could continue to ignore the vast majority of support requests. In fact an API key self-service page might reduce support requests because there will be a process for how to use the tiles. Perhaps when you create an account or an API key you have to agree to the tile usage policy, helping clear up some questions? Additionally, we could give more people the ability to adjust limits through the API Key tool to alleviate support request burden from just a few people.

My concern is that there will be a whole new class of requests, namely begging emails asking for capacity increases, expedited releases of blocks, etc, etc. It will all be "you're ruining my mapping party" or "people might die" or...

I would assume that the API keys leak like crazy... Especially the OSM one which will be the obvious one to "borrow" and which will presumably be unlimited.

The OSM one would be limited to OSM referrers and could be rotated on a periodic/automatic basis. Tile scrapers and other bad actors will surely find a way around it, but my assumption is that the majority of unwanted traffic comes from people setting up Leaflet and pointing to tile.osm.org for lack of a free alternative.

I'm not sure referer checks are very useful - everybody fakes osm.org as the referer already.

API Keys could be limited to a certain referer or user agent?

Yep, the Referer, Origin, and User-Agent headers are all sent with the request and could be checked at request time.

Many of the requests have no, or very poor, referer and/or user agent information, even though that's already against policy.

In particular, mobile app traffic will rarely have any referer and often has very poor user agent information.

people who will gladly register 10, 20, 200 accounts to get more keys and not have to spend money using a commercial service

Yea, this is something to think about. If this becomes a widespread problem, there are various levers we could use to make it a little harder by switching to login with an account that takes a little more work to create (Google, GitHub, etc.), requiring phone number verification, etc.

I don't think requiring a third-party account would go down very well, and I have no idea how you expect us to do phone number verification even if we thought it was a reasonable thing to require.

In any case, Gmail accounts are the one thing every spammer has by the thousand...

simonpoole commented 4 years ago

Sigh, broken record mode:

First the OSMF needs to decide what the intended audience for the tiles is and whether that needs expanding or reducing (hint: the load is not caused by people editing), what kind of service level it wants to provide, and how much it's allowed to cost.

Then we can decide if API keys are a suitable and efficient way to achieve whatever the goal is.

PS: @zerebubuth had some in-depth stats on tile usage way back; IIRC the conclusion was that while getting rid of the large users was relatively easy and would lead to short-term relief, in the end the long tail is becoming more and more of a burden (and that was before Google's recent price hike).

iandees commented 4 years ago

Sigh, broken record mode

Instead of an exasperated response, could you point out where the previous discussion on this topic happened?

I don't really want to get into the politics, but I disagree that OSMF needs to decide anything new. There is already a tile usage policy and this would be a way to enforce that.

simonpoole commented 4 years ago

There are discussions on the topic spread everywhere; for example, Darafei Praliaskouski (@Kompza) had expanding tile server capacity as part of his election platform when he stood for the board.

hbogner commented 4 years ago

As a provider of 2 cache servers and 1 render server (OSMF provided the SSD drives), I have mixed feelings about implementing API keys. Those servers were provided so OSM could be used by anyone, but I also understand the load on the operations team to handle all the maintenance. @pnorman asked me a similar question at SotM 2019 and I still don't know how I would feel about dropping those CDN servers or locking them behind API keys.

zerebubuth commented 4 years ago

PS: @zerebubuth had some in-depth stats on tile usage way back; IIRC the conclusion was that while getting rid of the large users was relatively easy and would lead to short-term relief, in the end the long tail is becoming more and more of a burden (and that was before Google's recent price hike).

My recollection is that, while there are usually a few heavy (ab)users to ban at any one moment in time, the list changes day-to-day and week-to-week. Keeping on top of that is a considerable effort. Basically, it's a game of whack-a-mole (see #113).

It seems to me that there are two separate issues here:

  1. Total available resources: I think the reason we're talking about this issue today is that, as Tom mentioned earlier, we're currently under-resourced in one region (and perhaps others, but if so they're not shouting as loud). The implication is that, if we had enough resources to run a good enough service, we wouldn't really care about access controls or being the world's tile server.
  2. Allocation of scarce resources: However, given that no one has come forward (cf. the plea in #335) to donate a tile cache, we are left with the option of trying to direct the scarce resources we have towards deserving users (e.g. OSM contributors, visitors to osm.org, etc.).

While I think the latter would help, my reckoning is that it'd be at best a constant-factor improvement, and we'd be back here a little later when usage from deserving users eventually exceeds capacity. My personal view is that we need to be able to add capacity in regions where there aren't enough generous donors forthcoming, which means spending OSMF donors' money in a way that we haven't in the past (currently, all tile caches are donated).

Since this is a cost that could potentially grow without limit, I don't think it's unreasonable to ask the board what it thinks (although it might be worth constraining the options, so that we don't get something unworkable in response).

Having said that, I also think that API keys would be a good idea, if only to be able to more easily track and attribute usage to accounts, rather than the pile of hacky ack and sed that I've used in the past.

(As an aside, I'd be really interested in whether people who've run large services protected by API keys, and perhaps referrers, see a lot of API key cloning and referrer spoofing? People certainly spoof the User-Agent and Referer headers for requests to tile.osm.org - one of the reasons why any analysis of the logs should be taken with a large mountain of salt. I'd worry that any constant API key would suffer the same problem, but that varying the API key with user ID or source IP starts to make things very complicated.)

Finally, if anyone is reading this and would like to donate a tile cache in North America -- thank you! And please get in touch with operations@osmfoundation.org

simonpoole commented 4 years ago

...

I don't really want to get into the politics, but I disagree that OSMF needs to decide anything new. There is already a tile usage policy and this would be a way to enforce that.

The tile usage policy essentially only outlaws a (for good reason) undefined "heavy use". As I pointed out, @zerebubuth showed that the longer-term issue (and this was 2-3 years back) is the long tail, where none of the users is remarkable on its own. So either we start lowering the volume that in our understanding is "heavy use" (something that a number of people don't want for marketing reasons) or we try to do something else. But having a crisis every 6 months doesn't make any sense at all.

simonpoole commented 4 years ago

...

Since this is a cost that could potentially grow without limit, I don't think it's unreasonable to ask the board what it thinks (although it might be worth constraining the options, so that we don't get something unworkable in response).

There are other considerations, for example competing with commercial providers, which will become much more critical once we start providing vector tiles. Given those, I can't see how muddling along without a clear plan has any life left (I'm naturally not suggesting that the board should decide this in isolation; that's just a way of saying that the project needs to make up its mind, facilitated by the board).

Firefishy commented 4 years ago

One of the issues I have at the moment is that there is no realistic method for me to contact a heavy user once I've grep/awked enough to identify them.

e.g. let's say I find example.com is a heavy user: who do I contact? At the moment the best I can do is try to find a reasonable contact at example.com and email them. Apps are even more difficult, especially if they have a poor or faked user-agent.

freyfogle commented 4 years ago

(As an aside, I'd be really interested in whether people who've run large services protected by API keys, and perhaps referrers, see a lot of API key cloning and referrer spoofing? People certainly spoof the User-Agent and Referer headers for requests to tile.osm.org - one of the reasons why any analysis of the logs should be taken with a large mountain of salt. I'd worry that any constant API key would suffer the same problem, but that varying the API key with user ID or source IP starts to make things very complicated.)

Hi, I can only speak for our geocoding service, which is obviously a different use case than tiles - not least as it is usually not run as a publicly-visible service. We have seen stealing of keys, but the bigger issue is people registering many (at times hundreds) accounts to get the free usage tier on each. These efforts range from basic and easily detected (people manually registering username+1@domain.com, username+2@domain.com) to people using software to register hundreds of accounts, each from different IPs. Often they use Google or GitHub single sign-on; they have pools of hundreds of addresses. User-agent is useless; it is trivial to fake. And with the increasing popularity and ease of use of serverless frameworks, it also becomes easier and more affordable for someone to have a large, and constantly changing, pool of IP addresses at their disposal.

A final point of frustration is that many people set up a service to blast away, and then take months to notice that their account was blocked. They have some script hammering away at us and never check on it. I can totally understand @Firefishy when he says he needs someone to contact, and that will help where the overuse is accidental, but industrialized spammers are definitely not checking the hotmail address they used to register and get the key.

It is an endless arms race without simple solutions, and I have to admit at times the never-ending nature of the battle becomes a bit demoralizing.

Nakaner commented 4 years ago

Conflict of interest disclaimer: I am an employee of a company selling tiles using almost the same map style as tile.openstreetmap.org and tile.openstreetmap.de (tile.openstreetmap.de disallows almost all commercial use).

@simonpoole wrote:

There are discussions on the topic spread everywhere; for example, Darafei Praliaskouski (@Kompza) had expanding tile server capacity as part of his election platform when he stood for the board.

A long discussion of how a new tile server usage policy should look can be found at https://github.com/openstreetmap/operations/issues/113 - without any result/changes, but with at least two fundamentally opposing opinions.

@Firefishy wrote:

One of the issues I have at the moment is that there is no realistic method for me to contact a heavy user once I've grep/awked enough to identify them.

e.g. let's say I find example.com is a heavy user: who do I contact? At the moment the best I can do is try to find a reasonable contact at example.com and email them. Apps are even more difficult, especially if they have a poor or faked user-agent.

The Tile Usage Policy could require contact info on any website using the tiles. In Germany, almost all websites are required to provide an "imprint" (German "Impressum") at an easy-to-find location (i.e. every subpage has to link to it using "Impressum" as the link text, though the font size might be small and the link hidden in the footer). The imprint has to provide a postal address and email. A postal address would not really help us, but requiring an easy-to-find email address could be a sensible requirement of the guideline (and banning sites where it is missing or difficult to find could be an option).

@mmd-osm wrote:

Let's bake some ads for "free osm tile services provided by OSMF" right into the PNG tiles, where they are impossible to remove for normal users and ad blockers so we can reach out to 100% of our tile users.

There are two ways of using tiles for advertisement: overlaying text on a tile, or serving a replacement tile like the black German anti-EU-upload-filter map tiles this year. The first requires additional resources (calling an image library such as libgd); the second fully blocks the view of the map. The second could reduce the load on the server because people migrate towards other sources (i.e. we move the load onto other free sources :-( ). If we consider any of these options, we should discuss them in a separate ticket.

That's ads for our own free services only, with some hint to donate to OSMF to keep the service up and running. Saves us a lot of time which we would have to spend on setting up some API key scheme.

I presume that both overlays and pure advertisement tiles are easier to implement than an API key solution which interacts with all CDN nodes.

simonpoole commented 4 years ago

..

A long discussion how a new tile server usage policy should look like can be found at #113 without any result/changes but at least two fundamentally opposing opinions.

Thanks Michael, that's the thread with the numbers @zerebubuth and I were referring to.

kocio-pl commented 4 years ago

Although we run some campaigns once in a while, those ads are only visible to osm.org users without an ad blocker. We have to change this.

We could mitigate the issue using "acceptable ads" idea:

https://adblockplus.org/en/acceptable-ads

It means we would talk directly with the different ad blockers and meet their criteria; users would still be free to disable these ads too. But this is at least a realistic way to do something, since there are probably only a few big ad blockers, and IIRC this feature (showing acceptable ads) is on by default.

kocio-pl commented 4 years ago

I see. However, that is low-hanging fruit, I guess.

pnorman commented 4 years ago

Adding an overlay to tiles seems a bit far afield given the current situation.

Scanning the thread again, there seems to be support from sysadmins and ops for API keys but concern over increased support load.

karussell commented 4 years ago

I think API keys have several advantages, like "usage fairness", and also that users think twice before they implement a heavily-requesting bot or something.

But API keys introduce a lot of complexity. You do not necessarily need a distributed database, but at least a well-performing database that is not required without API keys. Additionally, you need an in-memory cache that syncs frequently from this database, and an async queue that feeds the database, to avoid hitting the database for every request. Of course, this depends a bit on which performance requirements you have and how you want to scale your database cluster.
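A sketch of that async write path, with a hypothetical db_write_batch callback standing in for the real database write:

```python
import queue
import threading

# Hot path: requests enqueue usage events and never touch the database.
events: queue.Queue = queue.Queue(maxsize=100_000)

def record_usage(api_key: str) -> None:
    try:
        events.put_nowait(api_key)
    except queue.Full:
        pass  # shed accounting load rather than slow down serving

def writer(db_write_batch) -> None:
    # Single consumer: batch events into one database round-trip.
    while True:
        batch = [events.get()]          # block until something arrives
        while len(batch) < 1000:
            try:
                batch.append(events.get_nowait())
            except queue.Empty:
                break
        db_write_batch(batch)

# print stands in for the real database write in this demo.
threading.Thread(target=writer, args=(print,), daemon=True).start()
```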

Another problem is the required email registration and validation cycle, but I guess this is done via the existing OSM account, so no additional burden.

@zerebubuth: While I think the latter would help, my reckoning is that it'd be at best a constant-factor improvement, and we'd be back here a little later when usage from deserving users eventually exceeds capacity.

This is a valid argument: what do you do if you have many users, a certain limit for everyone enforced but still not enough capacity? But e.g. from our own experience this "little later" is far in the future.

Having said that, I also think that API keys would be a good idea, if only to be able to more easily track and attribute usage to accounts, rather than the pile of hacky ack and sed that I've used in the past.

If this is the problem, then API keys are not the solution. One solution is to introduce a central logging service (collecting logs from all servers) or at least a cronjob that produces some stats per server and per day.

@mmd-osm: Black tiles or replacement tiles with some text look really broken, so that's not a feasible option here.

Why? Free resources will always be misused and should be limited so that people try hard to avoid hitting the limits. Or one could even introduce a capacity raise for an OSMF donation ;)

@Firefishy: One of the issues I have at the moment is that there is no realistic method for me to contact a heavy user once I've grep/awked enough to identify them.

If they don't care about making heavy requests, why should you invest time finding them? It should work the opposite way: to get unblocked, they should come to you. After all, it is a nice & free service they are abusing.

@freyfogle: is people registering many (at times hundreds) accounts to get the free usage tier on each

We fight those people via a blacklist of temporary-email providers, and by blocking the creation of more than X new accounts from the same IP within Y hours.
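For illustration, such a signup throttle could be as simple as this (the limits and the in-memory store are made up; a real one would live in the database or a shared cache):

```python
import time
from collections import defaultdict, deque

MAX_SIGNUPS = 3                 # X: accounts allowed per IP...
WINDOW_SECS = 24 * 3600         # ...per Y hours
_recent: dict[str, deque] = defaultdict(deque)

def allow_signup(ip: str) -> bool:
    now = time.time()
    seen = _recent[ip]
    while seen and now - seen[0] > WINDOW_SECS:
        seen.popleft()          # forget signups outside the window
    if len(seen) >= MAX_SIGNUPS:
        return False
    seen.append(now)
    return True
```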

Stealing keys is a smaller issue and can mainly be solved by allowing additional limits on the API key regarding IP, or by enforcing the HTTP Referer.

My preference is still pro-"API keys" but be aware of the additional work and problems.

zerebubuth commented 4 years ago

Having said that, I also think that API keys would be a good idea, if only to be able to more easily track and attribute usage to accounts, rather than the pile of hacky ack and sed that I've used in the past.

If this is the problem, then API keys are not the solution. One solution is to introduce a central logging service (collecting logs from all servers) or at least a cronjob that produces some stats per server and per day.

We do collect logs from all the servers. However, the logs can only contain the information that's available to the servers; in this case, source IP, User-Agent and Referer. The latter two are easily faked by non-browsers (i.e: scrapers & mobile apps) and the former is easily bypassed by using Tor, VPNs, farms of cloud machines, etc... (For an example of the lengths some developers will go to, please see https://github.com/openstreetmap/chef/pull/78)

Therefore, while we do collect this information, the logs are mostly useless for the purposes of identifying blockable requests. The caches apply automatic rate limits to clients from the same IP, but this arguably causes as many problems (c.f. mapping parties on a NAT) as it solves.

API keys give us a fourth piece of information, which can be tied directly to a contact or account. That simply pushes the problem "upstream" to the account level. (Where, again, it's very simple to hide the source IP using Tor, VPNs, or just wardriving open WiFi.)

My preference is still pro-"API keys" but be aware of the additional work and problems.

I agree - the existing workflow of "read logs → identify bad actors → block them → repeat" hasn't been working, especially as the "long tail" of large numbers of smaller sites/users continues to grow. Any new workflow would have to be more automated, or fully automatic, to be practical.

pnorman commented 4 years ago

I've been wondering if we should write up a doc and look for a contractor for this. The lack of any meaningful by-user breakdown makes it very hard to plan and manage the service, and this is increasingly becoming an issue.

Marc-marc-marc commented 4 years ago

An automatically cycling API token (each logged-in user receives a .html with the same current API key, and the key changes every X minutes) works well to give full speed to contributors and limited speed to everyone else (it also gives people a reason to sign up, which could then make it easier for them to make their first contribution). It won't be effective enough to reach professional abusers, but it is a solution with a fairly low cost (just check the current key and the last key); see the sketch below.
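A sketch of how such a cycling token could work: derive the key from a server-side secret and the current time window, so validation only has to check the current and the previous window (all names and values illustrative):

```python
import hashlib
import hmac
import time

SECRET = b"server-side-secret"   # illustrative; protect and rotate in practice
WINDOW_SECS = 600                # X: a new key every 10 minutes

def token_for(window: int) -> str:
    msg = str(window).encode()
    return hmac.new(SECRET, msg, hashlib.sha256).hexdigest()[:16]

def current_token() -> str:
    # Embedded in the .html served to logged-in users.
    return token_for(int(time.time() // WINDOW_SECS))

def is_valid(token: str) -> bool:
    # Accept the current key and the last one, nothing else.
    w = int(time.time() // WINDOW_SECS)
    return any(hmac.compare_digest(token, token_for(x)) for x in (w, w - 1))
```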

tomhughes commented 4 years ago

Like we already have, you mean (it's just in a cookie, not the URL).

pnorman commented 2 years ago

The ops team discussed this a couple of meetings ago.

Although we are interested in the idea in principle, the tile CDN changes have altered a few things since this issue was written.

Given this is not causing any particular pain points right now, we don't feel it would be worth the effort.

If someone still wants to write this themselves to scratch a technical itch, they can get in touch.