The production API is in the cgimap repository, not here.
Closed in favour of https://github.com/zerebubuth/openstreetmap-cgimap/issues/207
From a practical point of view you need to define "edit" somehow - the obvious answer is changesets but that can be easily gamed so it will presumably need to be the number of objects added/changed/deleted.
Doing it per day is probably excessively expensive from an implementation standpoint because it requires resetting counters at some point, or keeping say hourly counts and expiring those over a day old. A token bucket type algorithm like the download limit probably makes more sense.
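To illustrate the idea (this is not a description of the actual download limiter; the capacity and refill rate below are placeholder numbers):

```python
import time

class TokenBucket:
    """Minimal token-bucket sketch: bursts of up to `capacity` object
    changes, refilled continuously at `rate` changes per second."""

    def __init__(self, capacity: float, rate: float):
        self.capacity = capacity
        self.rate = rate
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self, cost: float) -> bool:
        now = time.monotonic()
        # refill for the elapsed time, capped at the burst capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost  # charge this upload's object count
            return True
        return False             # over the limit: reject the upload

# e.g. allow bursts of 1000 changes, refilling at roughly 10k per day
bucket = TokenBucket(capacity=1000, rate=10000 / 86400)
```

No counters need resetting: the elapsed-time refill in `allow()` replaces any daily or hourly bookkeeping.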
From a personal point of view I dread dealing with the resulting complaints...
Actually I think the issue should remain here :-).
At least the obvious implementation - the way stupid me would do it - would involve automatically creating a block once the limit has been exceeded (because all the bits and pieces are already there and we don't want to create a slightly different implementation of the same thing).
I believe this would address @tomhughes' concerns too, as the messaging can be clear about what to do to get unblocked (OK, I believe we still have the issue of iD borking when an account has been blocked, but iD users are unlikely to run into this anyway).
It doesn't matter how clear you make the messaging, you will get people pleading for themselves to be special cased because, you know, PEOPLE MIGHT DIE.
Just to add a bit of detail here: we already have rate limiting on the changeset upload call in cgimap, which is currently based on the number of bytes returned by the call (like it is done for the /map call). Once you hit the limit, the upload is rejected with HTTP 509 Bandwidth Limit Exceeded. It's not a permanent block either, so it wouldn't place too much of a burden on sysadmins.
Admittedly, looking at the number of bytes is not at all effective in stopping the users mentioned here (the main reason being that the diffResult message usually comes with a rather tiny number of bytes compared to a /map call). Maybe we could evolve this approach by defining some more meaningful criteria.
What I was proposing was some sort of equivalent limit on the upload side, whether that's just bytes uploaded, or number of objects changed. But implemented in basically the same way.
we already have rate limiting on the changeset upload call on cgimap
Where does cgimap store the rate limit status? Is it in the database somewhere, or does each instance of cgimap (i.e. on different backend servers) keep track of the rates separately?
The production API is in the cgimap repository, not here.
I think it's worth discussing it here too, particularly if we want to ensure clear messaging to the user when they go over any quota.
From a personal point of view I dread dealing with the resulting complaints...
So let's make sure the complaints go to someone else! I think it would be something for DWG to deal with, rather than the sysadmin group.
They're stored in memcache, so shared across instances but can be lost if a memcache server restarts.
I think it's worth discussing it here too, particularly if we want to ensure clear messaging to the user when they go over any quota.
For clear messaging we'd need something better than "You have downloaded too much data. Please try again later.", which comes with an HTTP 509 Bandwidth Limit Exceeded error response.
Also, both downloads and uploads currently share the same user-id-based key in memcache. A user downloading lots of data might be impacted when trying to upload some changes (and vice versa). I don't know if this is a good idea. On the other hand, if memcache memory permits, we might as well introduce a dedicated key that is only used for changeset upload tracking.
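A rough sketch of such a dedicated key; `cache` stands in for whichever memcache client is used, and the key names and one-hour window are invented for illustration:

```python
def quota_key(user_id: int, operation: str) -> str:
    # e.g. "quota:download:12345" vs "quota:upload:12345", so heavy
    # downloading no longer eats into the upload allowance
    return f"quota:{operation}:{user_id}"

def charge(cache, user_id: int, operation: str, amount: int, limit: int) -> bool:
    """Illustrative only; a real implementation would use an atomic
    incr rather than this racy get/set pair."""
    key = quota_key(user_id, operation)
    used = int(cache.get(key) or 0)
    if used + amount > limit:
        return False                            # over quota for this operation
    cache.set(key, used + amount, expire=3600)  # window expires after an hour
    return True
```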
Maybe we also need to include the "create changeset" call, to keep people from, say, creating thousands of changesets in a very short amount of time.
It could be something that users can override but not accidentally - for example, you could be normally limited to X edits per day (exact numbers t.b.d.) and then you could click a button in your user preferences that says "I have read the data import and mechanical edit guidelines and I want to lift the limitation for one week" or so.
I still don't have a clear picture of what makes sense here. Checking the users table for some timestamp when rate limiting restrictions would be back to normal isn't impossible to implement. Pretty much the same applies to the user registration date or the number of changesets as additional decision criteria. It's mostly a matter of what a meaningful policy could look like, and this also requires some consensus building in the community.
It is certainly good to air the idea with the community a bit, though when it comes to it, I would prefer to implement a technologically viable mechanism (one that doesn't make the source more complex than necessary and doesn't waste too many resources when executed) and then have the community participate in parametrising it, rather than attempt to find consensus on how exactly something should be implemented.
We're not writing a bill of rights here; I think it will be easy to find consensus on the bare outline that I specified initially (make it so that we can't have someone sign up and upload 200k buildings in two days before anybody even notices), and anything more detailed could well be entrusted to those who actually write code ;)
then you could click a button in your user preferences that says "I have read the data import and mechanical edit guidelines and I want to lift the limitation for one week" or so.
You could argue that large scale imports come with some kind of responsibility, perhaps even require a commitment to the project. Instead of people clicking on a button, you might be thinking about whether this is something where an OSMF membership status could come into play.
2012: your plugin for preventing people spending all their time on mapping is harmful. 2019: okay we need to limit mappers' activity.
@Zverik : your "no more mapping" plugin was a prank to kill JOSM altogether. I don't see how this relates to this discussion. Seems rather off topic to me.
It was as much of a prank as the "delete the repository" button on GitHub. And it relates to the discussion: instead of questioning why we should make arbitrary limits on uploading and why the DWG cannot process or even find these super-productive users, we are discussing how to block people from uploading data to OSM automatically. First by number of changesets, then by user-agent, then by geoip and so on - temporary measures completely opaque and puzzling to users, with no warnings, unlike that plugin.
Couldn't you just create another account if you were to hit an upload limit? IP addresses are often dynamically assigned, user agents can be customized at will, VPNs are cheap; what would prevent someone from circumventing arbitrary limits anyway?
Yes, obviously it's not going to stop somebody who is aware that what they're doing is wrong but is determined to go ahead - but that's not the goal, because no technical solution could do that.
The goal is to make the well-intentioned but over-enthusiastic people stop, think, and learn about the community rules around imports and automated edits.
The goal is to make the well-intentioned but over-enthusiastic people stop, think, and learn about the community rules around imports and automated edits.
Or, turning this around, today if someone uploads 200,000 buildings in one day without any prior consultation then it could have been an accident. I want to come to a point where if someone does this, it is obvious that it was done with ill intent.
While editors can, and should, inform their users about potential issues, I think it would also be worth contemplating to have some sort of rate limit on the API.
Come to think of it, we should really focus a bit more on UX and discuss ways on how to guide (new) users without patronizing them too much.
In particular, large scale uploads, or massive moving around of nodes (based on some well meaning clean up effort) could be easily detected in JOSM and combined with other user metrics like number of changesets or date of registration. Also, JOSM knows how much time a user spent in an editing session and can figure out, if it's an "Overpass download, search and replace and upload within 1 minute"-kind of mechanical upload, which the API would never be able to figure out.
On top of that, we can present fully localized messages to the user, along with a nice graphical step-by-step guide through some checklist or similar, rather than some technical HTTP error message the API would return. Don't underestimate the impact of those localized messages: there are lots of people who aren't native English speakers, and quite a few probably don't understand any English at all. The API message would simply cause some annoyance but wouldn't help at all in figuring out what the issue is.
So again, let's focus on the user experience first, think how the user interaction could look like, and what kind of guidance would be reasonable and acceptable.
Are you sure this is the right place to design the JOSM user interface? Is that not something that should be left to the JOSM developers? There's a relationship of course, but I think that we should do this like we did it with the limitation of changeset sizes: The OSM API decides what is acceptable and rejects anything unacceptable with an error message, and it is then the duty of the editors to ideally guide their users so that this error condition is never triggered, or to deal with the error in a suitable fashion should it occur.
Since I'm a JOSM user myself I could certainly sketch a number of possible JOSM UI approaches to get this right, but they would not necessarily apply to other editors that have different UIs - I think that approaching this from a UI perspective is the wrong approach for us in this repository.
We should build a mechanism that lets us reject uploads of "too much in too short a time". We could initially set these limits very high, give editor writers the chance to experiment and adapt their UI if necessary, and then tighten the limits to something sensible. I think that trying to design good UIs for all editors is out of scope here.
I don't want to go into details about the differences between UX and UI design, as it would be a bit beyond the scope of this issue. I'm 100% positive that you need to think this issue through end-to-end, from the editing app to the API, to get meaningful results, rather than just annoy users with arbitrary limits. This is mainly a JOSM issue; I clearly don't see any of the other editors anywhere close to creating a similar amount of data to upload (ignoring command line tools for a second).
From the API side, we can rather easily calculate both the number of changesets a user created during the last "n" hours and the total number of changes via some database statement (sketched below). Number of changes would include any kind of operation (creating/updating/deleting) without knowing any further details.
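One possible shape for that statement, assuming the Rails schema's changesets table with its num_changes counter; the connection handling is psycopg-style and purely illustrative:

```python
from datetime import datetime, timedelta, timezone

def recent_activity(conn, user_id: int, hours: int = 24):
    """Count a user's changesets and total changes in the last `hours`
    hours. Sketch only: the window length is a placeholder and the
    num_changes column is assumed from the current Rails schema."""
    since = datetime.now(timezone.utc) - timedelta(hours=hours)
    with conn.cursor() as cur:
        cur.execute(
            """SELECT COUNT(*), COALESCE(SUM(num_changes), 0)
               FROM changesets
               WHERE user_id = %s AND created_at > %s""",
            (user_id, since),
        )
        changesets, total_changes = cur.fetchone()
    return changesets, total_changes
```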
Those figures should be checked in addition at the time when a new changeset is being created, rather than only during the upload itself (I think the reasons are clear). A meaningful response should of course include the estimated waiting time, otherwise the error message won't make much sense to a user. Every endpoint subject to limitations should also return the remaining quota in each call. (some ideas: https://stackoverflow.com/questions/16022624/examples-of-http-api-rate-limiting-http-response-headers)
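For illustration, headers along the lines of the stackoverflow examples above; none of these are implemented in the API today, and Retry-After is the only standard HTTP header among them:

```python
# Hypothetical rate-limit headers a limited endpoint could return:
headers = {
    "X-RateLimit-Limit": "10000",    # changes allowed per window (t.b.d.)
    "X-RateLimit-Remaining": "250",  # quota left for this user
    "Retry-After": "1800",           # seconds until a retry may succeed
}
```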
I would consider anything that requires any sort of domain knowledge (e.g. counting "number of houses") to be out of scope for the API. If a user uploads tons of objects with rather uncommon tags, the editor app should take care of that.
Still, the overall approach as it stands has a massive negative impact on people trying to clean up failed imports from other mappers or simply revert vandalism. I want to see a bit more discussion how those negative effects can be minimized, if not completely avoided.
I suggested in my initial message that there could be a mechanism whereby people could - perhaps temporarily - lift the limit in a self-service fashion. That would go a long way to ensure that if someone knows what they're doing they will not be hindered, but if someone does not know what they're doing they will be.
I agree that all the API can do is count the number of modifications, not what exactly something is, and I never intended it to; I mentioned houses just as an example. If someone uploads a hundred thousand trees (or, also a popular option, a hundred thousand untagged objects), or deletes a hundred thousand houses, that's just as bad.
(adding https://github.com/bryceco/GoMap/issues/113#issuecomment-384604159 as an example of an app creating 800 empty changesets in 3 minutes).
@mmd-osm, thank you for the link. This emphasises the need to have the limits in the API - even if we could get all editors to implement internal limits (plenty hard), a simple bug (perhaps in the limiting code itself...) could still cause a changeset storm.
Closed in favour of zerebubuth/openstreetmap-cgimap#207
@woodpeck This issue was closed, with a request to reopen this one: https://github.com/zerebubuth/openstreetmap-cgimap/issues/207#issuecomment-553039871
I have outlined in the cgimap issue why this topic should be dealt with exclusively on the Rails side at this time:
Here's a recent case where someone uploaded 3 million nodes and 300k ways in one day without knowing what they were doing, all to be reverted again: https://www.openstreetmap.org/changeset/76868869 (and older).
Here is my suggestion. Implement a process whereby 2 or more users with a minimum account age can be petitioned. Upon acceptance by the minimum quorum, a one-time-use token permitting the holder to exceed the normal limits is generated and attached to the upload. Or something along those lines. There could also be more granular limits. For instance, local user group members could be permitted to grant tokens for their local area. Larger changes could require a national user group token. I would be in favor of applying granular limits on a location basis to protect from vandalism.
Another suggestion is for large changesets to be placed in an approval queue, which again could be done on a local or national user group basis depending on the size of the change.
Approval queues have been suggested by a number of people in the past. I believe the concept would be very expensive to implement, it would cause quite some disruption, hence it is probably not feasible for the time being.
Besides, if there's no one to review changes in time, then those queued changes would become stale and conflict with other changes that have been made in the meantime. Good luck resolving those.
@Chaz6 These are actually very good suggestions, and very similar to what Waze does with their editor levels (advanced editing features are locked until you attain a certain editor score).
It's not something we can easily add to OSM, given the project's inertia, but if I were starting over from scratch I'd probably implement something similar from the get-go.
@bhousel : you could still do this entirely in iD, and only enable some features after a user passes the respective parts of the tutorial, or by using some other metrics. After some time, you could encourage a user to learn something new, and then enable that feature. It's a bit of an adaptive tutorial mode.
If you want to talk to me about iD, please PM me. I'm not really interested in starting a conversation about things we could add to iD, and this ticket isn't about iD anyway.
I was just responding to a person who I thought had some good suggestions and was getting some crappy feedback.. I'll unfollow this thread now.
Approval queues have been suggested by a number of people in the past. I believe the concept would be extremely expensive to implement, it would cause massive disruption all over, hence it is simply not feasible. Besides, there's no one to review changes in time, and then those queued changes become stale and conflict with other changes that have been made in the meantime. Good luck resolving those.
I completely disagree about review queues for changesets. We already have a need for send-and-poll for changeset uploads, as I've described in a blog post and discussed with various editor developers. If that was implemented, then there is an opportunity for changesets to be reviewed before being applied. I would intend for that to be default-approve, immediately at first, and then with longer windows (e.g. 1 minute) as we develop processes and capacity to handle them. We could have longer timeouts based on changeset size, user status, all kinds of things.
But unfortunately this is all pie in the sky while we have so little development capacity available. So let's label this as "future" and return to more achievable goals.
We could do with having some default limits for all kinds of things - number of diary entries per day, number of gpx uploads, number of notes closed per day, number of changesets per day. It's not sustainable having all of these as "completely unlimited". We should add some safety rails.
This isn't a new concept, we already have it for messages per hour, for example.
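To make the idea concrete, a strawman set of defaults; every number is t.b.d. and would presumably live in the Rails configuration rather than in code:

```python
# Purely illustrative per-day safety rails; names and values are invented.
DAILY_LIMITS = {
    "diary_entries": 25,
    "gpx_uploads": 100,
    "notes_closed": 500,
    "changesets": 100,
}
```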
We already have a need for send-and-poll for changeset uploads, as I've described in a blog post
I think I disagree with this part of your blog. Uploading & storing the osmChange message and then polling for updates would take more time than simply waiting for the result for at most a few seconds, now that the changeset upload is no longer handled by Rails. This also works with compressed uploads on 2G mobile networks (GSM) - I tested this scenario successfully.
Maybe I should have written that upload queues are not feasible at this time, as there's really a lot of work involved in making that happen. And here I agree with you again: this is all pie in the sky at this time.
I agree with bhousel (minus the unnecessary negativity towards the pace of development and the i-am-publicly-unfollowing-this-now-because-you-are-not-worthy-of-my-time antics): This is not something we should be adding to editors because every editor will implement their own concept of what users to trust with what level of edits, and script writers will just circumvent any such limitations anyway. It is something we should have on a basic API level, just as we limit the maximum number of changes in a changeset.
And with "it" I mean a basic rate limiting for changes uploaded at first, where users only make e.g. 10k edits a day. Or make that 100k, and later reduce the limit as we find the time to code exceptions like mentioned above, where e.g. a well planned import process can receive an override token or whatever.
Uhm, I thought Waze was locking down certain types of features, which I find very hard to do on the API level, hence my suggestion to look at it more from an editor point of view.
I think we agree that only considering the number of edits is a viable option for the API.
WRT editor integration, repeating what I pointed out on the relation size limit issue, and elsewhere: any kind of limit/block/whatever that could potentially cause the contributor to lose their work needs to be detectable by the editing application in advance. This will allow an app to stop the user in a suitable form from continuing to edit, or at least point out the potential frustration.
This ranges from maximum upload limits as originally discussed, over changeset review queues (which for example would need to block further contributions from editors in the same area), to any kind of data model limits (relation member size limits for example).
Unless we have time and resources to implement something fancy, may I suggest to temporarily shelve anything that takes more than 2 weeks real time to get on osm.org?
I love overengineering things, and https://xkcd.com/974/ . But:
a) There's a high risk that this will be heatedly debated 2, 3, 5 years from now.
b) Something simple in place might prevent only a fraction of issues, but that's more than nothing.
c) Something simple is easier to extend [if following a sane design].
For example, let's start with a hardcoded limit, set in the DB. Then it could be extended with per-user overrides, still directly in the DB. If somebody gets to it. Then some admin interface could be added. If somebody gets to it.
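A sketch of that progression; the user_rate_limits table and its column are invented here, nothing like it exists in the schema yet:

```python
def edits_per_day_limit(conn, user_id: int) -> int:
    """Return the user's edit quota: a per-user override if one exists,
    otherwise the hardcoded global default. Illustrative sketch only."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT max_edits_per_day FROM user_rate_limits WHERE user_id = %s",
            (user_id,),
        )
        row = cur.fetchone()
    if row is not None:
        return row[0]  # per-user override, e.g. for a planned import
    return 10000       # global default; the actual number is t.b.d.
```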
people — usually, but not always new signups — upload hundreds of thousands of objects to OSM before someone notices and tells them to stop.
[Emphasis mine.]
So make the limits conditional upon account age, then.
This would also discourage the mischievous from opening new accounts in order to circumvent restrictions.
we should really focus a bit more on UX and discuss ways on how to guide (new) users without patronizing them too much.
Definitely. See openstreetmap/iD#8590 (comment) & subsequent discussion.
Just a small reminder that blindly limiting by account age would negatively affect proper imports that create dedicated accounts. Solutions to that were discussed earlier.
Given that a proper import should be documented (including the account that will be used) and wait some time for feedback from the community, proper imports will not be impacted.
This issue was mentioned in the community forum: https://community.openstreetmap.org/t/dwg-username-impersonation/101212
Preamble: I am aware that repository maintainers are aware of the facts stated in this comment. This comment is an attempt to summarize what was posted so far and to propose some specific limits. I made it in the hope that it will be useful.
So we have a few types of edits:
For specific limits:
Costs here are additional complexity, also for editors that need to handle this, mostly by providing a new error message (once it starts being developed it would make sense to ping at least the iD maintainer, and likely other editor developers too). The same goes for API libraries.
Is it plausible for a new legitimate user to edit more than 5000 elements, try to submit that as their first edit, and become stuck?
Maybe it would be feasible to reduce the limits by a factor of 10 and count each tagless node as 1/100 of an object? That would reduce a vandal bot's impact without the risk of blocking new people mapping landuse.
Is it possible to assign a greater penalty to empty edits? These are associated with buggy software and vandals. For example, by counting opening a changeset as 5, then reducing the count by 4 once the first data is sent.
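A sketch combining both weighting ideas above; the weights are the strawman numbers from this comment, not agreed policy:

```python
def changeset_cost(changes: list[dict]) -> float:
    """Weighted change counting: opening a changeset costs 5, refunded
    down to 1 once real data arrives; tagless nodes count as 1/100."""
    cost = 5.0                                   # empty-changeset penalty
    for i, change in enumerate(changes):
        if i == 0:
            cost -= 4.0                          # refund on first real data
        if change["type"] == "node" and not change.get("tags"):
            cost += 0.01                         # tagless node: 1/100 object
        else:
            cost += 1.0                          # everything else: full object
    return cost
```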
For the future: a rate-limit-exempt flag. Would it make sense to discuss with the community/DWG/whoever is relevant how it would be assigned? Or is it better to wait for at least an initial implementation?
Would it also make sense to include an IP-based rate limit, not only a per-account rate limit?
Considered: a button to temporarily bypass rate limits (complex, would help in a small subset of cases); taking into account OSMF membership status (not happy about this starting to give any privileges in mapping); date of registration - reduced limits for new accounts (additional complexity, seems easy to work around, but may make sense).
(If the proposal above is too long/too complex, let me know and I will try to make a more compact one keeping the most important parts.)
Initially, I thought this could be managed by the "changeset create" endpoint only, which is implemented both on CGImap and Rails (we're using the Rails version in production).
However, since we have apps like StreetComplete, we need to be careful to have a sane limit on the number of currently open changesets at any one time. If this limit is too low, it will break StreetComplete. Also, changesets might live up to 24 hours if you keep on uploading small changes over a period of multiple hours, i.e. we need to be more precise about what exactly "changesets/h" means.
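To make the ambiguity concrete, the two readings as illustrative queries against the existing changesets table (thresholds and the open-changeset condition are assumptions, not the actual schema semantics):

```python
# Reading 1 - rate of creation: what a changeset storm would trip.
OPENED_LAST_HOUR_SQL = """
SELECT COUNT(*) FROM changesets
WHERE user_id = %s AND created_at > NOW() - INTERVAL '1 hour'
"""

# Reading 2 - concurrently open: what StreetComplete-style apps need
# headroom for, since they keep several changesets open at once.
CURRENTLY_OPEN_SQL = """
SELECT COUNT(*) FROM changesets
WHERE user_id = %s AND closed_at > NOW()
"""
```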
Without applying additional changes to CGImap, I don't see how we could possibly limit the number of max. changes for each of these open changesets to less than 10k changes, or have a limit on the total number of changes per hour.
Bottom line, these changes are much more difficult to implement than it might appear.
IMHO the limit should be based on the (total) number of objects changed, not on changesets.
Naturally this doesn't solve the related changeset issues, but I don't think a changeset being "unexpectedly" closed is actually one of them, as all editors already need to be able to deal with that in any case (even if they don't explicitly close changesets).
What does need investigating (and this is something we can do now) is what is the most user friendly way to abort/refuse an upload in the current editor/API constellation. "user friendly" as in legit edits remain in the editor and are not lost.
"user friendly" as in legit edits remain in the editor and are not lost.
Good question, to which I don't have an answer right now. It might be something based on HTTP 409 or HTTP 412, and include an error message stating the number of minutes the user has to wait before retrying, and maybe the current limits that are relevant for this particular user. In any case, any editor currently applying some regex magic to extract meaningful data out of these error messages might face some issues here.
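One possible machine-readable shape for such a rejection, so editors don't have to regex the message; the status code choice, field names, and values are all invented:

```python
# Hypothetical JSON body accompanying the error status:
rejection = {
    "error": "upload_rate_limited",
    "retry_after_seconds": 1800,    # estimated waiting time
    "limits": {
        "changes_per_hour": 10000,  # the limit relevant for this user
        "remaining": 0,
    },
}
```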
Besides, let's not forget that we need to cover the single object endpoints on Rails as well (next to the diff upload on both Rails and CGImap). Also, proper database locking is required to avoid bypassing the limits by running multiple uploads in parallel.
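A sketch of the locking point, taking a row lock so that two parallel uploads by the same user cannot both pass the quota check; the upload_quota table is invented:

```python
def reserve_quota(conn, user_id: int, requested: int, limit: int) -> bool:
    """Atomically reserve quota for an upload. Assumes one quota row per
    user; FOR UPDATE serialises concurrent uploads by the same user."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT used FROM upload_quota WHERE user_id = %s FOR UPDATE",
            (user_id,),
        )
        (used,) = cur.fetchone()
        if used + requested > limit:
            return False  # would exceed the limit: reject the upload
        cur.execute(
            "UPDATE upload_quota SET used = used + %s WHERE user_id = %s",
            (requested, user_id),
        )
    return True  # caller commits this together with the diff upload itself
```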
Maybe it would make sense to implement all this on Rails first, and test it on a new dev Rails instance. Once we are somewhat confident that it's ready for production, we'll do the port to CGImap.
Overall, I'd also like to mention that I didn't see a thorough analysis of potential attack vectors, still we're already discussing some detailed solutions here, which feels to me much like an ad-hoc attempt at fixing things. That's not exactly the kind of working model I'd like to see for security topics.
Bottom line, these changes are much more difficult to implement than it might appear.
If I considered them simple, I would submit a PR rather than writing hopefully-useful comments :)
Overall, I'd also like to mention that I didn't see a thorough analysis of potential attack vectors, still we're already discussing some detailed solutions here, which feels to me much like an ad-hoc attempt at fixing things. That's not exactly the kind of working model I'd like to see for security topics.
Is it about dedicated attacks, like initiating multiple huge uploads at the same time, or opening multiple changesets and keeping them alive to create a big stack and use them all at once?
Or about the kind of attacks we want to stop (a bulk vandal bot like the latest attack, a badly done import, etc.)?
Or is it about exploiting this mechanism to cripple a regular user who is not malicious (false reporting etc.)?
Or about something else?
It regularly happens that people - usually, but not always new signups - upload hundreds of thousands of objects to OSM before someone notices and tells them to stop. Then we have to delete those hundreds of thousands of objects again. (Case in point from recent past, https://www.openstreetmap.org/user/maxiangying.) This is undesirable:
While editors can, and should, inform their users about potential issues, I think it would also be worth contemplating to have some sort of rate limit on the API. It could be something that users can override but not accidentally - for example, you could be normally limited to X edits per day (exact numbers t.b.d.) and then you could click a button in your user preferences that says "I have read the data import and mechanical edit guidelines and I want to lift the limitation for one week" or so.