openstreetmap / openstreetmap-website

The Rails application that powers OpenStreetMap
https://www.openstreetmap.org/
GNU General Public License v2.0

API for user blocks #1618

Open willemarcel opened 7 years ago

willemarcel commented 7 years ago

In OSMCha, we are adding a suspicion reason "User has multiple blocks" to changesets whose creator has more than one block. I believe it would be better if we considered the time elapsed since the last block and the number of changesets the user has. That way, we wouldn't flag changesets of users who were blocked many months ago, have since created a lot of changesets and have probably learned good practices.
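The rule proposed above could be sketched roughly as follows. This is only an illustration, not OSMCha's actual code: the function name `should_flag`, its parameters, and both thresholds are hypothetical, since the thread does not specify any values.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical thresholds; the thread does not specify any values.
QUIET_PERIOD = timedelta(days=180)
MIN_CHANGESETS_SINCE_BLOCK = 100


def should_flag(block_count, last_block_end, changesets_since_block, now=None):
    """Decide whether to attach the "User has multiple blocks" reason.

    block_count: total number of blocks the user has received.
    last_block_end: timezone-aware datetime when the latest block expired,
        or None if unknown.
    changesets_since_block: changesets created after that block expired.
    """
    now = now or datetime.now(timezone.utc)
    if block_count <= 1:
        # A single block (or none) is never flagged.
        return False
    if last_block_end is None:
        # Without a date we cannot apply the relaxation, so keep flagging.
        return True
    long_ago = now - last_block_end >= QUIET_PERIOD
    active_since = changesets_since_block >= MIN_CHANGESETS_SINCE_BLOCK
    # Skip the flag only when the block is old AND the user has kept editing.
    return not (long_ago and active_since)
```

Both inputs beyond the block count (the expiry date of the last block and the changeset count since then) are exactly the data that is not currently available to a tool like OSMCha, which is what motivates the request below.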

For this we would need an API for user_blocks. Do the maintainers have anything against having an endpoint to list user_blocks in general and one to list the user_blocks of a specific user? I can work on this if you agree we can have those endpoints.

simonpoole commented 7 years ago

When I suggested adding this to OSMCha I was aware of the issue and thought a bit about how to resolve it.

The easy, low impact variant is to add a last_block attribute with the date the block expired. This can be done without introducing a new API call.

willemarcel commented 7 years ago

@simonpoole Yes, that could solve it as well.

woodpeck commented 7 years ago

User blocks do not necessarily have anything to do with the user making bad edits. Someone could have made bad edits three weeks ago which need to be reverted but they won't get a block since the issue doesn't call for it. Another person could be making the same edits today and we block them in order to stop them. A third person could be making totally legitimate edits but due to a misconfiguration not see emails generated by changeset comments, and receive a block because that's the only way to get their attention.

I wouldn't like seeing blocks used to treat a user as "more suspicious" because this would, in turn, lead to calls for removing past block information from users who have "done nothing wrong"...

willemarcel commented 7 years ago

@woodpeck I agree almost completely with you. We are adding information to the changesets to help other users to prioritize which edits should receive attention and be reviewed. We removed the word "suspicion" from the frontend in the hope that the users don't see it as an accusation.

We only flag users that have more than one block, so if someone had a misconfigured email address and received only one block, it will not affect them. The third case you mentioned shouldn't really be treated as more "suspicious", although I believe this kind of block is an exception.

Moreover, I think you'll agree that communication is an important aspect of a collaborative project, and users must take care to have a correct email address configured and to answer other contributors' questions.

woodpeck commented 7 years ago

We are currently exposing that information already, just on a web site targeted at humans; from a data protection viewpoint an API would not make a difference.

pnorman commented 7 years ago

I'm slightly in favour of adding an API, because there are useful use cases for the data. I don't consider what OSMcha wants to do as one of them, but that's a discussion for the osmcha issue tracker.

gravitystorm commented 7 years ago

I'm happy to add an API, but I first want to check that a data dump isn't more appropriate. I'd like to avoid services making multiple API requests for user_block information (e.g. one request per changeset analysed) since this doesn't scale well. If the information is in a file that's much easier to scale than an API call.
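The scaling concern can be illustrated with a small simulation. It is a sketch only: `fetch_block_count` is a hypothetical stand-in for a network request and returns canned data, and the changeset records are invented.

```python
from functools import lru_cache

calls = []  # records every simulated request


def fetch_block_count(uid):
    """Hypothetical stand-in for one API request per user; canned data."""
    calls.append(uid)
    return {100950: 0, 42: 2}.get(uid, 0)


# Remember the answer per user so repeated changesets do not trigger
# repeated requests.
cached_block_count = lru_cache(maxsize=None)(fetch_block_count)

changesets = [
    {"id": 1, "uid": 42},
    {"id": 2, "uid": 42},
    {"id": 3, "uid": 100950},
    {"id": 4, "uid": 42},
]

block_counts = {cs["id"]: cached_block_count(cs["uid"]) for cs in changesets}
print(len(changesets), "changesets,", len(calls), "requests")
```

Caching per user reduces the request count from one per changeset to one per distinct user, but the number of requests still grows with the number of users analysed. With a dump, the lookup function would read from a locally loaded table and make no requests at all.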

Are there any existing dumps on http://planet.openstreetmap.org/ where adding user_block information would be appropriate? Or would we need a brand new dump?

simonpoole commented 7 years ago

@gravitystorm OSMCha, and probably every other use case that needs to access any user information, already hits the user endpoint (at least once), for example http://api.openstreetmap.org/api/0.6/user/100950. I suspect referring to a dump of the data (which currently doesn't exist) is not going to be very popular (not to mention premature optimisation etc).

Even if my suggestion of simply adding a last block expired timestamp attribute is considered too hackish, a -separate- API doesn't seem to make sense when additional information can simply be added to the already existing blocks section in the user API (even in the worst case we are not talking about more than a handful of entries).
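A minimal sketch of what a client would do with that suggestion follows. The embedded XML is an invented sample shaped like the user endpoint's response as I understand it, with a `blocks` section containing a `received` element. The `last_block` attribute is hypothetical: it is the addition proposed in this thread, and placing it on `received` is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

# Invented sample data. The last_block attribute does NOT exist in the
# API; it is the proposal from this thread.
SAMPLE = """<osm version="0.6" generator="OpenStreetMap server">
  <user id="100950" display_name="example" account_created="2009-01-01T00:00:00Z">
    <changesets count="1"/>
    <traces count="0"/>
    <blocks>
      <received count="2" active="0" last_block="2017-03-01T00:00:00Z"/>
    </blocks>
  </user>
</osm>"""


def parse_blocks(xml_text):
    """Extract block information from a user details document."""
    received = ET.fromstring(xml_text).find("./user/blocks/received")
    if received is None:
        return {"count": 0, "active": 0, "last_block": None}
    return {
        "count": int(received.get("count", 0)),
        "active": int(received.get("active", 0)),
        # None when the (hypothetical) attribute is absent.
        "last_block": received.get("last_block"),
    }


print(parse_blocks(SAMPLE))
```

Because the extra attribute is simply ignored by clients that do not look for it, this variant needs no new endpoint and no additional request beyond the one already made.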

gravitystorm commented 7 years ago

> every other use case that needs to access any user information already hits the user endpoint (at least once)

That's not a good thing. What information are we not putting into the various planet/history/changeset etc dumps?

> not to mention premature optimisation etc

It's an important design philosophy. Making access available to all the OSM data via dumps is in stark contrast to many other "open data" projects where you need to scrape an API to get the whole data set.

I can understand that it is often easier to use an API than to process dumps, but if scraping our API is the only way to get some particular information then I think it's important for us to fix.

simonpoole commented 7 years ago

1st wrt data not present in other places: account_created, contributor-terms-agreed, role, changesets count, traces count, number of blocks and the gravatar link. Some of this can naturally be calculated from changeset dumps and the like.

2nd the design philosophy clearly applies to OSM geo data, we've never extended that to user data and there are some arguments why it would not be the same (this goes back to the licence change). But regardless of that, in the context of the suggestions of completely removing user information from dumps accessible to the general public (see @woodpeck discussions on the topic), it would seem to be unwise to extend that specific can of worms any more before that is decided (for example because of the right to be forgotten DP rules).

gravitystorm commented 7 years ago

@simonpoole could you link to these "woodpeck discussions" please, and clarify who is going to decide what, and on what timescale. Your last paragraph is a bit too vague, and sort of throws cold water over any development on this topic!

simonpoole commented 7 years ago

Timescale: I would expect that if there are any changes, they are likely to happen by mid next year (GDPR introduction).

Not saying that there will be drastic change, just that there has been a wide range of proposals from completely stripping all information that allows association with specific individuals from the data we provide publicly (changesets, uid, display_names, timestamps and so on), to doing nothing at all.

The LWG has funds allocated for investigating the legal obligations side of things, which is a bit independent of what @woodpeck has been doing (which goes back iirc to a BoF at this year's FOSSGIS conference). I've asked him to comment directly.

In the end, however, my point was just that it is simply far easier to control information we supply via an API than data that we provide as dumps. The OSMF is definitely not obliged to provide dumps of user metadata, and if that were done, we would need to agree on what terms it would happen.

woodpeck commented 7 years ago

I'm not sure this needs to have any bearing on what we are discussing here. The gist of what @simonpoole refers to as "woodpeck discussions" is, in one sentence: We might at some point in time want, or even have to, restrict the display and distribution of contributor user names to people who have "signed" that they will only use this data for OSM-related purposes. This would, for example, mean that the OSM API would only return data with user names to logged-in users (who, as a condition of signup, have agreed that they will only use them for OSM purposes), and it would mean that the normal, unrestricted, public planet file would not contain user names (while a second one that you could download after logging in or clicking some "I agree" button or so would).

This would not protect our contributors from being stalked through OSM data (after all, you can see from our data who was awake and at the computer at what time, sometimes even who was in a particular area at a particular time), but it would at least make such stalking violate our rules (while currently it is totally fine with us to milk the planet file for all it's got about a particular individual and put that on a web site, duly ODbL'd). It would send a clear signal from the project that this data is only meant (and necessary, think QA) for OSM project purposes, and make us less vulnerable to data protection complaints.

(Unlike with the geodata itself, we never promised anyone that we would distribute the metadata under ODbL so license-wise this could be done.)

This discussion originated in a data protection BoF at FOSSGIS in Passau, and has motivated Pascal Neis to make his HDYC tool "login only" (after he had also received, unrelated, legal threats from mappers who were surprised to see what they considered personal data published openly). The matter was discussed in English on the talk list here: https://lists.openstreetmap.org/pipermail/talk/2017-May/077940.html

I think it is an interesting and important topic, but IMHO it doesn't belong here; if we should ever decide to limit username information to logged-in users, we'll simply have to revisit all API calls and modify them accordingly, and it won't make a difference if there's one more or less.