streetcomplete / StreetComplete

Easy to use OpenStreetMap editor for Android
https://streetcomplete.app
GNU General Public License v3.0

Synchronization of statistics (tie star count to OSM account) #1485

Closed matkoniecz closed 4 years ago

matkoniecz commented 5 years ago

Question to answer: how access to server with statistics would be authenticated?

Is it better to have a central server or a serverless version? (Almost certainly a server, though the authentication question has to be answered first.)

Is there something that can be improved at the specification stage? Is there something that should be handled and is missing?

It was previously reported as #188 - I opened a new issue as it is the first step toward actually coding this feature.


Currently the star count is the only statistic available and it is stored locally. As a result, changing phones or reinstalling the application resets the star count.

It would be nice to share the star count between devices.

It is desirable to also get and synchronize other statistics, not only the total star count. It would be nice to know how many quests of each type were solved, or the star count limited to some specific area. This would allow properly adding badges/achievements based on these new statistics, which would be shared across devices.

This data can be retrieved from OSM changeset history.


Serverless version

It can be done without a central server, but...

The problem is that it would be


Central server

An alternative is to introduce a central server. Individual phones would call it to get computed info.

I think that it may be a good way to speed up data generation and reduce calls to the OSM API.


Specification of the server

It would be written in PHP (due to the inability to run other software on the available hosting). It would use curl to communicate with the OSM API (curl is confirmed to be available). There is also file_get_contents, see https://www.php.net/manual/en/function.file-get-contents.php (fopen is enabled).

The following was copied from an email by @westnordost and is published with permission:

The script will remember the date of the most recent changeset made by the user that it knows of. If not set, that would be the "date of birth" of StreetComplete - 20.02.2017. Then, on each update request made by the client, it walks its way towards "now". So, in this manner, the API will be lazy. That is okay and does not affect how up to date the data shown in the app is, because the app itself will also keep track of every answered quest, so it also stores a local version of the data stored on the server side.

How calling OSM API would work (the same applies to a potential serverless version):

cascade of for-loop calls: Find all changesets by user, paginate through it because only 100 are shown at a time, for all changesets that are created_by StreetComplete, download the whole changeset (ugh...) and count the number of unique elements affected. Yes, I noticed the new "changes_count" attribute (see https://api.openstreetmap.org/api/0.6/changeset/70163079), but how are reverted changes counted? My guess is, a revert is counted as a change, so one element answered 10 times and reverted 10 times results in 20 "stars", not 0. Perhaps a new attribute could be added on the side of the OSM Api developers, so that each changeset does not need to be downloaded one by one, but only the meta information. The "ugh"-part is a strong reason not to do it in the client itself, because this will be a lot of traffic. The phone may be on mobile data connection.

Note: changesets in the returned list are "ordered by creation date" https://wiki.openstreetmap.org/wiki/API_v0.6

Simple case of revert in changeset (not spread over separate changesets): https://api.openstreetmap.org/api/0.6/changeset/72331586/download
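
Counting the unique elements affected by one changeset could look roughly like the sketch below. It is written in Python rather than the PHP planned for the server, the sample osmChange document is hand-written rather than real API output, and the function name is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written osmChange document (not real API output).
# Way 10 is modified twice, as it would be for an answer followed by a revert.
SAMPLE_OSMCHANGE = """<osmChange version="0.6">
  <modify>
    <way id="10" version="2" changeset="1"/>
    <node id="5" version="3" changeset="1"/>
  </modify>
  <modify>
    <way id="10" version="3" changeset="1"/>
  </modify>
</osmChange>"""


def count_unique_elements(osmchange_xml):
    """Count distinct (element type, id) pairs touched in one changeset."""
    root = ET.fromstring(osmchange_xml)
    touched = set()
    for action in root:          # <create>, <modify> or <delete> blocks
        for element in action:   # <node>, <way> or <relation> entries
            touched.add((element.tag, element.get("id")))
    return len(touched)


print(count_unique_elements(SAMPLE_OSMCHANGE))  # prints 2
```

An element that is answered and then reverted still counts once here, not zero. Detecting reverts would additionally need a comparison of the tags between versions.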

API calls for the new server

  1. GET just my stars (for quest counter)
  2. GET number of stars by quest type (returns map of quest type name -> count)
  3. (GET number of stars by country code)
  4. (maybe more ...)

I think that responses should also include "as of" date.

The tables could be something like

  1. users with columns: user id (primary key), last updated date
  2. quests with columns: user id, quest type, count
  3. (countries with columns: user id, country code, count)
  4. (maybe more ...)

I am thinking that storing more raw data may be preferable.

  1. SC changesets: changeset id (primary key), user id, quest type, star count earned in it, bbox
  2. users: user id (primary key), last updated date (or last processed changesets)

I think that with such a format it will be easier to debug what went wrong in case of data corruption, and it will be easier to add new queries. For example "is there a week with at least one quest solved on each day" (regular repeated use may be a better metric than just edit count).
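
A rough sketch of the raw-data layout and the "full week" query, using Python's sqlite3 for illustration instead of PHP and MySQL. The table and column names are assumptions, the created_on column is an addition needed for date-based queries, and "week" is read here as any seven consecutive days.

```python
import sqlite3
from datetime import date, timedelta


def create_schema(conn):
    """Create the two raw-data tables (names and types are illustrative)."""
    conn.executescript("""
        CREATE TABLE users (
            user_id      INTEGER PRIMARY KEY,
            last_updated TEXT
        );
        CREATE TABLE sc_changesets (
            changeset_id INTEGER PRIMARY KEY,
            user_id      INTEGER NOT NULL,
            quest_type   TEXT NOT NULL,
            star_count   INTEGER NOT NULL,
            created_on   TEXT NOT NULL,  -- ISO date, added for date queries
            bbox         TEXT
        );
    """)


def has_full_week(conn, user_id):
    """True if the user has changesets on seven consecutive days."""
    rows = conn.execute(
        "SELECT DISTINCT created_on FROM sc_changesets WHERE user_id = ?",
        (user_id,),
    )
    days = {date.fromisoformat(row[0]) for row in rows}
    return any(
        all(start + timedelta(days=i) in days for i in range(7))
        for start in days
    )


conn = sqlite3.connect(":memory:")
create_schema(conn)
for i in range(7):
    conn.execute(
        "INSERT INTO sc_changesets VALUES (?, 1, 'AddWayLit', 1, ?, NULL)",
        (i, (date(2020, 3, 10) + timedelta(days=i)).isoformat()),
    )
print(has_full_week(conn, 1))  # prints True
```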

Not sure whether it will be useful to add indexes/tables with data served as responses.

There should also be a database on the phone, with the stars received and the date when each was received. This way it will be possible to merge this data with responses of the form "user had $X stars as of $DATE".
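
The merge could work roughly as in the following sketch, assuming the phone keeps one timestamp per locally solved quest and the server response carries a count plus its "as of" date. The function and parameter names are hypothetical.

```python
from datetime import datetime, timezone


def merged_star_count(server_count, server_as_of, local_solve_times):
    """Server count plus local solves the server snapshot cannot know about."""
    newer = [t for t in local_solve_times if t > server_as_of]
    return server_count + len(newer)


as_of = datetime(2020, 3, 22, 12, 0, tzinfo=timezone.utc)
local = [
    datetime(2020, 3, 22, 11, 0, tzinfo=timezone.utc),  # already on the server
    datetime(2020, 3, 22, 13, 0, tzinfo=timezone.utc),
    datetime(2020, 3, 22, 14, 0, tzinfo=timezone.utc),
]
print(merged_star_count(100, as_of, local))  # prints 102
```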

mmd-osm commented 5 years ago

This has been suggested multiple times in the other issue, but obviously nobody took any notice of it: my suggestion would also be to use the https://wiki.openstreetmap.org/wiki/API_v0.6#Preferences_of_the_logged-in_user service, like overpass turbo does, and you're done. The only thing you can do is to cheat on yourself, which is kind of pointless anyway. This information is visible for your own user, regardless of where you log on.

So there's really no need for any local server, or no additions to the API ("Perhaps a new attribute could be added on the side of the OSM Api developers," is not going to happen, sorry).

I've seen the comment above "but it is not good idea for reasons not listed because I have again written a book in an issue", but I don't think the alternatives are really feasible.

matkoniecz commented 5 years ago

Thanks for the feedback!

So there's really no need for any local server

That would be great news, I will look at this competing solution again (and end writing the next chapter of the book).

I've seen the comment above "but it is not good idea for reasons not listed because I have again written a book in an issue"

tl;dr: storing data in user preferences means that there is no reasonable way to handle data corruption and it is tricky to support data already lost by users reinstalling application.

There are the following problems:

  1. Maybe I care too much about the name, but it is not really a user preference.
  2. Once things go wrong and invalid data is saved there I see no reasonable way to fix this.
  3. unfortunately at least some people would be negatively impacted as they already switched phones/reinstalled the application and their stars would be "lost"
  4. StreetComplete would need to ask for one more access right from people
  5. easy solution (read/increment edit count) is going to fail with editing on more than one device (and conflicts are more likely to happen than may be expected - when a user connects to a mobile network or wifi, multiple devices are likely to connect at the same time)

Looking at problems again:

  1. But maybe, as long as I do not try to store 120GB of my backup data there, it is OK to store data in 100 keys directly used by the editor?
  2. Write this properly, avoid bugs and have some special override for cases of mangled data? After all, the worst case is that people will merely cheat on themselves.
  3. This is the worst problem. Maybe run a script for statistics generation once and embed this data into the app to allow restoration of "lost" stars? (uid + edit_count) * user_count may not be so prohibitive? How many SC users are there? How much app download size can be wasted on this?
  4. Just ask for permission? Not sure how many users will be lost as a result of this. 0.1%? 1%? 10%? Or do not ask now, and ask only when the app is first used on a new device (it would be painful when combined with permanent loss of statistics predating usage of preferences)
  5. each device has its own preference key for each saved preference? It will still run into equivalents of https://superuser.com/questions/1258721/reasons-not-to-assume-device-mac-address-is-unique or into synchronization issues - but maybe it is OK to ignore this? Or just assign sequential ids and assume that there will be no race condition with simultaneous logging in for the first time on multiple devices.
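
The per-device key idea from point 5 can be sketched as follows, treating the preferences as a plain string-to-string map. The key prefix and device ids are hypothetical, and the sketch leaves out how device ids get assigned, which is the hard part.

```python
# Hypothetical key prefix; one key per device avoids read/increment conflicts
# because each device only ever writes its own key.
PREFIX = "streetcomplete.stars."


def record_device_count(preferences, device_id, count):
    """Return a copy of the preferences with this device's counter updated."""
    updated = dict(preferences)
    updated[PREFIX + device_id] = str(count)
    return updated


def total_stars(preferences):
    """Sum the counters of all devices, ignoring keys of other apps."""
    return sum(
        int(value)
        for key, value in preferences.items()
        if key.startswith(PREFIX)
    )


prefs = {"some.other.app": "x"}
prefs = record_device_count(prefs, "phone-a", 120)
prefs = record_device_count(prefs, "tablet-b", 30)
print(total_stars(prefs))  # prints 150
```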

Overall, the main problem is that it is basically using an external database without admin access to it. There is no way to initially populate it and there is no way to debug/fix it once things start failing.

And it is mostly caused not by storing preference data there but by attempting to store there data that would not be (easily) recoverable once lost.

And things like "is there a week with at least one quest solved on each day" would be basically impossible to do while easy to add to a database.

The only thing you can do is to cheat on yourself, which is kind of pointless anyway

As long as no public leaderboards exist it should be no problem at all. And public leaderboards are, at least in my opinion, not worth the trouble given the type of behavior they encourage. And anyway, one may just parse the changeset planet file to build them if someone really wants that.

westnordost commented 5 years ago

I also find the solution of a small backend (php script + a few tables) the best, it is also the most flexible, as the solved quests could be sorted by country/city, the user's "streak" (consecutive mapping days) could be shown, or "indirect" leaderboards like "you are among the top 10 contributors (for quest type WX (in YZ))".
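
As an illustration of the "streak" statistic, here is a small Python sketch that finds the longest run of consecutive mapping days from a list of dates. The function name is made up, and the dates are assumed to come from the changeset creation dates.

```python
from datetime import date, timedelta


def longest_streak(active_days):
    """Length of the longest run of consecutive days in active_days."""
    days = set(active_days)
    best = 0
    for day in days:
        if day - timedelta(days=1) in days:
            continue  # not the first day of a run
        length = 1
        while day + timedelta(days=length) in days:
            length += 1
        best = max(best, length)
    return best


mapping_days = [
    date(2020, 3, 1), date(2020, 3, 2), date(2020, 3, 3),
    date(2020, 3, 5), date(2020, 3, 6),
]
print(longest_streak(mapping_days))  # prints 3
```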

matkoniecz commented 5 years ago

small backend (php script + a few tables)

This seems feasible in general and I expect to be able to implement it (especially given that with this architecture it should be possible to recover from data corruption), but I am stuck at "how access to server with statistics would be authenticated".

Making this API open has a separate set of issues that may be potentially problematic and I have no good idea how to handle authentication in any reasonable way.

Requiring users to create separate password for StreetComplete is not ok. And passing OpenStreetMap access token to server seems to be also a very bad idea. Or maybe it would be ok? Or is there some other way to establish user-specific secret? Leaving this API public has its own issues.

mmd-osm commented 5 years ago

storing data in user preferences means that there is no reasonable way to handle data corruption and it is tricky to support data already lost by users reinstalling application.

No, you could fix this from any Javascript page, which has been authenticated against osm.org, and simply read and update the user preferences. It's nothing really magical, just a key value store. The only downside I see is that it's open to all apps, so other evil apps could silently change/delete some keys. Maybe it would be an idea to extend this concept to include the OAuth app, so different apps are isolated from each other (this would require some changes to the API which could be relevant to other apps as well).

Data lost by users: yes, that's not covered. I don't know how much of a problem that is.

each device has its own preference key for each saved preference

Depends on how you use the key value store. If you have one key for SC (per user), they would all share the same value.

There is no way to initially populate it and there is no way to debug/fix it once things start failing.

Once you roll out this change, you would simply take the current counter and publish it in the user preferences. I don't see the debug/fix part as an issue, overpass turbo works just fine with that kind of storage.

but by attempting to store there data that would not be (easily) recoverable once lost.

You can fetch the same via a simple HTTP call as an authenticated user; the same goes for updating and deleting.

Maybe you need to do a multi step approach anyway, starting small by keeping the current star count on osm.org, and later add something more sophisticated on your own server.

matkoniecz commented 5 years ago

No, you could fix this from any Javascript page, which has been authenticated against osm.org, and simply read and update the user preferences. It's nothing really magical, just a key value store.

Each user can do this relatively easily. But the app author is unable to fix it directly. For example bug in app set this value to 0 for all users that used broken version. Now what?

Evil apps are probably not a problem, it is just a number - not something that is valuable for others or even prominent enough to make it interesting to vandalize.

Once you roll out this change, you would simply take the current counter and publish it in the user preferences.

Current star count may not include count from a previous device/installation.

mmd-osm commented 5 years ago

For example bug in app set this value to 0 for all users that used broken version. Now what?

That's pretty much the same as today, in case the app is broken in some way and messes with the local counter. How do you handle this as of now?

Current star count may not include count from a previous device/installation.

One option to retrieve this kind of information would be by processing all changeset metadata (preferably as a dump + regular updates), and analyze relevant changesets (also preferably using a dump + regular updates), both of which require a dedicated server. The server has to publish the star count via some API for the app to fetch. That's the "more sophisticated" thing I mentioned before.

Another option: maybe you could talk to the OSMCha folks, if you could use their API for your purposes? They provide filters based on username, period of time, editor, and return a number of create/modify/delete operations per changeset. I think that could be a good starting point for your stats.

westnordost commented 5 years ago

Requiring users to create separate password for StreetComplete is not ok. And passing OpenStreetMap access token to server seems to be also a very bad idea. Or maybe it would be ok? Or is there some other way to establish user-specific secret? Leaving this API public has its own issues.

The changeset history is part of the public API that does not require authentication. So, why authenticate in the first place? Also, weren't your scripts based on digging through the planet file with history on a daily/weekly (or so) basis rather than talk with the OSM API?

matkoniecz commented 5 years ago

Yes, it aggregates public data but

  1. processing public data is not always acceptable and in some cases is very invasive (not claiming that it applies here, but "only public data was used" is not always enough to make it OK).
  2. there are plans to restrict this data due to GDPR, which indicates that it may be considered private - see https://wiki.openstreetmap.org/w/images/8/88/GDPR_Position_Paper.pdf ("Reduce data availability as proposed in Appendix B" and the entire appendix B)
    1. I am not sure about the status of GDPR-related changes (AFAIK no data restriction has happened, and GDPR is now active) - but there is at least some risk of suddenly becoming obligated to restrict access. At a minimum it would be nice to have a plan for implementing this.
    2. OSM precedents: see https://josm.openstreetmap.de/ticket/15754 where including the edit count in JOSM changeset tags was rejected due to privacy concerns. HDYC ("how did you contribute") is a somewhat similar service; it was put behind a (toothless) login wall due to privacy concerns - see https://help.openstreetmap.org/questions/56066/does-neis-one-how-did-you-contribute-require-login - both of these were overkill in my opinion, but they indicate that at least part of the OSM community takes a broader view of privacy than I do

Overall, I think that it would be OK to make this public - but I am not sure whether that is the consensus, and GDPR may obligate us to restrict access anyway.

mmd-osm commented 5 years ago

The following wiki page summarizes all changes that are planned for GDPR compliance: https://wiki.openstreetmap.org/wiki/GDPR/Affected_Services

I don’t recall where changes to planet or diff files are documented. I think there was some plan to require an additional log on.

There were some blog posts to find people implementing those changes.

User display name, id and changeset are typically hidden as an anonymous user.

Besides HDYC, OSMCha also requires a logged-in user before any changeset is shown.

westnordost commented 5 years ago

So, the part of getting the data should be fine, because we could have a logged-in StreetComplete user to retrieve the data that is not public (in the future). (There is actually already one.) It will however be required to delete the data associated with a user when that user decides to delete their account on openstreetmap.org. If I remember correctly, a list of deleted user ids is made public somewhere, so the backend needs to check every now and then whether it has data of deleted users and, if so, delete this data.

On the part of giving out that data, on the client side, StreetComplete will of course only start showing the data once the user is successfully logged in. This is of course only a data protection through the client. On the backend side, a simple measure would be to only allow access if the user agent is StreetComplete.
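
A user agent check of this kind is only a few lines. The sketch below is Python for illustration (the backend itself is PHP), and the expected prefix is an assumption about what the app sends, not a confirmed value.

```python
# Assumed prefix of the app's User-Agent header; not confirmed by the thread.
EXPECTED_AGENT_PREFIX = "StreetComplete"


def is_streetcomplete_client(headers):
    """True if the request's User-Agent header starts with the app name."""
    for name, value in headers.items():
        if name.lower() == "user-agent":  # header names are case-insensitive
            return value.startswith(EXPECTED_AGENT_PREFIX)
    return False


print(is_streetcomplete_client({"User-Agent": "StreetComplete 18.0"}))  # prints True
print(is_streetcomplete_client({"User-Agent": "curl/7.68.0"}))          # prints False
```

Since the client sets the header itself, this only deters casual access.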

This measure of course is ineffective from a data security point of view, but neither is the (planned) measure to restrict access to changeset information to only logged-in OSM users nor the measure of HDYC: Any person with a little technical background will be able to circumvent it to get the data anyway.

But I believe data security not to be the point - if someone deliberately circumvents such a measure, he is aware that access is not allowed and that he is potentially in breach of the GDPR.

westnordost commented 5 years ago

The other, stronger measure, is to present each user with another OAuth login screen when entering the statistics screen, but this one identifies as a different app, i.e. "StreetComplete Stats" and requires actually no permissions at all from the user but only takes the token to authenticate the user - similarly as HDYC does it. But I do not believe this is necessary.

matkoniecz commented 5 years ago

For complete overkill, without any additional logins, app may create public+private key, save private as user preference (globally), create changeset containing public key as one of tags and call the SC server with pointer to that changeset.

API would be able to encrypt responses directed to that user with public key and make it usable only to holder of private key (this specific user).

But hopefully

But I believe data security not to be the point - if he deliberately circumvents such a measure, he is aware that access is not allowed and that he is potentially in breach of the GDPR.

is good enough for this purpose, this data will be anyway trivial to reconstruct from history and based solely on past edits.

Such overengineering would be hilariously pointless as the script computing this data will be released as open source. So anyone will be able to run this script (or write their own) and compute all these statistics anyway.

westnordost commented 5 years ago

Pretty cool authentication idea though. "OpenOsmChangesetId" (analogous to OpenID), heh

mmd-osm commented 5 years ago

Overpass api has a similar requirement to only hand out data for a logged on osm.org user - without osm.org having to know when and which query was executed.

There’s a proposal to generate a token on osm.org which can later be presented to Overpass api. Once a valid token is presented to the server, a user will have full access to metadata.

Downside of it is that it’s not yet available on production.

https://github.com/openstreetmap/openstreetmap-website/pull/2145

matkoniecz commented 4 years ago

That's pretty much the same as today, in case the app is broken in some way and messes with the local counter. How do you handle this as of now?

Release a new version that does not have this bug. It will retrieve fresh data from the statistics server and replace its older data.

A similar thing could be done with client-only data storage in preferences. But after an update instructing clients to throw away the cache, it would require all StreetComplete users to redownload past statistics using the OSM API.

With a central server, after complete data corruption it would be possible to regenerate this data from the history or changeset dump, or to use some other smart solution. With client-only storage the solution would require downloading all of this using the OSM API.

matkoniecz commented 4 years ago

The plan for now is to rely on changeset metadata only and make simple server publishing total star count and

Find all changesets by user, paginate through it because only 100 are shown at a time, for all changesets that are created_by StreetComplete, download the whole changeset (ugh...) and count the number of unique elements affected.

With the exception that, rather than downloading the full changeset, it would use the "changes_count" attribute

For example https://api.openstreetmap.org/api/0.6/changesets?user=1722488
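
A sketch of the metadata-only approach: parse one page of the changeset listing, keep the changesets created by StreetComplete and sum their changes_count per quest type. The sample XML is hand-written rather than real API output, and the StreetComplete:quest_type tag key is an assumption of this example.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hand-written sample in the shape of a changeset listing (not real data).
SAMPLE_CHANGESETS = """<osm version="0.6">
  <changeset id="3" created_at="2020-01-03T10:00:00Z" open="false" uid="1" changes_count="4">
    <tag k="created_by" v="StreetComplete 17.0"/>
    <tag k="StreetComplete:quest_type" v="AddWayLit"/>
  </changeset>
  <changeset id="2" created_at="2020-01-02T10:00:00Z" open="false" uid="1" changes_count="25">
    <tag k="created_by" v="JOSM/1.5 (15628 en)"/>
  </changeset>
  <changeset id="1" created_at="2020-01-01T10:00:00Z" open="false" uid="1" changes_count="2">
    <tag k="created_by" v="StreetComplete 17.0"/>
    <tag k="StreetComplete:quest_type" v="AddRoadSurface"/>
  </changeset>
</osm>"""


def stars_by_quest_type(changesets_xml):
    """Sum changes_count per quest type over StreetComplete changesets."""
    counts = Counter()
    for changeset in ET.fromstring(changesets_xml).iter("changeset"):
        tags = {tag.get("k"): tag.get("v") for tag in changeset.iter("tag")}
        if not tags.get("created_by", "").startswith("StreetComplete"):
            continue  # made with another editor
        quest_type = tags.get("StreetComplete:quest_type", "unknown")
        counts[quest_type] += int(changeset.get("changes_count", "0"))
    return dict(counts)


print(stars_by_quest_type(SAMPLE_CHANGESETS))
```

Because it relies on changes_count, this counts an edit and its revert as two changes.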

Yes, I noticed the new "changes_count" attribute (see https://api.openstreetmap.org/api/0.6/changeset/70163079), but how are reverted changes counted?

https://api.openstreetmap.org/api/0.6/changeset/79195526 - modifying and reverting edits in a single changeset counts as two edits, the same as in SC #1537

If it is decided that #1537 should be implemented (it is WONTFIX at this moment), it would be possible to add some sort of metadata to changesets to count/mark undos.

matkoniecz commented 4 years ago

Implementing this, using https://wiki.openstreetmap.org/wiki/API_v0.6#Query:_GET_.2Fapi.2F0.6.2Fchangesets :

Step 1: fetch data using API and create a local database with table that has following fields

With such table we should be able to answer questions about statistics.

To coordinate fetching data it is necessary to have closed status for changesets to list one where edit count may become different and have for each user info about


Initial implementation will do easiest possible thing


initial download:

For given user download sequentially changeset data, using https://api.openstreetmap.org/api/0.6/changesets?user=1722488&time=$BIRTH_DATE_OF_SC,$LOWER_RANGE_BOUNDARY

The earliest created_at date among the downloaded changesets becomes the new latest date that may still have earlier unfetched changesets. The upper date boundary is updated to the max of its value and the latest changeset. Fetched data should be stored in the database (only SC changesets). Repeat, until earliest date of possible new changesets is greater than date ).
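
The initial download loop might look like the sketch below. The network call is replaced by an injected function so the example runs offline. That function stands in for a changesets query with a time window and is assumed to return at most 100 changesets, newest first. All names are hypothetical.

```python
from datetime import datetime, timedelta

SC_BIRTH = datetime(2017, 2, 20)  # "date of birth" of StreetComplete
PAGE_SIZE = 100                   # the API returns at most 100 changesets


def fetch_all_changesets(fetch_page, now):
    """Walk backwards in time until every changeset since SC_BIRTH is seen."""
    seen = {}
    upper = now
    while True:
        page = fetch_page(SC_BIRTH, upper)
        new = [c for c in page if c["id"] not in seen]
        for changeset in new:
            seen[changeset["id"]] = changeset
        if len(page) < PAGE_SIZE or not new:
            break  # reached the oldest changeset in the window
        # Earliest date fetched so far becomes the next upper boundary.
        upper = min(c["created_at"] for c in page)
    return sorted(seen.values(), key=lambda c: c["created_at"])


def make_fake_api(changesets):
    """Stand-in for the OSM API: newest first, at most PAGE_SIZE per call."""
    def fetch_page(start, end):
        window = [c for c in changesets if start <= c["created_at"] <= end]
        window.sort(key=lambda c: c["created_at"], reverse=True)
        return window[:PAGE_SIZE]
    return fetch_page


fake_data = [
    {"id": i, "created_at": datetime(2019, 1, 1) + timedelta(hours=i)}
    for i in range(250)
]
result = fetch_all_changesets(make_fake_api(fake_data), now=datetime(2020, 1, 1))
print(len(result))  # prints 250
```

Deduplicating by changeset id keeps the loop correct even if a window boundary returns the same changeset twice.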


Data update:

westnordost commented 4 years ago

@matkoniecz What's the status? I am almost done with the UI part

westnordost commented 4 years ago

Mateusz is currently busy with other things. Is anyone interested in doing it, maybe someone who has created a backend before - @ENT8R or @exploide ? @matkoniecz already worked out some important considerations in the above comments. Otherwise I will get to it as soon as I am done with the frontend part.

exploide commented 4 years ago

Hi, I think I can't do this at the moment, sorry. If that changes soon, I will let you know. But if I should look into some detail just ping me.

ENT8R commented 4 years ago

Same for me... I also don't have that much time currently 😞

westnordost commented 4 years ago

Okay, thanks for the answers. Then I'll get to it next. I will do another post here once I start working on it.

westnordost commented 4 years ago

I'm starting to work on it now

westnordost commented 4 years ago

Almost done: https://github.com/westnordost/sc-statistics-service

@ENT8R , @exploide , @matkoniecz would you review it and open issues on the issue tracker there if you find something?

What is missing is the "index.php" with which to get the data and trigger the collection of the data as well as the cronjob that updates the data. However, these scripts will be rather short because all the logic is in the classes.

westnordost commented 4 years ago

Hmm while testing the PHP implementation I am getting doubts as to whether it makes sense at all to have a backend for this rather than calculate it directly on the device.

Looking through my whole changeset history took less than 6 seconds. I made 3,259 changesets. So that's about 600ms for each batch of 100 changesets. So even looking through the massive changeset history of a user like @matkoniecz (32,333 changesets) takes about a minute - once.

Even though I am almost finished with the PHP implementation, I should take a step back first and contrast the two options:

Advantages of local implementation:

Though, since the implementation in PHP is already done now, this is not that much of a good reason.

Advantages of backend:

Non-advantages of backend:

TODO2: what more advantages would a deep analysis of the changesets bring?

westnordost commented 4 years ago

Regardless, I am done with the implementation in PHP+MySQL for anyone wanting to review it. Thanks so far, @exploide

HolgerJeromin commented 4 years ago

I think ignoring quests which were solved >250 meters are not possible in both approaches. Or do I miss something?

So reinstalling can be considered a hack for boosting your star count.

westnordost commented 4 years ago

Not sure what you mean.

On 18 April 2020 08:35:57 CEST, Holger Jeromin notifications@github.com wrote:

I think ignoring quests which were solved >250 meters are not possible in both approaches. Or do I miss something?

So reinstalling can be considered a hack for boosting your star count.

HolgerJeromin commented 4 years ago

The app counts stars. If I am 1000 Meters away from the quest location I can solve the quest, but my star count will not rise (+0 stars).

When I change my phone this new feature will restore the star count. But the 1000 meter away quest will be +1 on the new device. And not +0 as on the old phone.

westnordost commented 4 years ago

If I am 1000 Meters away from the quest location I can solve the quest, but my star count will not rise (+0 stars).

This is not correct. Your star count should rise, it is a solved quest as any other.

westnordost commented 4 years ago

Okay, by doing a deep analysis of the changesets, reverted solved quests are now not counted toward the star count and split ways are only counted once. Additionally, I added a geocoder so that the changes can now be associated with countries. Here is an example output for a user:

{
  "questTypes": {
    "AddAccessibleForPedestrians": 24,
    "AddAddressStreet": 2,
    "AddBenchBackrest": 62,
    "AddBikeParkingCapacity": 9,
    "AddBikeParkingCover": 78,
    "AddBikeParkingType": 3,
    "AddBridgeStructure": 5,
    "AddBuildingLevels": 199,
    "AddBuildingType": 581,
    "AddBusStopName": 1,
    "AddBusStopShelter": 59,
    "AddCarWashType": 3,
    "AddCrossingType": 146,
    "AddCycleway": 41,
    "AddCyclewaySegregation": 2,
    "AddFireHydrantType": 1,
    "AddForestLeafType": 1,
    "AddHandrail": 2,
    "AddHousenumber": 26,
    "AddMaxHeight": 13,
    "AddMaxSpeed": 87,
    "AddMaxWeight": 2,
    "AddOneway": 1,
    "AddOpeningHours": 88,
    "AddParkingAccess": 40,
    "AddParkingFee": 18,
    "AddParkingType": 35,
    "AddPathSurface": 171,
    "AddPlaceName": 20,
    "AddPlaygroundAccess": 26,
    "AddPostboxCollectionTimes": 1,
    "AddProhibitedForPedestrians": 20,
    "AddRailwayCrossingBarrier": 33,
    "AddRecyclingContainerMaterials": 1,
    "AddRecyclingType": 1,
    "AddReligionToPlaceOfWorship": 9,
    "AddReligionToWaysideShrine": 8,
    "AddRoadName": 172,
    "AddRoadSurface": 682,
    "AddRoofShape": 61,
    "AddSegregated": 2,
    "AddSidewalk": 21,
    "AddSport": 4,
    "AddTactilePavingBusStop": 10,
    "AddTactilePavingCrosswalk": 161,
    "AddToiletsFee": 2,
    "AddTracktype": 1,
    "AddTrafficSignalsButton": 4,
    "AddTrafficSignalsSound": 3,
    "AddVegetarian": 3,
    "AddWayLit": 822,
    "AddWheelchairAccessBusiness": 8,
    "AddWheelchairAccessDogPark": 1,
    "AddWheelchairAccessToilets": 2,
    "DetailPavedRoadSurface": 1,
    "IsBuildingUnderground": 1,
    "MarkCompletedBuildingConstruction": 6,
    "MarkCompletedConstruction": 3,
    "MarkCompletedHighwayConstruction": 31
  },
  "countries": {
    "AT": 1,
    "CN-XZ": 3,
    "CY": 66,
    "CZ": 11,
    "HR": 1,
    "IL": 10,
    "IT": 39,
    "MA": 3,
    "PL": 3678,
    "SK": 8
  },
  "daysActive": 174,
  "lastUpdate": "2020-03-22T12:55:14+00:00"
}
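
A client could derive the total star count from a response of this shape by summing the per-quest-type counts, as in the sketch below. The sample is a trimmed, made-up response with the same keys, not the output shown above.

```python
import json

# Trimmed, made-up response using the same keys as the example output.
SAMPLE_RESPONSE = """{
  "questTypes": {"AddWayLit": 822, "AddRoadSurface": 682},
  "countries": {"PL": 1500, "AT": 4},
  "daysActive": 174,
  "lastUpdate": "2020-03-22T12:55:14+00:00"
}"""


def summarize(statistics_json):
    """Reduce a statistics response to a few headline numbers."""
    stats = json.loads(statistics_json)
    return {
        "total_stars": sum(stats["questTypes"].values()),
        "countries_visited": len(stats["countries"]),
        "days_active": stats["daysActive"],
    }


print(summarize(SAMPLE_RESPONSE)["total_stars"])  # prints 1504
```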

westnordost commented 4 years ago

I plan to deploy it this evening or tomorrow, last chance to review the code before I destroy my webspace by deploying an insecure PHP implementation ;-) @Akasch @matkoniecz @ENT8R

@exploide found some PHP-specific caveats, for which I am thankful because it is PHP knowledge I lack. Also, @exploide , maybe you would like to review what I added this weekend?

westnordost commented 4 years ago

Also, @matkoniecz , do you have a list of all user ids of streetcomplete users? If I populate the database with data before it is queried, users will get their statistics right away without waiting.

exploide commented 4 years ago

I took a look at the changes and I think it is fine.

In get_statistics.php there is still the GDPR TODO concerning user-agent "protection".

A minor comment concerning mysqli_report in this file: For public facing web endpoints, one usually wants to disable SQL error reporting completely.

westnordost commented 4 years ago

In get_statistics.php there is still the GDPR TODO concerning user-agent "protection".

Jup I know, will do that at the very end.

A minor comment concerning mysqli_report in this file: For public facing web endpoints, one usually wants to disable SQL error reporting completely.

Thanks for the hint!

HolgerJeromin commented 4 years ago

Thanks! Now I am finally feeling at home on the phone I got 7 months ago. :-) (6271)

westnordost commented 4 years ago

Congratulations, I only have 2003

smichel17 commented 4 years ago

1991 here, about to catch you

…but really, you have many more, since mine (and many others) would not exist without this app :)