pierzen / osm-contributor-stats

Osm Contributor Statistics
GNU General Public License v3.0

Create a settings prompter to avoid errors and improve UX #2

Closed · hyances closed this 9 years ago

hyances commented 9 years ago

The error below occurs when running the script for the first time, without edits; prompting the user for the required parameters would avoid this:

```
~/osm-contributor-stats-master$ python Script-to-extract-Extract-Objects-Calculate-Statistics-from-OsmContributorStats-Module.py
Traceback (most recent call last):
  File "Script-to-extract-Extract-Objects-Calculate-Statistics-from-OsmContributorStats-Module.py", line 15, in <module>
    os.chdir('c:\OsmContributorStats\')
OSError: [Errno 2] No such file or directory: 'c:\OsmContributorStats\'
```
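A minimal sketch of the kind of prompt proposed here, assuming Python 3; the prompt text and default path are illustrative, not from the script:

```python
# Hypothetical settings prompter: ask for the working directory instead of
# hard-coding it, and fail with a clear message if it does not exist.
import os

default = r'c:\OsmContributorStats'
path = input('Directory holding the OsmContributorStats module [{}]: '
             .format(default)) or default
if not os.path.isdir(path):
    raise SystemExit('Directory not found: {!r}. Edit the path and rerun.'
                     .format(path))
os.chdir(path)
```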

pierzen commented 9 years ago

Before you import and instantiate both OsmApi and OsmContributorStats, you need to specify the directory where they are stored.

I have revised the instructions to remove any ambiguity in https://github.com/pierzen/osm-contributor-stats/blob/master/Script-to-extract-Extract-Objects-Calculate-Statistics-from-OsmContributorStats-Module.py
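Schematically, the top of the script then looks something like this; the path is an example, and the import lines follow the module names mentioned in this thread:

```python
# Point Python at the directory that holds OsmApi and OsmContributorStats
# before importing them; adjust MODULE_DIR to your own setup.
import os
import sys

MODULE_DIR = r'c:\OsmContributorStats'   # example location
os.chdir(MODULE_DIR)          # the script expects to run from this directory
sys.path.append(MODULE_DIR)   # make the modules importable

import OsmApi                 # module names as referenced in this thread
import OsmContributorStats
```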

hyances commented 9 years ago

OK, thanks. I think we can keep improving the script until it becomes an app. In HOT we already have TM, Exports, OAM, etc., but (I think) no web application for generating stats on users working across different tasks or countries.

To start, I propose making it "plug & play": just run the "Script2Extract", supply the required parameters, and then begin with query and report generation.

If you prefer, I can work in a fork; please let me know what you think.

pierzen commented 9 years ago

I myself have plans to develop such a web application, but first we need a more robust script.

The module that extracts data from the OSM API is not yet robust enough for large volumes of data: if it times out, you have to rerun from the start. Using it to produce the Ebola activation contributor statistics, I had significant problems, especially in October 2014 with the record daily contributions.

To correct this problem, I started to rewrite the script using an SQLite database to store the data. It also stores metadata that lets the script restart extraction from the point of the last timeout.
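A rough sketch of that checkpoint idea, with an illustrative SQLite schema (the table, column, and file names are not from the actual script):

```python
# Record each completed time segment in SQLite so a rerun can detect the
# last checkpoint and resume instead of starting over.
import sqlite3

conn = sqlite3.connect('contributor_stats.db')  # illustrative filename
conn.execute("""CREATE TABLE IF NOT EXISTS progress (
                    segment_start TEXT,
                    segment_end   TEXT)""")

def mark_done(start, end):
    """Record a fully extracted time segment."""
    conn.execute("INSERT INTO progress VALUES (?, ?)", (start, end))
    conn.commit()

def resume_point():
    """Return the end of the last completed segment, or None on a fresh run."""
    return conn.execute("SELECT MAX(segment_end) FROM progress").fetchone()[0]
```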

hyances commented 9 years ago

I'm facing similar problems with stats for Colombia: I cannot get past 17/01/2014 (starting from 01/01/2014). One idea is to split a big BBOX into smaller pieces, just as TM does, then loop over each of them, adding the results to the selected reports; but your approach looks better suited to large extracts. BBOX splits could work well for administrative boundaries, so maybe a mix of both?
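For illustration, a minimal version of the BBOX-splitting idea, with a hypothetical grid size and extent:

```python
# Cut a bounding box into an n x n grid and iterate over the cells,
# querying each smaller BBOX in turn and merging the reports afterwards.
def split_bbox(left, bottom, right, top, n=2):
    dx, dy = (right - left) / n, (top - bottom) / n
    for i in range(n):
        for j in range(n):
            yield (left + i * dx, bottom + j * dy,
                   left + (i + 1) * dx, bottom + (j + 1) * dy)

# Example: a rough extent around Colombia, split into four quadrants.
for cell in split_bbox(-79.0, -4.3, -66.8, 12.5):
    print(cell)
```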

pierzen commented 9 years ago

I have run a few experiments rewriting the changeset extraction. I tried a version that extracts from Overpass, cutting the BBOX into smaller parts, but I ended up with the same constant timeout problem as with the original OSM API script.

With the OSM API, the critical point is extracting the list of changesets, given the limit of 100 changesets returned per query. With GET /api/0.6/changesets it is possible to play with both the BBOX and the time span; it is not possible to provide a polygon.
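For illustration, such a query with the requests library; the BBOX and time values are examples:

```python
# GET /api/0.6/changesets with bbox=left,bottom,right,top and time=T1,T2;
# the API returns at most 100 changesets per query.
import requests

params = {
    'bbox': '-79.0,-4.3,-66.8,12.5',                      # example extent
    'time': '2014-01-01T00:00:00Z,2014-01-02T00:00:00Z',  # example span
}
r = requests.get('https://api.openstreetmap.org/api/0.6/changesets',
                 params=params)
r.raise_for_status()
print(r.text[:500])  # XML list of changesets
```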

My solution is to loop, querying shorter time spans. On a daily basis, the script produces a list of time segments stored in a command stack for querying the list of changesets. If a time segment holds more than 100 changesets, only the first 100 are returned; I then need to split that time segment and add the new commands to the stack to proceed. By storing this metadata in SQLite, it is possible to detect where the procedure was interrupted and restart from that point.
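A simplified sketch of that stack logic, with an in-memory list standing in for the SQLite-backed command stack and an illustrative BBOX:

```python
# Pop a time segment, query it, and if the 100-changeset cap is hit,
# split the segment in half and push both halves back on the stack.
import requests
import xml.etree.ElementTree as ET
from datetime import datetime

API = 'https://api.openstreetmap.org/api/0.6/changesets'
BBOX = '-79.0,-4.3,-66.8,12.5'  # example extent

def fetch_changesets(start, end):
    """One GET /api/0.6/changesets call for the given time segment."""
    span = '{},{}'.format(start.strftime('%Y-%m-%dT%H:%M:%SZ'),
                          end.strftime('%Y-%m-%dT%H:%M:%SZ'))
    r = requests.get(API, params={'bbox': BBOX, 'time': span})
    r.raise_for_status()
    return ET.fromstring(r.content).findall('changeset')

stack = [(datetime(2014, 1, 1), datetime(2014, 1, 2))]  # one day to start
changesets = []
while stack:
    start, end = stack.pop()
    batch = fetch_changesets(start, end)
    if len(batch) >= 100:            # cap reached: the segment is too wide
        mid = start + (end - start) / 2
        stack.append((mid, end))
        stack.append((start, mid))   # requeue both halves, earlier one first
    else:
        changesets.extend(batch)     # segment fully covered
```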

Storing the metadata and restarting where the last query was interrupted also reduces the load on the OSM API, since you avoid querying the same time span twice when a problem (timeout or otherwise) occurs.

If you query for a town with this script, it is generally not a problem, since there should be at most a few hundred changesets per day. If there are thousands of changesets, even with hourly time spans there is a good chance that some query will hit the 100-changeset limit; you then need logic to take care of this.