michaelvl / osm-analytic-tracker

OpenStreetMap Analytic Difference Engine
GNU General Public License v2.0

Unexpected amount of traffic #19

Closed: mmd-osm closed this issue 5 years ago

mmd-osm commented 5 years ago

For some time now, your application running on host osm.expandable.dk has ranked near the top in total number of requests to osm.org. For September 2018, for example, that is about 4 million hits. For an application analyzing changes in Denmark, I find this a bit unexpected.

I'm not sure if you're aware of this situation. Maybe this would be a good opportunity to check for hidden bugs. In addition, it might be possible to pull some of the information from other sources, such as the planet.osm.org minutely diffs (no, I haven't looked at your code at all; this is just a guess).
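For reference, a minimal sketch of how the minutely diffs are addressed. The replication directory on planet.openstreetmap.org publishes a `state.txt` with the latest `sequenceNumber`, and each diff lives at a path derived from that number, zero-padded to nine digits. The function names are illustrative and not taken from the tracker.

```python
BASE_URL = "https://planet.openstreetmap.org/replication/minute"


def parse_state(text):
    """Parse a replication state.txt (Java-properties style) into a dict."""
    state = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blank lines and the leading comment line
        key, value = line.split("=", 1)
        # Colons in the timestamp are escaped as "\:" in the file.
        state[key] = value.replace("\\:", ":")
    return state


def diff_url(sequence, base_url=BASE_URL):
    """Return the URL of the .osc.gz diff for a given sequence number."""
    padded = f"{sequence:09d}"
    return f"{base_url}/{padded[0:3]}/{padded[3:6]}/{padded[6:9]}.osc.gz"
```

A client would fetch `state.txt` roughly once a minute, compare `sequenceNumber` with the last one it processed, and download only the diffs in between.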

michaelvl commented 5 years ago

Thank you for reporting this - this was certainly not expected. The only explanation I can come up with is that a bot has been updating many addresses in DK from official sources. Whether that coincides with the high load, I do not know. Are the load statistics publicly available, and are they available for months other than September?

Do you know the severity of this? If this is causing any issues I will disable the service while investigating the issue.

mmd-osm commented 5 years ago

> Are the load statistics publicly available, and are they available for months other than September?

Yes, you can access all months in 2018, and possibly 2017 via AWStats. It takes quite some time to load, though: https://stats.openstreetmap.org/cgi-bin/awstats.pl?month=08&year=2018&output=allhosts&config=www.openstreetmap.org&framename=index

IIRC, the other months showed a similar pattern, which is why I suspected that something in your app was causing this number of requests. I don't know exactly which APIs you're consuming, but anything in the api.openstreetmap.org/* range is typically reserved for editing applications, while analytical apps should fetch their data from the available diff sources.

> Do you know the severity of this?

I have no idea about that; the OSM sysadmin team should be able to give you more details.

Hjart commented 5 years ago

osm.expandable.dk is an immensely valuable service to Danish mappers (I have personally been following it pretty much every day since it started), so I'll be sad to see it disabled for any length of time.

michaelvl commented 5 years ago

The Aug/Sep/Oct monthly average seems to be 4M API accesses amounting to 9G of traffic.
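To put those totals in perspective, here is a quick back-of-the-envelope conversion to an average request rate and response size. It assumes a 30-day month and reads "9G" as 9 × 10^9 bytes; both are assumptions, not figures from AWStats.

```python
SECONDS_PER_DAY = 86_400


def average_load(requests_per_month, bytes_per_month, days=30):
    """Return (requests per second, bytes per request) for a monthly total."""
    per_second = requests_per_month / (days * SECONDS_PER_DAY)
    per_request = bytes_per_month / requests_per_month
    return per_second, per_request


rate, size = average_load(4_000_000, 9 * 10**9)
print(f"{rate:.2f} requests/s, {size:.0f} bytes/request")
# prints "1.54 requests/s, 2250 bytes/request"
```

That is roughly three requests every two seconds around the clock, each returning a couple of kilobytes.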

michaelvl commented 5 years ago

#21 Metrics for monitoring API usage

michaelvl commented 5 years ago

#22 Optimization of changeset metadata retrieval.

nrenner commented 5 years ago

I read that osm.expandable.dk is now down because of this.

Are you considering alternative data sources?

michaelvl commented 5 years ago

I'm not aware of alternative methods for retrieving changeset data. Nodes, ways, etc. can be retrieved via replication, but not the changeset itself, AFAIK.

mmd-osm commented 5 years ago

What do you mean by "the changeset itself" exactly? Which API call does that refer to, as of now?

For changeset metadata, I mentioned the respective replication stream in the other issue: https://github.com/MichaelVL/osm-analytic-tracker/issues/22#issuecomment-441428771

michaelvl commented 5 years ago

The tracker has been updated to fetch the metadata from the changeset replication stream.

The changeset content is currently fetched from the API, URL /api/0.6/changeset
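As an illustration of what consuming the changeset replication stream can look like, the sketch below picks closed changesets that overlap a region out of one decompressed replication file. It is not the tracker's actual code: the function name is hypothetical and the Denmark bounding box is a rough assumption.

```python
import xml.etree.ElementTree as ET

# Rough bounding box for Denmark (min_lon, min_lat, max_lon, max_lat); an assumption.
DENMARK_BBOX = (7.7, 54.5, 15.6, 58.0)


def closed_changesets_in_bbox(xml_text, bbox=DENMARK_BBOX):
    """Return ids of closed changesets whose bounding box overlaps bbox."""
    min_lon, min_lat, max_lon, max_lat = bbox
    ids = []
    for cs in ET.fromstring(xml_text).iter("changeset"):
        if cs.get("open") != "false" or cs.get("min_lat") is None:
            continue  # still open, or an empty changeset without a bounding box
        overlaps = (
            float(cs.get("min_lon")) <= max_lon
            and float(cs.get("max_lon")) >= min_lon
            and float(cs.get("min_lat")) <= max_lat
            and float(cs.get("max_lat")) >= min_lat
        )
        if overlaps:
            ids.append(int(cs.get("id")))
    return ids
```

Only the changesets returned here would need any follow-up request for their content.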

mmd-osm commented 5 years ago

OK, I think the main issue here is that changesets are not atomic: as long as a changeset is open, you can upload changes to it as often as you like (up to 10,000 changes, with a maximum idle time of 1 hour and a total changeset lifetime of 24 hours). Every time you upload some changes, they appear in the minutely diff. The details of one changeset can therefore span several minutely diff files.
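To make this concrete: every element in a minutely diff (osmChange format) carries a `changeset` attribute, so one diff can be split into per-changeset pieces. A minimal sketch, with a hypothetical function name:

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def group_by_changeset(osc_text):
    """Map changeset id -> list of (action, element type, id, version)."""
    changes = defaultdict(list)
    for action in ET.fromstring(osc_text):   # <create>, <modify>, <delete>
        for element in action:               # <node>, <way>, <relation>
            changes[int(element.get("changeset"))].append(
                (
                    action.tag,
                    element.tag,
                    int(element.get("id")),
                    int(element.get("version")),
                )
            )
    return dict(changes)
```

The same changeset id can show up again in a later diff, which is exactly the non-atomic behaviour described above.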

michaelvl commented 5 years ago

That is correct. The analyser fetches the complete changeset once it is closed and then displays the content. Collecting the content of a changeset across all the minutely diffs does not seem like an attractive approach, and it could also mean that some changes are lost if not all minutely diffs are processed.

mmd-osm commented 5 years ago

Compare this approach with, for example, how OSMCha's backend is implemented (https://www.openstreetmap.org/user/geohacker/diary/40846): they basically create an augmented diff for each changeset from the information in the minutely diffs. Those files are updated whenever subsequent minutely diffs include additional changes for existing changesets. Even before a changeset is closed, they already store per-changeset information locally.

I believe that's probably the only reasonable approach based on minutely diffs. Of course, you need to closely monitor that no minutely diffs are skipped.
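A minimal sketch of that bookkeeping, assuming each diff has already been reduced to a dict mapping changeset id to a list of change records. The class and method names are hypothetical, and this is not how OSMCha implements it; it only illustrates accumulating per-changeset state while refusing to continue past a skipped sequence number.

```python
class ChangesetAccumulator:
    """Collect per-changeset changes across consecutive minutely diffs."""

    def __init__(self, start_sequence):
        self.next_sequence = start_sequence
        self.changes = {}  # changeset id -> list of change records

    def add_diff(self, sequence, grouped_changes):
        """Merge one diff; raise if a sequence number was skipped or repeated."""
        if sequence != self.next_sequence:
            raise ValueError(
                f"expected diff {self.next_sequence}, got {sequence}"
            )
        for changeset_id, records in grouped_changes.items():
            self.changes.setdefault(changeset_id, []).extend(records)
        self.next_sequence = sequence + 1

    def pop_closed(self, changeset_id):
        """Hand over and forget the changes of a changeset reported as closed."""
        return self.changes.pop(changeset_id, [])
```

On a `ValueError` the caller would fetch the missing diffs before continuing, so that no changeset ends up incomplete.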

Hjart commented 5 years ago

What are the chances of osm.expandable.dk resuming service? Are you in contact with the osm server admins?

nrenner commented 5 years ago

> The changeset content is currently fetched from the API, url /api/0.6/changeset

And to get tag changes and geometries for the visual diff, you need to additionally request the old version for all modified and deleted objects in the changeset and missing way nodes, etc.

You might be aware and already using it, but one possible short-term optimization could be requesting previous versions in batches using Multi fetch with version numbers, which were added in 2017.
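A sketch of what such batching could look like. The multi fetch call accepts `id` + `v` + `version` entries (for example `nodes=123v1,456v2`); the helper names and the batch size of 100 are assumptions chosen to keep URLs short, not documented limits. Taking `version - 1` as the previous version also assumes the object was modified only once within the changeset.

```python
def previous_versions(modified):
    """For (id, new_version) pairs, return the (id, old_version) pairs to fetch."""
    return [(osm_id, version - 1) for osm_id, version in modified if version > 1]


def multi_fetch_urls(element_type, versions, batch_size=100,
                     api_base="https://api.openstreetmap.org/api/0.6"):
    """Build multi fetch URLs for specific (id, version) pairs.

    element_type is "node", "way" or "relation".
    """
    plural = element_type + "s"
    urls = []
    for start in range(0, len(versions), batch_size):
        batch = versions[start:start + batch_size]
        refs = ",".join(f"{osm_id}v{version}" for osm_id, version in batch)
        urls.append(f"{api_base}/{plural}?{plural}={refs}")
    return urls
```

With this, a changeset that modified 250 nodes needs three requests for the old versions instead of 250.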

Another option might be the cached changeset JSON diffs from the OSMCha setup that mmd already mentioned (given permission and still keeping a fallback when not available), e.g.: https://s3.amazonaws.com/mapbox/real-changesets/production/65725302.json

I deliberately avoided using the OSM API with achavi, apart from the changeset metadata call. But Overpass adiff has its limitations (as do the OSMCha diffs), and I'm still thinking about what would be needed to get around them - mmd has made some suggestions - or about potential alternatives such as setting up our own OSM API database or using other full-history databases like AWS Athena or OSHDB.

So my intention is to share ideas and maybe find a common solution for tools like the Analytic Difference Engine, OSMCha and achavi (and perhaps openstreetmap.org as well). Still planning to do a more detailed roundup.