mediagis / nominatim-docker

100% working container for Nominatim
Creative Commons Zero v1.0 Universal

Too frequent download attempts #577

Open woodpeck opened 3 weeks ago

woodpeck commented 3 weeks ago

I maintain the site download.geofabrik.de.

In my log files I am seeing a client identifying themselves as "mediagis/nominatim-docker:4.2.4" which attempts to download the same .osm.pbf file every 15 seconds (on average - this has been going on for days now). The server replies with a "304 not modified" response.

I don't know if this is standard behaviour or if there is maybe some sort of malfunction.

Since the download.geofabrik.de server only updates files once per day, it does not make sense to ask for a new file every 15 seconds. A site intent on speedy updates would not ask for full .osm.pbf files anyway but consume updates instead.

Sites making thousands of requests per day for a prolonged time will be blocked from accessing download.geofabrik.de.

It would be good if the mediagis image could make sure not to issue an undue number of requests to download.geofabrik.de.
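To make the "once per day" point concrete, here is a minimal sketch of a client-side throttle. It is not taken from the mediagis image; the function names, the stamp file, and the 24-hour interval are assumptions chosen for illustration. The stamp file would have to live on a persistent volume, otherwise a freshly started container has no record of earlier requests.

```python
import time
from pathlib import Path
from typing import Optional

# Assumption: the upstream extract changes at most once per day.
MIN_INTERVAL_SECONDS = 24 * 60 * 60


def should_request(stamp_file: Path, now: Optional[float] = None,
                   min_interval: float = MIN_INTERVAL_SECONDS) -> bool:
    """Return True if enough time has passed since the last recorded request."""
    now = time.time() if now is None else now
    try:
        last = float(stamp_file.read_text().strip())
    except (FileNotFoundError, ValueError):
        return True  # no usable record yet, so one request is allowed
    return now - last >= min_interval


def record_request(stamp_file: Path, now: Optional[float] = None) -> None:
    """Persist the time of the request so that later runs can see it."""
    now = time.time() if now is None else now
    stamp_file.write_text(f"{now:.0f}")
```

A download step would call `should_request(...)` first, skip the HTTP request entirely when it returns False, and call `record_request(...)` after each attempt, successful or not.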

leonardehrenfried commented 3 weeks ago

Hi Frederik,

first of all, thanks for maintaining download.geofabrik.de and I'm sorry this image is giving you grief.

We have had a very similar problem previously: https://github.com/mediagis/nominatim-docker/issues/416

The bottom line is that this image makes it very easy to build a Nominatim instance. This leads to an unfortunate situation where less skilled users set up installations that lead to the sort of problems you're seeing here.

My guess is that an installation encounters some sort of error and then retries the installation forever. The nature of container images is that there is no knowledge of the previous instance. To put it bluntly, a careless user ruins the experience for everyone else.

Can you block the user agent by IP address? That would be totally reasonable from my point of view.
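The "retries the installation forever" failure mode guessed at above could in principle be avoided with a bounded retry. The following is only a sketch under assumptions: `run_with_bounded_retries`, its parameters, and the delay values are hypothetical and do not describe what the image currently does.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_bounded_retries(
    action: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `action`, retrying with exponential backoff, then give up."""
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except Exception:
            if attempt == max_attempts:
                raise  # stop for good instead of looping forever
            # With the defaults: wait 60 s, 120 s, 240 s, ... between attempts.
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

With the default values, a permanently failing import would make five attempts over roughly 15 minutes and then stop, rather than issuing a request every few seconds for days.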

woodpeck commented 3 weeks ago

Yes, once an IP has racked up a 5-digit access count within a few days, that IP will usually be blocked. The sad thing is that we can't convey a message back to them saying "you have been blocked because of X".

I can't see details about the request in my log file, but since the server sends a "not modified" I guess the request must contain an "if-modified-since" header. So the client must have some state, or else it could not know which timestamp to put in the if-modified-since header?
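One common source of such state is the modification time of a previously downloaded copy of the file. Whether the image works this way is not confirmed in this thread; the sketch below only illustrates the mechanism, and `build_conditional_request` and its arguments are hypothetical names.

```python
import os
import urllib.request
from email.utils import formatdate


def build_conditional_request(url: str, local_path: str) -> urllib.request.Request:
    """Build a GET that asks for the file only if it is newer than our copy."""
    request = urllib.request.Request(url)
    if os.path.exists(local_path):
        # The "state" is simply the modification time of the local file.
        mtime = os.path.getmtime(local_path)
        request.add_header("If-Modified-Since", formatdate(mtime, usegmt=True))
    return request
```

If the local file sits on a volume that survives container restarts, every restart would send the same timestamp and receive the same 304 response, which would be consistent with the pattern described above.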

leonardehrenfried commented 3 weeks ago

Again, I'm really sorry for causing you problems.

I think rather than fiddle with the setup and deal with this problem for a long time, I would just make the image repository private so that some obstacles are being erected for new users. I don't have the energy and attention to keep dealing with careless users anymore.

cc @philipkozeny

woodpeck commented 3 weeks ago

That would probably be a somewhat extreme reaction ;) A short look at the log files tells me that I've had over 100 different IPs asking for something with a mediagis user agent in the last 24 hours, and only 10 of them made more than 100 requests (5 made more than 1000) - you'd probably throw the baby out with the bathwater.
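A per-IP tally like the one quoted above could be produced roughly as follows. This is a sketch that assumes the common "combined" access log format, with the client IP as the first field and the user agent as the last quoted field; the actual log format on download.geofabrik.de is not stated in this thread, and the function names are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable, List

# Assumed line layout ("combined" format):
# IP - - [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(r'^(\S+) .*"([^"]*)"\s*$')


def requests_per_ip(lines: Iterable[str], agent_substring: str) -> Counter:
    """Count requests per client IP whose user agent contains the substring."""
    counts: Counter = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and agent_substring in match.group(2):
            counts[match.group(1)] += 1
    return counts


def ips_over(counts: Counter, threshold: int) -> List[str]:
    """Return the IPs that made more than `threshold` requests, sorted."""
    return sorted(ip for ip, n in counts.items() if n > threshold)
```

Calling `ips_over(counts, 100)` and `ips_over(counts, 1000)` on one day of log lines would give the two groups mentioned in the comment.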

philipkozeny commented 2 weeks ago

Apologies for the trouble you're experiencing with the image! Just a quick question: Are you noticing this behavior with the latest 4.4 version, or only with earlier versions? I ask because we recently removed the "restart: always" option from the contrib Docker template.

leonardehrenfried commented 2 weeks ago

That's a good point. We probably want to remove the "restart: always" option from all of our docs, even the old ones.