mediagis / nominatim-docker

100% working container for Nominatim

Repeated download of extra data from nominatim.org #416

Closed lonvia closed 5 months ago

lonvia commented 1 year ago

Following up on #413, I'm afraid it looks like you are the culprit regarding the recent mass downloads. The mass downloaders have changed their user agent from curl/7.68.0 to {}. This user agent would be produced by your script (with the typo in the user agent customization) when USER_AGENT is empty. There is a tiny chance that somebody else made the same typo, but that seems rather unlikely, and the first {} user agent appeared after your change yesterday. User agents containing 'mediagis/nominatim-docker' only first showed up around 13:00 UTC today.
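
For illustration only (this is not the actual init.sh code): a misplaced brace in the shell expansion is one way an empty USER_AGENT could collapse into a literal {} user agent.

    # Hypothetical sketch, not the real script -- note the brace sitting outside
    # the variable expansion.
    USER_AGENT=""
    echo "User-Agent: {$USER_AGENT}"    # prints: User-Agent: {}
    echo "User-Agent: ${USER_AGENT}"    # prints: User-Agent:  (empty, as intended)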

Here is what the nginx log files look like:

......84 - - [05/Jan/2023:20:12:54 +0000] "GET /data/wikimedia-importance.sql.gz HTTP/1.1" 200 393574858 "-" "{}"
......84 - - [05/Jan/2023:20:12:56 +0000] "GET /data/gb_postcodes.csv.gz HTTP/1.1" 200 13609295 "-" "{}"
......84 - - [05/Jan/2023:20:13:49 +0000] "GET /data/wikimedia-importance.sql.gz HTTP/1.1" 200 393574858 "-" "{}"
......84 - - [05/Jan/2023:20:13:51 +0000] "GET /data/gb_postcodes.csv.gz HTTP/1.1" 200 13609295 "-" "{}"
......84 - - [05/Jan/2023:20:14:47 +0000] "GET /data/wikimedia-importance.sql.gz HTTP/1.1" 200 393574858 "-" "{}"
......84 - - [05/Jan/2023:20:14:49 +0000] "GET /data/gb_postcodes.csv.gz HTTP/1.1" 200 13609295 "-" "{}"
......84 - - [05/Jan/2023:20:15:43 +0000] "GET /data/wikimedia-importance.sql.gz HTTP/1.1" 200 393574858 "-" "{}"
......84 - - [05/Jan/2023:20:15:45 +0000] "GET /data/gb_postcodes.csv.gz HTTP/1.1" 200 13609295 "-" "{}"
......84 - - [05/Jan/2023:20:16:40 +0000] "GET /data/wikimedia-importance.sql.gz HTTP/1.1" 200 393574858 "-" "{}"
......84 - - [05/Jan/2023:20:16:42 +0000] "GET /data/gb_postcodes.csv.gz HTTP/1.1" 200 13609295 "-" "{}"
......84 - - [05/Jan/2023:20:17:37 +0000] "GET /data/wikimedia-importance.sql.gz HTTP/1.1" 200 393574858 "-" "{}"
......84 - - [05/Jan/2023:20:17:39 +0000] "GET /data/gb_postcodes.csv.gz HTTP/1.1" 200 13609295 "-" "{}"
......84 - - [05/Jan/2023:20:18:32 +0000] "GET /data/wikimedia-importance.sql.gz HTTP/1.1" 200 393574858 "-" "{}"
......84 - - [05/Jan/2023:20:18:34 +0000] "GET /data/gb_postcodes.csv.gz HTTP/1.1" 200 13609295 "-" "{}"
......84 - - [05/Jan/2023:20:19:31 +0000] "GET /data/wikimedia-importance.sql.gz HTTP/1.1" 200 393574858 "-" "{}"
......84 - - [05/Jan/2023:20:19:33 +0000] "GET /data/gb_postcodes.csv.gz HTTP/1.1" 200 13609295 "-" "{}"

I've banned the {} user agent now as well.

lonvia commented 1 year ago

To add a datapoint that might help tracking down the issue: the first mass downloads happened around the 9th December.

leonardehrenfried commented 1 year ago

First of all, I'm sorry this image is causing you grief. It very much looks like this image is the cause of your download bill.

It's safe to assume that an automated process is downloading those files over and over.

The thing is that we have several million downloads, and it would be easy for someone to accidentally or intentionally write a script that does this.

I'm thinking hard about whether we have such a process running somewhere that would do that.

Can you say something about the IP addresses that these requests originate from? Is it always the same one?

lonvia commented 1 year ago

There are many IPs (about a dozen in the last two weeks) and they are all over the place: Amazon, Azure, Hetzner, some providers in Australia, Austria, Hong Kong... So I doubt it's a single user's private script causing the problems. Something about the way the image is used means that the init.sh script gets executed over and over again. I can't see any obvious errors in this repo, but then again I don't know enough about Docker to understand the intricate details of error handling.

I understand that this may not necessarily be your fault and that you can't control your users. But sadly, the fact is also that I can't permit you to use the downloads from nominatim.org until we've either gotten to the bottom of this or put measures in place to prevent repeated downloads.

Note that the downloads happen in such quick succession that I'm pretty sure it's only init.sh (or maybe start.sh) that's run in a loop, not the entire container setup. So one thing you could try is to save the ETags of the downloaded files and then have curl make conditional requests.
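
A minimal sketch of that suggestion, assuming a curl with the built-in ETag options (--etag-compare/--etag-save); the file name and URL are only illustrative, not taken from init.sh:

    # Cache the ETag of the previous download and let the server answer
    # "304 Not Modified" instead of re-sending several hundred megabytes.
    DATA_FILE=/nominatim/data/wikimedia-importance.sql.gz
    ETAG_FILE="${DATA_FILE}.etag"

    touch "$ETAG_FILE"    # some curl versions error if the compare file is missing
    curl -fsSL \
      --etag-compare "$ETAG_FILE" \
      --etag-save "$ETAG_FILE" \
      -o "${DATA_FILE}.new" \
      https://nominatim.org/data/wikimedia-importance.sql.gz

    # Only replace the old copy when the server actually sent new data.
    if [ -s "${DATA_FILE}.new" ]; then
        mv "${DATA_FILE}.new" "$DATA_FILE"
    else
        rm -f "${DATA_FILE}.new"    # 304: keep the existing file
    fi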

If you haven't done so yet, you might also want to add a test to the CI to check that touch ${IMPORT_FINISHED} really works.

leonardehrenfried commented 1 year ago

I totally understand, and I don't blame you for locking the user agent out. Once I get back to the office I will take a look and see if I can spot something.

philipkozeny commented 1 year ago

Same here. We'll look through the scripts and see if we spot something.

FYI: We don't run anything automated on our side, apart from the GitHub Actions on every commit (which don't happen that often).

It really looks like a container that's stuck on import and just reruns all the time.

lonvia commented 1 year ago

I've been thinking about possible mitigations and two things come to mind:

leonardehrenfried commented 1 year ago

@lonvia I've opened a PR that sets a proper user agent (#417), so be prepared to start seeing "mediagis/nominatim-docker:4.2.0" as a user agent.

The same PR also uses --fail-with-body for all curl commands in init.sh.
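
For context, --fail-with-body makes curl exit with an error on HTTP status codes of 400 and above while still writing the response body, so a blocked or failed request aborts the script instead of being saved as if it were the data file. A rough example (the URL variable is just a placeholder):

    # With a banned user agent the server answers 403; curl then exits non-zero,
    # and a `set -e` script (or an explicit check) stops here instead of looping
    # on with a bogus download saved to disk.
    curl --fail-with-body -o gb_postcodes.csv.gz "$POSTCODES_URL" || exit 1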

I'm not super hopeful, but maybe that helps.

lonvia commented 1 year ago

Our inconsiderate users have faithfully downloaded and deployed your latest version, and I now see "mediagis/nominatim-docker:4.2.0" doing mass downloads. After blocking the user agent, only the request for the wikidata file remained. So, good news: the --fail-with-body parameter is doing its job.

This behaviour makes me think that the init script was previously failing for these users at a different place. PRs #400 and #402 were the last ones committed shortly before the bad behaviour started. Is there anything suspicious in there that might cause Docker to think that the command has failed?

sde-melo commented 1 year ago

Why not write a link to this issue into the output on failure, so that everybody can check their own process? (And maybe encourage people to use a custom user agent.) Sorry if this has already been done.

lonvia commented 1 year ago

I now have feedback from a user who says that he ran into a disk-full issue, which eventually triggered the repeated downloads. So maybe it wasn't a change to the image that triggered the issue en masse but simply the OSM data size going over a threshold and starting to cause trouble. I will try to get more information.

leonardehrenfried commented 1 year ago

Thanks for the update!

ryan-morris commented 1 year ago

> I now have feedback from a user who says that he ran into a disk-full issue, which eventually triggered the repeated downloads. So maybe it wasn't a change to the image that triggered the issue en masse but simply the OSM data size going over a threshold and starting to cause trouble. I will try to get more information.

So unfortunately, I don't have the log files any longer. The environment variables were configured to download gb/us postcodes + tiger data, so what appeared to be happening was:

  1. download gb post codes
  2. download us post codes
  3. download tiger data
  4. attempt to download pbf

During step 4 the disk space was exhausted, the container would crash and automatically restart, kicking off step 1 again and looping indefinitely. On each restart it overwrote the same three files, so I assume the full re-download succeeded without errors, since the space required was the same as for the copies already on disk; that kept the gb postcode download from erroring out on subsequent passes.

I'm really sorry for having contributed to the issue and the excess bandwidth usage. If there's any more information I can provide or anything you want me to test, please let me know.

leonardehrenfried commented 1 year ago

Did you start the container with something like restart=always or in a systemd unit that restarts automatically?

This image doesn't do any retries or automatic restarts.

ryan-morris commented 1 year ago

> Did you start the container with something like restart=always or in a systemd unit that restarts automatically?
>
> This image doesn't do any retries or automatic restarts.

I went back and looked and yes, a docker-compose file had been used with the following:

    deploy:
      replicas: 1
      restart_policy:
        condition: always

So it seems like this was triggered more by a "misconfiguration" on our side. The only thing that could have helped alleviate the issue would have been not re-downloading the postcode + tiger data files, since they already existed in the directory. Had the postcodes or tiger data failed to download completely, though, skipping the re-download would have resulted in an error later during import. And it would not have stopped the pbf from at least partially downloading over and over until disk space ran out again.
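
For anyone running into the same loop: a bounded restart policy avoids the endless re-download. A sketch with the plain Docker CLI (image tag and options are only illustrative):

    # Retry a crashed container at most three times instead of forever; a failed
    # import then stays down rather than re-fetching the data files in a loop.
    docker run -d --restart=on-failure:3 --name nominatim mediagis/nominatim:4.2

In a compose file the rough equivalent would be a restart_policy with condition: on-failure and a max_attempts limit.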

philipkozeny commented 1 year ago

@leonardehrenfried @lonvia

In order to address the concerns raised here, I'd like to discuss the following approach:

  1. Enforce setting a custom user-agent when utilizing the image: This will help mitigate potential issues like the one mentioned, allowing requests to be more easily tracked and blocked if necessary.
  2. Implement a check to skip file downloads if they already exist: This could save time during the build process and prevent redundant downloads. We should also account for partially downloaded or corrupted files, possibly by verifying file integrity before skipping the download (see the sketch after this list).
  3. Reconsider including all data files in the base image: My concern here is the increased image size; on the other hand, shipping the files could lead to more stable builds and reduce the risk of broken files. We should weigh the pros and cons before making a decision, since the image size would increase for everyone.
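
A rough sketch of what point 2 could look like in init.sh (the file path, URL variable and integrity check are placeholders, not the actual script):

    # Skip the download when an apparently complete copy is already present;
    # a stricter version would verify a published checksum instead of only
    # testing that the gzip stream is intact.
    DATA_FILE=/nominatim/data/gb_postcodes.csv.gz
    if [ -f "$DATA_FILE" ] && gzip -t "$DATA_FILE" 2>/dev/null; then
        echo "Reusing existing $DATA_FILE, skipping download"
    else
        curl --fail-with-body -L -o "$DATA_FILE" "$POSTCODES_URL"
    fi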

I believe implementing the custom user-agent requirement is a good starting point, and we can continue the discussion on how to handle data file downloads more efficiently if the problem arises again. What are your thoughts on this approach?

leonardehrenfried commented 1 year ago

I think forcing users to set a user agent is not too much to ask if they want to use the downloads from nominatim.org. I would be happy with this approach.

We can throw a descriptive error when someone requests the downloads but hasn't set their own agent.

lonvia commented 1 year ago

People are effectively already setting their own user agent, because the default user agent is blocked for all versions. Forcing a custom user agent just means more work for me, because this blanket block does not work anymore.

The only permanent solution is 2) or 3).

Or you make sure that the image refuses an automatic restart when installation fails.
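
A sketch of that last option, assuming a sentinel file written at the start of the import (the paths and the IMPORT_STARTED name are illustrative; the image already uses an IMPORT_FINISHED marker):

    # Refuse to run the download/import a second time if a previous attempt
    # started but never finished; a restart loop then produces one clear error
    # instead of repeated multi-gigabyte downloads.
    IMPORT_STARTED=/nominatim/import-started      # hypothetical sentinel
    IMPORT_FINISHED=/nominatim/import-finished    # path illustrative

    if [ -f "$IMPORT_STARTED" ] && [ ! -f "$IMPORT_FINISHED" ]; then
        echo "A previous import attempt failed; refusing to download the data again." >&2
        echo "Remove $IMPORT_STARTED to retry manually." >&2
        exit 1
    fi

    touch "$IMPORT_STARTED"
    # ... download and import steps ...
    touch "$IMPORT_FINISHED"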

philipkozeny commented 1 year ago

If that's the case, I'll go ahead and start a PR for 2). Would you like us to drop the custom user agent in that PR and instead consistently identify ourselves with mediagis/nominatim-$version again?

lonvia commented 1 year ago

I'd say leave it in for now. It is an okay workaround for people who know what they are doing. I haven't had problems with custom user agents, which indicates that people who know how to change the user agent are not the kind of people who thoughtlessly set their Docker container to auto-restart.

lonvia commented 1 year ago

Ping me when the changes are released and I can remove the user-agent ban for the newest version to see how it goes.

lonvia commented 1 year ago

It's been half a year and nothing has changed. I still have to regularly block users of your Docker image. It's really no help when they dutifully change the user agent to "s***myboils" and then proceed to download terabytes of data from the server because they don't know how to operate Docker.

There are obvious solutions to the problem: either ship the image with the files or introduce a flag that makes sure initialisation is never run twice. Or find somebody else to host the files, but please stop abusing nominatim.org.

leonardehrenfried commented 1 year ago

I'm sorry, @lonvia.

I don't have much to say in our defense other than that it's impossible to detect whether a container image has been used before, because containers intentionally provide a clean slate.

I think we will have to simply disable this feature until we have figured out what to do.

leonardehrenfried commented 1 year ago

I've ordered a Hetzner Storage Box; I will rehost the files there and change our download URLs.

leonardehrenfried commented 1 year ago

I have just merged #460, which fetches the files from a rehosted file server. I hope this will drastically reduce the cases of this happening. I'm so sorry to be causing you grief, @lonvia.

leonardehrenfried commented 11 months ago

@lonvia it's been a few months since I rolled out the mirrored downloads. Has the situation improved?

lonvia commented 11 months ago

It's better. Oddly enough, there is still the occasional loop downloader, probably somebody using an outdated Docker container. Not much we can do about that, I'm afraid.

leonardehrenfried commented 5 months ago

And how are we doing these days? Can I close this ticket, @lonvia?

lonvia commented 5 months ago

There are one or two hopeless cases, but those are beyond your control. Feel free to close.