pelias / openaddresses

Pelias import pipeline for OpenAddresses.
MIT License

Importer data download fails - OA now requires authentication #484

Closed noalevy closed 3 years ago

noalevy commented 3 years ago

Hi all, I'm trying to import the OA data for a full planet build of Pelias. When executing npm run download, download fails. Looks like the openaddresses project recently added a requirement to authenticate before downloading data, so the URLs the importer tries to access (https://data.openaddresses.io/openaddr-collected-global.zip and https://data.openaddresses.io/openaddr-collected-global-sa.zip) no longer work. Probably the documentation needs to be updated to reflect this, plus it would be great to also include instructions on how to obtain the data (I'm not yet sure how it can be done). Thanks!

missinglink commented 3 years ago

Which command are you using to download the data?

missinglink commented 3 years ago

I'm just wondering if this is the same issue as https://github.com/pelias/openaddresses/issues/482 or whether additional restrictions were put in place since then

noalevy commented 3 years ago

npm run download

noalevy commented 3 years ago

> I'm just wondering if this is the same issue as #482 or whether additional restrictions were put in place since then

I tried adding the referrer as suggested in #482, it didn't work

missinglink commented 3 years ago

Hi @noalevy I had another look at this today and I wasn't able to reproduce 🤷‍♂️ I only used cURL, so it's possible that if these commands work for you, there is a bug in our JS codebase.

curl -s -L -X GET \
  --referer 'https://pelias-results.openaddresses.io' \
  https://data.openaddresses.io/openaddr-collected-global.zip \
    | head -c 40

P;�R�i-�"�
README.txt%
curl -s -L -X GET \
  --referer 'https://pelias-results.openaddresses.io' \
  https://data.openaddresses.io/openaddr-collected-global-sa.zip \
    | head -c 40

P;�R�i-�"�
README.txt%

I can confirm that without the referer header it does return 403 Forbidden.

curl -s -L -X GET \
  https://data.openaddresses.io/openaddr-collected-global.zip \
    | head -c 40

<html><head><title>403 Forbidden</title>%

Can you please try the commands above and let me know if you think there is an error in our codebase? I copied the cURL command from the code, so if there is one, I'm not sure what it could be 🤷‍♂️
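For comparison with the cURL commands, the same request can be sketched in Node. This is only an illustration, not the importer's code: `buildOptions` and `fetchStatus` are hypothetical names, and unlike `curl -L` this sketch does not follow redirects.

```javascript
const https = require('https');

// Hypothetical helper: build request options that carry the Referer header.
function buildOptions(fileUrl, referer) {
  const u = new URL(fileUrl);
  return {
    method: 'GET',
    hostname: u.hostname,
    path: u.pathname + u.search,
    headers: { Referer: referer }
  };
}

// Hypothetical helper: resolve with the HTTP status code only.
function fetchStatus(fileUrl, referer) {
  return new Promise((resolve, reject) => {
    const req = https.request(buildOptions(fileUrl, referer), (res) => {
      res.destroy(); // the body is a very large zip, so stop after the headers
      resolve(res.statusCode);
    });
    req.on('error', reject);
    req.end();
  });
}
```

Calling `fetchStatus(url, 'https://pelias-results.openaddresses.io')` and then the same URL with a different referer should show whether the header makes the difference from Node as well.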

noalevy commented 3 years ago

Thank you @missinglink - using curl, it works with the referrer, thanks!!

I'm really not sure why using npm run download fails for me - I just ran it again. This is the output I get - it looks like it's getting a very quick response which is not a zip file:

2021-07-07T12:16:50.885Z - info: [openaddresses-download] Attempting to download all data
2021-07-07T12:16:50.889Z - debug: [openaddresses-download] downloading https://data.openaddresses.io/openaddr-collected-global.zip
2021-07-07T12:16:50.894Z - debug: [openaddresses-download] downloading https://data.openaddresses.io/openaddr-collected-global-sa.zip
2021-07-07T12:16:51.115Z - debug: [openaddresses-download] unzipping /tmp/202167-25692-1a8kwet.pag.zip to /home/ubuntu/pelias/openaddresses/
2021-07-07T12:16:51.124Z - error: [openaddresses-download] Failed to download data message=Command failed: unzip -o -qq -d /home/ubuntu/pelias/openaddresses/ /tmp/202167-25692-1a8kwet.pag.zip
[/tmp/202167-25692-1a8kwet.pag.zip]
  End-of-central-directory signature not found.  Either this file is not
  a zipfile, or it constitutes one disk of a multi-part archive.  In the
  latter case the central directory and zipfile comment will be found on
  the last disk(s) of this archive.
unzip:  cannot find zipfile directory in one of /tmp/202167-25692-1a8kwet.pag.zip or
        /tmp/202167-25692-1a8kwet.pag.zip.zip, and cannot find /tmp/202167-25692-1a8kwet.pag.zip.ZIP, period.
, stack=Error: Command failed: unzip -o -qq -d /home/ubuntu/pelias/openaddresses/ /tmp/202167-25692-1a8kwet.pag.zip
[/tmp/202167-25692-1a8kwet.pag.zip]
  End-of-central-directory signature not found.  Either this file is not
  a zipfile, or it constitutes one disk of a multi-part archive.  In the
  latter case the central directory and zipfile comment will be found on
  the last disk(s) of this archive.
unzip:  cannot find zipfile directory in one of /tmp/202167-25692-1a8kwet.pag.zip or
        /tmp/202167-25692-1a8kwet.pag.zip.zip, and cannot find /tmp/202167-25692-1a8kwet.pag.zip.ZIP, period.

And this is the debug log:

0 info it worked if it ends with ok
1 verbose cli [ '/usr/bin/node', '/usr/bin/npm', 'run', 'download' ]
2 info using npm@6.14.12
3 info using node@v12.22.1
4 verbose run-script [ 'predownload', 'download', 'postdownload' ]
5 info lifecycle pelias-openaddresses@0.0.0-development~predownload: pelias-openaddresses@0.0.0-development
6 info lifecycle pelias-openaddresses@0.0.0-development~download: pelias-openaddresses@0.0.0-development
7 verbose lifecycle pelias-openaddresses@0.0.0-development~download: unsafe-perm in lifecycle true
8 verbose lifecycle pelias-openaddresses@0.0.0-development~download: PATH: /usr/lib/node_modules/npm/node_modules/npm-lifecycle/node-gyp-bin:/home/ubuntu/openaddresses/node_modules/.bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
9 verbose lifecycle pelias-openaddresses@0.0.0-development~download: CWD: /home/ubuntu/openaddresses
10 silly lifecycle pelias-openaddresses@0.0.0-development~download: Args: [ '-c', './bin/download' ]
11 silly lifecycle pelias-openaddresses@0.0.0-development~download: Returned: code: 1  signal: null
12 info lifecycle pelias-openaddresses@0.0.0-development~download: Failed to exec download script
13 verbose stack Error: pelias-openaddresses@0.0.0-development download: `./bin/download`
13 verbose stack Exit status 1
13 verbose stack     at EventEmitter.<anonymous> (/usr/lib/node_modules/npm/node_modules/npm-lifecycle/index.js:332:16)
13 verbose stack     at EventEmitter.emit (events.js:314:20)
13 verbose stack     at ChildProcess.<anonymous> (/usr/lib/node_modules/npm/node_modules/npm-lifecycle/lib/spawn.js:55:14)
13 verbose stack     at ChildProcess.emit (events.js:314:20)
13 verbose stack     at maybeClose (internal/child_process.js:1022:16)
13 verbose stack     at Process.ChildProcess._handle.onexit (internal/child_process.js:287:5)
14 verbose pkgid pelias-openaddresses@0.0.0-development
15 verbose cwd /home/ubuntu/openaddresses
16 verbose Linux 5.4.0-1044-gcp
17 verbose argv "/usr/bin/node" "/usr/bin/npm" "run" "download"
18 verbose node v12.22.1
19 verbose npm  v6.14.12
20 error code ELIFECYCLE
21 error errno 1
22 error pelias-openaddresses@0.0.0-development download: `./bin/download`
22 error Exit status 1
23 error Failed at the pelias-openaddresses@0.0.0-development download script.
23 error This is probably not a problem with npm. There is likely additional logging output above.
24 verbose exit [ 1, true ]

missinglink commented 3 years ago

oh, that's interesting, I can confirm that it happens for me too:

npm run download

> pelias-openaddresses@0.0.0-development download
> ./bin/download

2021-07-07T12:35:11.655Z - info: [openaddresses-download] Attempting to download all data
2021-07-07T12:35:11.906Z - error: [openaddresses-download] Failed to download data message=Command failed: unzip -o -qq -d /tmp/openaddresses /var/folders/5w/xzyyzb490m36my2j2w4094_m0000gn/T/202167-56394-zpnxlw.nqfo.zip
[/var/folders/5w/xzyyzb490m36my2j2w4094_m0000gn/T/202167-56394-zpnxlw.nqfo.zip]
  End-of-central-directory signature not found.  Either this file is not
  a zipfile, or it constitutes one disk of a multi-part archive.  In the
  ...
ls -lah /var/folders/5w/xzyyzb490m36my2j2w4094_m0000gn/T/202167-56394-zpnxlw.nqfo.zip
-rw-r--r--  1 peter  staff   2.3K Jul  7 14:35 /var/folders/5w/xzyyzb490m36my2j2w4094_m0000gn/T/202167-56394-zpnxlw.nqfo.zip

head -c 40 /var/folders/5w/xzyyzb490m36my2j2w4094_m0000gn/T/202167-56394-zpnxlw.nqfo.zip
<html><head><title>503 Service Unavailab%

It's not a 403 Forbidden, it's a 503 Service Unavailable. I'll do a bit more digging.
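One way to catch this before `unzip` runs is to check whether the downloaded file starts with the ZIP local file header signature (`PK\x03\x04`). A minimal sketch, assuming the archive is non-empty; `looksLikeZip` and `fileLooksLikeZip` are hypothetical helpers, not part of the importer.

```javascript
const fs = require('fs');

// ZIP local file header signature: "PK\x03\x04".
const ZIP_MAGIC = Buffer.from([0x50, 0x4b, 0x03, 0x04]);

// True when the buffer begins with the ZIP signature.
function looksLikeZip(buffer) {
  return buffer.length >= 4 && buffer.subarray(0, 4).equals(ZIP_MAGIC);
}

// Read only the first four bytes of the file and test them.
function fileLooksLikeZip(path) {
  const fd = fs.openSync(path, 'r');
  try {
    const head = Buffer.alloc(4);
    const bytesRead = fs.readSync(fd, head, 0, 4, 0);
    return looksLikeZip(head.subarray(0, bytesRead));
  } finally {
    fs.closeSync(fd);
  }
}
```

An HTML error page such as the 503 response above fails this check immediately, which gives a clearer error than the `End-of-central-directory signature not found` message from `unzip`.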

missinglink commented 3 years ago

It seems to be working intermittently for me: I can literally run the same command, have it fail twice or more, then run it again and it downloads successfully 🤔

I'll reach out to Ian at openaddresses and see if he knows what's going on, since I'm pretty sure it's something on their end.

missinglink commented 3 years ago

I suspect his response will be for us to stop using the v1 API and start using the new v2 API which requires authentication.

noalevy commented 3 years ago

Thanks for taking care of this @missinglink!

missinglink commented 3 years ago

Okay so Ian says there is a Maximum Connections Per IP limit of 1.

Maximum Connections Per IP

The maximum connections per IP setting allows you to limit the number of simultaneous connections each IP can have to our servers. This is useful to prevent users from downloading a large number of files at the same time or to stop DDoS attacks. The limit is applied per server, so if one of our locations has multiple IPs or if the IP is connecting to multiple locations, the limit will apply for each server separately.

So it should be a simple fix with async.eachSeries().

diff --git a/utils/download_all.js b/utils/download_all.js
index a8ae67c..3aa3548 100644
--- a/utils/download_all.js
+++ b/utils/download_all.js
@@ -18,7 +18,7 @@ function downloadAll(config, callback) {

     const dataHost = config.get('imports.openaddresses.dataHost') || 'https://data.openaddresses.io';

-    async.each(
+    async.eachSeries(
       [
         // all non-share-alike data
         `${dataHost}/openaddr-collected-global.zip`,

Can you please try that change locally and let me know how you get on?

I tried it myself and it failed at first but then succeeded. It could be that the Bunny CDN servers take a little while to allow your IP again; I believe this is to prevent DoS attacks. Waiting ~20s seemed sufficient for me.
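If the CDN needs a short cool-down before accepting the same IP again, retrying after a pause is one possible workaround. A minimal sketch, not something the importer does: `retryWithDelay` is a hypothetical helper, and the 20 second default is only the rough figure observed above.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: run an async task, waiting between failed attempts.
async function retryWithDelay(task, { attempts = 3, delayMs = 20000 } = {}) {
  let lastError;
  for (let i = 0; i < attempts; i++) {
    try {
      return await task();
    } catch (err) {
      lastError = err;
      // Pause before the next attempt, but not after the final one.
      if (i < attempts - 1) await sleep(delayMs);
    }
  }
  throw lastError;
}
```

Here `task` would be whatever function performs a single download and rejects on a non-zip response.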

orangejulius commented 3 years ago

Ahhh, a maximum connection limit would make sense. We've had similar reports from Gitter of things working intermittently. eachSeries will likely fix that for everyone.
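The difference between parallel and serial iteration can be illustrated without the `async` library. This is a sketch of the idea only, not how `async.eachSeries` is implemented: `eachInSeries` is a hypothetical helper, and `worker` stands in for a single download.

```javascript
// Hypothetical helper: process items one at a time, so at most one
// worker (for example, one download connection) is active at any moment.
async function eachInSeries(items, worker) {
  for (const item of items) {
    await worker(item); // the next item starts only after this one settles
  }
}

// For contrast, this starts every worker immediately, which is what
// trips a one-connection-per-IP limit.
function eachInParallel(items, worker) {
  return Promise.all(items.map((item) => worker(item)));
}
```

With two URLs, `eachInParallel` opens two connections at once, while `eachInSeries` never has more than one open.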

missinglink commented 3 years ago

Resolved via https://github.com/pelias/openaddresses/pull/485, please pull the latest code/images and try again.

Please comment here if the issue has not been resolved.

Note: you may only have one concurrent OA download in flight from your IP at a time. This may be an issue if you have another terminal open downloading data, or if a colleague is downloading data from the same shared external IP.