pelias / docker

Run the Pelias geocoder in docker containers, including example projects.
MIT License

Download OpenAddresses information via S3 fails due to buffer size #266

Closed JeffBolle closed 2 years ago

JeffBolle commented 2 years ago

When attempting to download the OpenAddresses data via S3 with requester pays, using the following config:

"openaddresses": {
  "dataHost": "s3://data.openaddresses.io",
  "s3Options": "--request-payer",
  "datapath": "/data/openaddresses",
  "files": []
}

The download starts, but errors out after a short amount of time with the following error:

$ pelias download oa
info: [openaddresses-download] Attempting to download all data
error: [openaddresses-download] Failed to download data message=stdout maxBuffer length exceeded, stack=RangeError [ERR_CHILD_PROCESS_STDIO_MAXBUFFER]: stdout maxBuffer length exceeded
    at Socket.onChildStdout (child_process.js:368:14)
    at Socket.emit (events.js:314:20)
    at addChunk (_stream_readable.js:297:12)
    at readableAddChunk (_stream_readable.js:268:11)
    at Socket.Readable.push (_stream_readable.js:213:10)
    at Pipe.onStreamRead (internal/stream_base_commons.js:188:23), code=ERR_CHILD_PROCESS_STDIO_MAXBUFFER, cmd=aws s3 cp s3://data.openaddresses.io/openaddr-collected-global.zip /tmp/202187-1-rcj52w.xrj4s.zip --request-payer

I'm able to separately download the openaddresses files using the s3 client on the box and extract the contents.

Steps to Reproduce

1. Standard Pelias build: git pull, pelias compose pull, etc.
2. Modify pelias.json to have an S3 config for the OpenAddresses data.
3. Run pelias download oa.

Environment (please complete the following information):

$ uname -a
Linux pelias-server-tools-builder 5.11.0-1017-gcp #19~20.04.1-Ubuntu SMP Thu Aug 12 05:25:25 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
$ docker --version
Docker version 20.10.8, build 3967b7d
$ docker-compose --version
docker-compose version 1.25.0, build unknown

missinglink commented 2 years ago

Thanks for the report, we may need to adapt the code to accommodate:

https://stackoverflow.com/questions/23429499/stdout-buffer-issue-using-node-child-process
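For reference, the same error code can be reproduced without AWS at all. The sketch below is an illustration only, not the Pelias code: it assumes a stand-in child process (a small Node script, held in the hypothetical `noisyScript` variable) that writes roughly 200 KB to stdout, and runs it through `execFile` with a small and then a large `maxBuffer`. The helper name `runChild` is also made up for this example.

```javascript
const { execFile } = require('child_process');

// Stand-in for a chatty child process: writes 2000 lines of ~100 bytes to stdout.
const noisyScript =
  "for (let i = 0; i < 2000; i++) process.stdout.write('x'.repeat(100) + '\\n');";

// Hypothetical helper: run the noisy child with a given maxBuffer and report
// the error code (if any) plus how many characters of stdout were captured.
function runChild(maxBuffer) {
  return new Promise((resolve) => {
    execFile(process.execPath, ['-e', noisyScript], { maxBuffer }, (err, stdout) => {
      resolve({ code: err ? err.code : null, captured: stdout.length });
    });
  });
}

// With a 1 KiB limit the buffered stdout overflows and the child is killed.
runChild(1024).then((result) => console.log('small buffer:', result.code));

// With a 10 MiB limit the same output fits and the call succeeds.
runChild(10 * 1024 * 1024).then((result) => console.log('large buffer:', result.code));
```

Raising `maxBuffer` only moves the limit, so it helps when the output size is bounded. For a child whose output grows with the size of the download, avoiding the buffering altogether is likely the safer direction.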

missinglink commented 2 years ago

Here's a link to the command itself: https://github.com/pelias/openaddresses/blob/1b60e9725db90a6e319fb82e4efc310a809eaf81/utils/download_all.js#L46

It's using cp, which shouldn't write the file bytes to stdout. I'm not super familiar with that command when used with --request-payer; is it potentially very verbose in its logging by default?

missinglink commented 2 years ago

The docs suggest the flags should be --request-payer requester. Does that resolve the issue?

missinglink commented 2 years ago

Agh, so by default it displays a progress indicator on stdout. You can disable it with a command like this:

aws s3 cp \
  s3://data.openaddresses.io/openaddr-collected-global.zip /tmp/openaddr-collected-global.zip \
  --request-payer requester \
  --no-progress

So you should be able to proceed without any code changes with:

"s3Options": "--request-payer requester --no-progress"
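
Separately from the config workaround, here is a rough sketch of how a caller could avoid the buffer limit in code by not capturing stdout at all. This is not the actual Pelias implementation or the content of any pull request; `splitOptions` and `runUnbuffered` are hypothetical helpers, and a noisy Node child stands in for `aws s3 cp` so the example runs without AWS credentials.

```javascript
const { spawn } = require('child_process');

// Hypothetical helper: split an s3Options string from pelias.json into argv entries.
function splitOptions(s3Options) {
  return (s3Options || '').split(/\s+/).filter(Boolean);
}

// Hypothetical helper: run a command with stdout discarded and stderr passed
// through, so no output is accumulated in memory however chatty the child is.
function runUnbuffered(command, args) {
  return new Promise((resolve, reject) => {
    const child = spawn(command, args, { stdio: ['ignore', 'ignore', 'inherit'] });
    child.on('error', reject);
    child.on('close', (code) => resolve(code));
  });
}

// How the argument list for the command in this thread could be assembled
// (printed only, not executed here).
const awsArgs = [
  's3', 'cp',
  's3://data.openaddresses.io/openaddr-collected-global.zip',
  '/tmp/openaddr-collected-global.zip',
  ...splitOptions('--request-payer requester --no-progress'),
];
console.log('aws ' + awsArgs.join(' '));

// Stand-in for a very verbose download: about 5 MB written to stdout.
const noisy =
  "for (let i = 0; i < 50000; i++) process.stdout.write('x'.repeat(100) + '\\n');";
runUnbuffered(process.execPath, ['-e', noisy]).then((code) => {
  console.log('exit code:', code);
});
```

Passing arguments as an array to `spawn` also avoids shell quoting concerns, at the cost of losing the captured stdout, which a plain file copy does not need.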

missinglink commented 2 years ago

As a more permanent fix we can either:

missinglink commented 2 years ago

please test the pull request and let us know how you get on: