pelias / openaddresses

Pelias import pipeline for OpenAddresses.
MIT License

download: cURL "collected" files in series #485

Closed missinglink closed 3 years ago

missinglink commented 3 years ago

As per the discussion in https://github.com/pelias/openaddresses/issues/484, this PR changes the "collected" OA downloads (i.e. openaddr-collected-global.zip and openaddr-collected-global-sa.zip) to run in series rather than in parallel.

The reason for this change is that the OA CDN has a "Maximum Connections Per IP" limit of 1.

Prior to this PR, cURL would intermittently receive an HTML file containing the text 503 Service Unavailable. When unzip attempted to open this file, it would fail with the cryptic error message End-of-central-directory signature not found.
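To illustrate the failure mode, here is a minimal sketch of a check that tells a real ZIP archive apart from an HTML error page before it is handed to unzip. It is not part of this PR; `looksLikeZip` and `ZIP_MAGIC` are hypothetical names, and the check relies only on the standard ZIP local file header signature (the bytes `PK\x03\x04`). An empty archive begins with a different signature and would be rejected by this sketch.

```javascript
// Signature that begins every non-empty ZIP archive: "PK\x03\x04".
const ZIP_MAGIC = Buffer.from([0x50, 0x4b, 0x03, 0x04]);

// Hypothetical helper: returns true when the buffer starts with the ZIP signature.
// An HTML body such as "503 Service Unavailable" fails this check.
function looksLikeZip(buffer) {
  return buffer.length >= ZIP_MAGIC.length &&
    buffer.subarray(0, ZIP_MAGIC.length).equals(ZIP_MAGIC);
}
```

Reading the first four bytes of the downloaded file and passing them to a check like this would surface the problem as a download error rather than as an unzip error.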

The positive effect of this PR is that the downloads will now succeed consistently rather than intermittently. The negative effect is that downloads will be slower, since the second file isn't started until the first has completed.

I noticed that the "filtered download" code (i.e. where the user selects only a subset of the OA database) is already using async.series().
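The effect of running downloads in series can be sketched with a small stand-in written in the spirit of async.series(). This is an illustration only, not the code from this PR: `runInSeries` and `fakeDownload` are hypothetical names, and the fake download only simulates a transfer so that the number of simultaneous "connections" can be counted.

```javascript
// Hypothetical stand-in for async.series(): run callback-style tasks one at a time,
// starting each task only after the previous one has called back.
function runInSeries(tasks, done) {
  const results = [];
  function next(index) {
    if (index === tasks.length) { return done(null, results); }
    tasks[index]((err, result) => {
      if (err) { return done(err, results); } // stop at the first failure
      results.push(result);
      next(index + 1);
    });
  }
  next(0);
}

// Track how many simulated downloads are in flight at once.
let active = 0;
let maxActive = 0;

// Hypothetical fake download: completes asynchronously without touching the network.
function fakeDownload(name) {
  return (callback) => {
    active += 1;
    maxActive = Math.max(maxActive, active);
    setImmediate(() => {
      active -= 1;
      callback(null, name);
    });
  };
}

runInSeries([
  fakeDownload('openaddr-collected-global.zip'),
  fakeDownload('openaddr-collected-global-sa.zip')
], (err, files) => {
  // maxActive stays at 1, which fits a limit of one connection per IP.
  console.log(err, files, maxActive);
});
```

Starting both tasks at once instead would raise the in-flight count to 2, which is the situation that triggered the 503 responses.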

Hopefully in the future we can rework this a bit and return to parallel downloads. The financial costs of hosting these downloads at scale can be significant, and abuse is widespread, so I understand the need for the IP limits.

resolves https://github.com/pelias/openaddresses/issues/484