openaddresses / machine

Scripts for running OpenAddresses on a complete data set and publishing the results.
http://results.openaddresses.io/
ISC License

scripting manual source downloads #689

Open andrewharvey opened 6 years ago

andrewharvey commented 6 years ago

There are quite a few sources where fresh data has to be downloaded manually. For these, OA provides https://results.openaddresses.io/upload-cache, which caches the upstream files on S3.

This is very time-consuming and means OA always lags behind the upstream source.

What do people think about trying to automate this? I'm thinking of a Node script using https://github.com/GoogleChrome/puppeteer for each source where this is needed.

I'm happy to work on the puppeteer scripts, but we'd need machine to actually run them.

If not, then what do people think about changing https://results.openaddresses.io/upload-cache so that it produces a curl command line you can run, instead of uploading files through the browser?

My workaround for slow upload speeds is to do the upload from a remote server, which means running this script in the browser Console while logged into https://results.openaddresses.io/upload-cache:

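// Build a curl command equivalent to submitting the upload-cache form.
// `file` is the path of the file on the machine where curl will be run.
function curlCommand(file) {
    var form = new FormData(document.querySelector('form[action="https://s3.amazonaws.com/data.openaddresses.io"]'));
    var curl = 'curl -v -X POST';
    for (var pair of form.entries()) {
        // The file input stringifies to "[object File]"; it is replaced below.
        curl += " -F '" + pair[0] + '=' + pair[1] + "'";
    }
    curl += ' https://s3.amazonaws.com/data.openaddresses.io';
    // Point the file field at the local file using curl's @path syntax.
    curl = curl.replace('[object File]', '@' + file);
    // Fill in the literal ${filename} placeholder found in the form fields.
    curl = curl.replace('${filename}', file);
    return curl;
}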
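
The snippet above builds the command by string concatenation, so a field value containing a single quote would break the shell quoting. Below is a rough sketch of a variant that takes the form entries as a plain array, which also makes it testable outside the browser. The names shellQuote and buildCurlCommand are made up for illustration, and it assumes the file input is named "file"; check the actual form before relying on that.

```javascript
// Quote a string for a POSIX shell: wrap it in single quotes and
// escape any embedded single quotes.
function shellQuote(value) {
    return "'" + String(value).replace(/'/g, "'\\''") + "'";
}

// fields:    array of [name, value] pairs from the form
// url:       the form's action URL
// filePath:  path of the file on the machine where curl will run
// fileField: name of the file input (ASSUMPTION: defaults to "file")
function buildCurlCommand(fields, url, filePath, fileField) {
    fileField = fileField || 'file';
    var parts = ['curl', '-v', '-X', 'POST'];
    fields.forEach(function (pair) {
        var name = pair[0];
        var value = name === fileField
            ? '@' + filePath  // curl reads the upload from this path
            : String(pair[1]).split('${filename}').join(filePath);
        parts.push('-F', shellQuote(name + '=' + value));
    });
    parts.push(shellQuote(url));
    return parts.join(' ');
}
```

In the Console this could be called with something like buildCurlCommand(Array.from(new FormData(form).entries()), form.action, 'myfile.zip'), where form is the element selected in the original snippet and 'myfile.zip' is a placeholder path.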