ncopenpass / CampaignFinanceDataPipeline

Data Pipeline for NC Campaign Finance Dashboard
Apache License 2.0

Update the Import module to actually download the raw files from AWS #10

Open ChrisTheDBA opened 3 years ago

davidpeckham commented 3 years ago

I'd like to take this issue. Boto3 looks straightforward, but I'll need credentials for an S3 user with "programmatic access".
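
For what it's worth, here is one way to wire up those credentials for local testing. This is only a sketch: the key values are placeholders, and boto3 will also read credentials from ~/.aws/credentials or from environment variables without any code changes.

import boto3

# Explicit credentials shown only for illustration; in practice boto3
# will find them in ~/.aws/credentials or in the AWS_ACCESS_KEY_ID /
# AWS_SECRET_ACCESS_KEY environment variables automatically.
session = boto3.Session(
    aws_access_key_id="AKIA...",        # placeholder
    aws_secret_access_key="wJal...",    # placeholder
)
s3_resource = session.resource('s3')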

davidpeckham commented 3 years ago

This copies everything in the bucket. If we only need a subset of the files, perhaps we put that subset in a separate bucket, or add filtering here (see the prefix-filtering sketch after the script).

I tested this on my own S3 storage and an IAM user with AmazonS3ReadOnlyAccess.

$ pip install boto3

import boto3
from pathlib import Path

BUCKET_NAME = "nc-campaign-finance-storage"
LOCAL_DIR = Path.cwd() / 'data'

s3_resource = boto3.resource('s3')
bucket = s3_resource.Bucket(BUCKET_NAME)
for obj in bucket.objects.all():
    # obj is an ObjectSummary; .Object() fetches the full Object so we
    # can read content_length for the size check below
    s3_file = obj.Object()
    local_file = LOCAL_DIR / s3_file.key
    # skip files we already have, comparing sizes to catch partial downloads
    if local_file.exists() and local_file.stat().st_size == s3_file.content_length:
        print(f'{s3_file.key} already downloaded')
        continue
    # the key may contain prefixes, so create any missing subdirectories
    local_file.parent.mkdir(parents=True, exist_ok=True)
    s3_file.download_file(str(local_file))
    print(f'{s3_file.key}')

print("Done")
ChrisTheDBA commented 3 years ago

The change needs to be dynamic, downloading any and all files not already present in the Docker image (a static list of files is not sufficient), and it will require elevated privileges in the form of AWS secrets.
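
A minimal sketch of how the import could pick up those secrets at runtime, assuming boto3's standard environment-variable lookup (nothing here is project configuration yet):

import os
import boto3

# boto3 resolves AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY from the
# environment on its own, so the secrets never have to be baked into
# the Docker image; they can be passed at `docker run` time with -e.
for var in ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"):
    if not os.environ.get(var):
        raise RuntimeError(f"{var} is not set; AWS secrets are required")

s3_resource = boto3.resource('s3')  # picks up the env credentials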