healthyregions / oeps

Opioid Environment Policy Scan - data explorer and backend management
https://oeps.healthyregions.org
GNU General Public License v3.0

Upload JCOIN package output to S3 #33

Closed by mradamcox 1 week ago

mradamcox commented 1 week ago

Using the existing jcoin create-data-package command, add a new flag that uploads the zipped output (created when the --zip argument is used) directly to S3. Follow the pattern for S3 uploads in the clients/census.py operations. Use the same S3 bucket, and place the file within the /oeps/ prefix.
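A minimal sketch of what the upload step could look like, assuming boto3 is the S3 client underneath (as is typical for these helpers). The function names, bucket argument, and key scheme below are illustrative, not taken from the repo:

```python
from pathlib import Path

# "oeps" prefix comes from the issue description; everything else is illustrative.
DEFAULT_PREFIX = "oeps"

def build_s3_key(zip_path: str, prefix: str = DEFAULT_PREFIX) -> str:
    """Build the object key: the /oeps/ prefix plus the zip's filename."""
    return f"{prefix}/{Path(zip_path).name}"

def upload_zip_to_s3(zip_path: str, bucket: str) -> str:
    """Upload the zipped package output to S3 and return the key it was stored under."""
    import boto3  # imported lazily so the module loads without boto3 installed
    key = build_s3_key(zip_path)
    boto3.client("s3").upload_file(zip_path, bucket, key)
    return key
```

Separating key construction from the network call keeps the naming logic easy to test without touching S3.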

bucketteOfIvy commented 1 week ago

I used a (very lightly modified) version of the upload_to_s3 function from utils.py in order to follow the pattern from the census.py upload. The main potential problem I see is that the key for the S3-hosted JCOIN data depends on the destination the JCOIN data is being sent to, so most direct ways of adding a purely hardcoded key variable would break the other usage of upload_to_s3.

I'm not sure that we do want a hardcoded name, but if we do, there are a few possible approaches I see (although they all feel sufficiently far afield of the direct task that I'd want feedback before implementing any of them):

  1. _Overhaul upload_to_s3 to handle the cases where it is passed a single path versus a list of paths separately._ This is doable but likely to result in a hard-to-read function.
  2. _Create a second S3 util function for handling single-file uploads._ This is also doable with minor refactoring; the current upload_to_s3 could be renamed upload_files_to_s3 to differentiate the two functions. This likely results in the most readable code, so I lean toward it.
  3. _Handle this S3 interaction in clients/jcoin.py directly._ This is plausible but would run counter to the current organization of the Flask CLI.

mradamcox commented 1 week ago

Ok, I think we should handle this by getting more opinionated about where the exports go and how they are named. For now:

Default behavior should be to overwrite an existing export with the same name (i.e. from the same day). Or, with a little more work, default behavior could check whether the output directory exists and prompt the user to confirm the overwrite, while an --overwrite flag would skip that prompt. I don't see that as crucial, but it may be nice to add. The prompt could go in utils and be reused, etc.

For upload_to_s3, let's modify it to accept either one path or a list of paths, using isinstance(paths, list) to check and turn a single path into a list.
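The normalization step suggested here could be sketched like this; the signature of upload_to_s3 and the key scheme are assumptions, not the actual utils.py code:

```python
from pathlib import Path

def normalize_paths(paths):
    """Accept a single path or a list of paths; always return a list."""
    if not isinstance(paths, list):
        paths = [paths]
    return paths

def upload_to_s3(paths, bucket, prefix="oeps"):
    """Hypothetical sketch of the modified utils.upload_to_s3."""
    import boto3  # imported lazily so the module loads without boto3 installed
    client = boto3.client("s3")
    keys = []
    for p in normalize_paths(paths):
        key = f"{prefix}/{Path(p).name}"
        client.upload_file(str(p), bucket, key)
        keys.append(key)
    return keys
```

With this change, both the census and jcoin call sites can use the same function whether they pass one path or many.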

Overall, do you think this approach would work?

bucketteOfIvy commented 1 week ago

Added the default destination bit and created an --overwrite flag! Current behavior is that if there is overwrite risk (i.e. the destination exists and is non-empty), the user is prompted.