riboseinc / terraform-aws-s3-cloudfront-website

Terraform module for creating a static S3 website with CloudFront with an SSL certificate (e.g., from ACM)
Apache License 2.0
74 stars · 40 forks

Support compressed uploads #31

Open ronaldtse opened 3 years ago

ronaldtse commented 3 years ago

This post gives a good introduction to Lambda-based zip file extraction based on seek, so you don't need to expand the whole archive at once (a Lambda function only has 512 MB of disk space in /tmp by default):

https://alexwlchan.net/2019/02/working-with-large-s3-objects/
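Condensed from the linked post, the core trick is a seekable file-like wrapper over ranged S3 GETs, which zipfile can then read directly without downloading the whole archive (the bucket/key names in the usage comment are hypothetical):

```python
# Sketch of the seekable-S3-object approach from the linked post.
# zipfile only needs read/seek, so ranged GETs are enough.
import io


class S3File(io.RawIOBase):
    """Read-only, seekable view of a boto3 S3 Object using ranged GETs."""

    def __init__(self, s3_object):
        self.s3_object = s3_object
        self.position = 0

    @property
    def size(self):
        return self.s3_object.content_length

    def seekable(self):
        return True

    def readable(self):
        return True

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            self.position = offset
        elif whence == io.SEEK_CUR:
            self.position += offset
        elif whence == io.SEEK_END:
            self.position = self.size + offset
        return self.position

    def tell(self):
        return self.position

    def read(self, size=-1):
        if size == -1:
            # Read everything from the current position to the end.
            range_header = "bytes=%d-" % self.position
            self.seek(0, io.SEEK_END)
        else:
            new_position = self.position + size
            if new_position >= self.size:
                return self.read()
            range_header = "bytes=%d-%d" % (self.position, new_position - 1)
            self.seek(size, io.SEEK_CUR)
        return self.s3_object.get(Range=range_header)["Body"].read()


# Usage (requires AWS credentials; bucket/key are hypothetical):
#   import boto3, zipfile
#   obj = boto3.resource("s3").Object("my-bucket", "uploads/20210401000000.zip")
#   with zipfile.ZipFile(S3File(obj)) as zf:
#       print(zf.namelist())
```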

The idea here builds on top of #30 - we have a bucket with this structure:

|--- uploads
|--- 20210201000000
|     |--- .done
|     |--- index.html
|     |--- foo
|     |     |--- bar.html
|--- 20210301000000
|     |--- .done
|     |--- index.html
|     |--- foo
|     |     |--- bar.html

This work involves the following.

Suppose we wish to upload a 20210401000000.zip.

The process goes:

  1. Upload 20210401000000.zip to /uploads/
  2. Once 20210401000000.zip is uploaded, the Lambda zip extraction function is triggered; it extracts 20210401000000.zip into a local directory, then uploads the content to S3 with the correct MIME types.
  3. When the extraction is done, the Lambda function uploads the /{name}/.done file to mark that the upload is complete.
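A rough sketch of what such an extraction Lambda might look like (bucket layout as in this issue; the function and file names are assumptions, and boto3 is imported lazily so the pure helpers are usable on their own):

```python
# Hypothetical sketch of steps 2-3: extract the uploaded zip and
# re-upload its members with correct MIME types, then write .done.
import mimetypes
import os
import zipfile


def guess_content_type(filename):
    """Best-effort MIME type for an extracted file."""
    ctype, _ = mimetypes.guess_type(filename)
    return ctype or "application/octet-stream"


def release_prefix(zip_key):
    """'uploads/20210401000000.zip' -> '20210401000000'."""
    return os.path.splitext(os.path.basename(zip_key))[0]


def handler(event, context):
    import boto3  # lazy import: only needed inside Lambda

    s3 = boto3.client("s3")
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    zip_key = record["object"]["key"]  # e.g. uploads/20210401000000.zip
    prefix = release_prefix(zip_key)

    # Download to /tmp (fine for small archives; stream via ranged
    # GETs, as in the seekable wrapper above, for large ones).
    local_zip = "/tmp/archive.zip"
    s3.download_file(bucket, zip_key, local_zip)

    with zipfile.ZipFile(local_zip) as zf:
        for member in zf.namelist():
            if member.endswith("/"):
                continue  # skip directory entries
            s3.put_object(
                Bucket=bucket,
                Key="%s/%s" % (prefix, member),
                Body=zf.read(member),
                ContentType=guess_content_type(member),
            )

    # Step 3: mark the upload as complete.
    s3.put_object(Bucket=bucket, Key="%s/.done" % prefix, Body=b"")
```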

Then I am not sure who should do the CloudFront invalidation. There are two choices:

  1. The user performs the invalidation. The user monitors (via polling, e.g. aws s3 ls) until the archive extraction and upload from Lambda are complete, indicated by the presence of the /{name}/.done file. This has the benefit of ensuring that the GHA build flow only succeeds when the deploy actually succeeds.
  2. Another Lambda function performs the CloudFront invalidation when it detects that /{name}/.done has been created. But this way the GHA build flow won't know whether deployment has failed, because the GHA flow would already succeed once the initial zip archive upload completes. Unless, that is, the user also monitors the deployment until the end, by which point the user could ask CloudFront to invalidate anyway.
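From the deploy side, option 1 could be sketched roughly like this (bucket, release name, and distribution ID are hypothetical; the waiter and invalidation calls are standard boto3):

```python
# Sketch of option 1: the deploy script polls for the .done marker,
# then issues the CloudFront invalidation itself.
import time


def invalidation_batch(release, caller_reference=None):
    """Build the CreateInvalidation request body for a release."""
    return {
        "Paths": {"Quantity": 1, "Items": ["/*"]},
        "CallerReference": caller_reference or release,
    }


def wait_and_invalidate(bucket, release, distribution_id):
    import boto3  # lazy import: only needed when actually deploying

    s3 = boto3.client("s3")
    # Poll until the Lambda has written the marker (the ObjectExists
    # waiter retries every 5 s, 20 times by default; tune for big zips).
    s3.get_waiter("object_exists").wait(
        Bucket=bucket, Key="%s/.done" % release
    )
    cloudfront = boto3.client("cloudfront")
    cloudfront.create_invalidation(
        DistributionId=distribution_id,
        InvalidationBatch=invalidation_batch(release, str(time.time())),
    )
```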

Thoughts @phuonghuynh @strogonoff ?

ronaldtse commented 3 years ago

Ping @skalee since this is relevant to you too.

strogonoff commented 3 years ago

This is neat!

strogonoff commented 3 years ago

I’m running tests with S3 uploads to acceleration-enabled and normal S3 buckets, but no strong results yet.

ronaldtse commented 3 years ago

And when the Lambda function uploads to S3, we should use parallel processes for maximum speed (https://github.com/riboseinc/terraform-aws-s3-cloudfront-website/issues/29#issuecomment-789722054).
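That fan-out might look like the following (the upload_one callable is a hypothetical wrapper around s3.upload_file / put_object; the worker count is an assumption):

```python
# Sketch of parallel per-file uploads: S3 uploads are I/O-bound,
# so a thread pool gives a good speedup without multiprocessing.
from concurrent.futures import ThreadPoolExecutor, as_completed


def upload_all(files, upload_one, max_workers=8):
    """Upload files concurrently; re-raise the first failure."""
    done = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(upload_one, f): f for f in files}
        for fut in as_completed(futures):
            fut.result()  # propagate any upload error
            done.append(futures[fut])
    return done
```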

ronaldtse commented 3 years ago

Module creation now located here: https://github.com/riboseinc/terraform-aws-lambda-s3-archive-extract-upload/issues/1

strogonoff commented 3 years ago

> And when the Lambda function uploads to S3, we should use parallel processes for maximum speed (https://github.com/riboseinc/terraform-aws-s3-cloudfront-website/issues/29#issuecomment-789722054).

Just to clarify, the parallel uploads I mentioned only apply in the case where we upload individual files, not in the zipped case. As I mentioned, I believe those are mutually exclusive.

ronaldtse commented 3 years ago

Yes, terraform-aws-lambda-s3-archive-extract-upload#1 Step 3 describes using concurrency when the Lambda function uploads to S3 after unzipping.

phuonghuynh commented 2 years ago

@ronaldtse should I start implementing a Lambda / finding a solution for this one?

ronaldtse commented 2 years ago

@phuonghuynh any solution is fine, as long as it works. Thanks!