tnc-ca-geo / animl-ingest

Lambda function for processing camera trap images

Gracefully handling connection interruptions while uploading zip files #49

Closed nathanielrindlaub closed 1 year ago

nathanielrindlaub commented 1 year ago

In a couple of cases I tried to simulate a wifi connection interruption: I started an upload, then turned off my wifi before the front end had fully uploaded the file. If I turned the wifi back on with the Animl browser window and the upload dialog still open, the upload picked up where it left off and the batch was processed successfully (which is great). But if I closed the browser window, the Batch record got created in the DB and the batch stack got created, yet the file was never unzipped because the full thing never arrived. Eventually the stack was torn down (I'm guessing by the scheduled deletion), but the Batch record never received a processingEnd datetime, so when I checked the next day it looked from the front end like it was still processing.

I guess that leads me to two questions:

1. Does it make sense (or is it even possible) to wait until the zip has been completely uploaded to the S3 ingestion bucket before triggering ingest-zip?

2. If a batch stack gets deleted due to a scheduled run (i.e., something went wrong and the batch didn't get torn down by the CloudWatch metrics alarm), can we update the batch record in the DB with processingEnd?

ingalls commented 1 year ago

@nathanielrindlaub Unfortunately the root cause here is likely beyond our control: the event is fired when S3 decides that it has the entire file, so initial processing of the batch is tied to S3's understanding of when a file is "done".
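
For illustration only (this is not the actual ingest-zip handler), a minimal sketch of what that trigger looks like on our side; S3 only delivers an `s3:ObjectCreated:*` record once it considers the object complete, so there is no earlier hook for a partially transferred upload:

```js
// Minimal sketch, not the real handler: the Lambda is only ever invoked with
// an s3:ObjectCreated:* event, which S3 emits after it considers the upload
// (including a multipart upload) complete.
exports.handler = async (event) => {
  for (const record of event.Records) {
    const bucket = record.s3.bucket.name;
    // Object keys arrive URL-encoded in S3 event notifications
    const key = decodeURIComponent(record.s3.object.key.replace(/\+/g, ' '));
    console.log(`S3 reports a completed object: s3://${bucket}/${key}`);
    // ...download and process the zip from here...
  }
};
```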

If we wanted to reach an error state faster, the most likely approach would be to "test"-parse the zip, making sure it is valid and readable before going on to create the stack and actually process its contents.

Right now we download the "zip", create the stack, and then once the stack is created we open the zip for reading.

In a subsequent phase we could add the pre-parsing to avoid the stack having to be created on bad zips.
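
A minimal sketch of what that pre-parse step could look like, assuming the zip has already been downloaded to local disk and using the `yauzl` library; the helper names here are illustrative, not the repo's actual API:

```js
const yauzl = require('yauzl');

// Hypothetical pre-parse step: try to read the zip's central directory before
// creating the batch stack. A truncated upload will typically fail here with an
// "end of central directory record signature not found" style error.
function validateZip(localPath) {
  return new Promise((resolve, reject) => {
    yauzl.open(localPath, { lazyEntries: true }, (err, zipfile) => {
      if (err) return reject(err); // unreadable / truncated zip
      zipfile.close();
      resolve(true);               // central directory parsed OK
    });
  });
}

// Sketch of the reordered flow discussed above:
//   download zip -> validateZip -> create stack -> process entries
async function ingest(localPath) {
  try {
    await validateZip(localPath);
  } catch (err) {
    // Fail fast: surface the error instead of standing up a stack for a bad zip
    throw new Error(`Zip failed pre-parse, skipping stack creation: ${err.message}`);
  }
  // ...existing behavior: create the CloudFormation stack, then unzip...
}
```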

nathanielrindlaub commented 1 year ago

OK, roger that. I think the pre-parsing would be a nice-to-have, so I'm happy to kick that to a future phase.

What do you think about question 2, though? How big of a lift would it be for the ingest-delete Lambda to make an updateBatch call to the API to add processingEnd if a stack is getting deleted via a scheduled run?

ingalls commented 1 year ago

@nathanielrindlaub Just looking at the codebase: it looks like this call is already present (https://github.com/tnc-ca-geo/animl-ingest/blob/master/ingest-delete/task.js#L48-L67), but it will only be made if the CloudFormation delete is successful.
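
If we did want processingEnd recorded even when the delete is triggered by the scheduled run (or the delete itself fails), one option would be to move the call out of the success-only path, e.g. into a `finally`. This is purely a sketch; `updateBatch` below stands in for the existing API call linked above, and the wiring is illustrative:

```js
const {
  CloudFormationClient,
  DeleteStackCommand,
  waitUntilStackDeleteComplete,
} = require('@aws-sdk/client-cloudformation');

// Sketch only: `updateBatch` is a stand-in for the existing API call in
// ingest-delete/task.js#L48-L67; the exact wiring here is illustrative.
async function deleteBatchStack(stackName, batchId, updateBatch) {
  const cfn = new CloudFormationClient({});
  try {
    await cfn.send(new DeleteStackCommand({ StackName: stackName }));
    await waitUntilStackDeleteComplete(
      { client: cfn, maxWaitTime: 900 },
      { StackName: stackName }
    );
  } finally {
    // Record processingEnd even for scheduled deletions or failed deletes,
    // so the front end stops showing the batch as "still processing".
    await updateBatch({ _id: batchId, processingEnd: new Date().toISOString() });
  }
}
```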

Are you seeing this consistently not populated, or was it a one-off?