Closed by openjck 5 years ago
We should make sure that writes to S3 are atomic. We don't want to serve a partially written file. If the Lambda task fails, we should continue serving the old file.
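For what it's worth, S3's PutObject is itself atomic (readers see either the old object or the complete new one, never a partial write), so the main thing is to build and validate the entire payload before making a single upload call, and to let any failure abort before S3 is touched. A minimal sketch of that pattern, with hypothetical bucket/key names:

```python
import json

def build_payload(data):
    """Serialize and validate the full body up front, so bad data raises before upload."""
    body = json.dumps(data).encode("utf-8")
    json.loads(body)  # sanity check: payload is complete, well-formed JSON
    return body

def publish(s3_client, bucket, key, data):
    # Any exception in build_payload aborts here, leaving the old object untouched.
    body = build_payload(data)
    # A single PutObject is atomic from a reader's perspective.
    s3_client.put_object(Bucket=bucket, Key=key, Body=body,
                         ContentType="application/json")

# Usage (names are placeholders):
#   import boto3
#   publish(boto3.client("s3"), "ensemble-data", "data.json", {"ok": True})
```

Passing the client in also makes the function easy to exercise locally with a stub instead of real AWS credentials.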
I spoke with Blake about this today. We agreed that it doesn't make much sense to spend time setting up Lambda/S3 since DataOps is moving to GCP soon anyway.
Instead, we are going to continue to run this as a JSON server on Heroku for now. Blake will write a simple Lambda task that hits this server and stores the results in S3. ensemble can then hit that S3 bucket instead of hitting the server directly.
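The task Blake described could look roughly like the sketch below (the server URL, bucket, and key are placeholders, not the real ones): fetch the JSON from the Heroku server and only write to S3 if the fetch succeeds and parses, so a failed run leaves the previously cached object in place.

```python
import json
import urllib.request

SERVER_URL = "https://example.herokuapp.com/data.json"  # placeholder
BUCKET = "ensemble-cache"                               # placeholder
KEY = "data.json"                                       # placeholder

def fetch(url):
    """Fetch the JSON body from the server; raise on any non-200 response."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        if resp.status != 200:
            raise RuntimeError(f"server returned {resp.status}")
        return resp.read()

def handler(event, context, s3_client=None, fetch_fn=None):
    """Lambda entry point. s3_client/fetch_fn are injectable for local testing."""
    if s3_client is None:
        import boto3  # deferred so the module imports without AWS deps installed
        s3_client = boto3.client("s3")
    body = (fetch_fn or fetch)(SERVER_URL)  # raises on failure; S3 is never touched
    json.loads(body)                        # refuse to cache a malformed response
    s3_client.put_object(Bucket=BUCKET, Key=KEY, Body=body,
                         ContentType="application/json")
    return {"bytes": len(body)}
```

This keeps the failure mode we want: any error raises before `put_object`, and ensemble keeps reading the old cached file.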
Once DataOps moves to GCP, we can do what this GitHub Issue actually describes by rewriting this as a Cloud Function that writes directly to Google Cloud Storage.
DataOps isn't ready to manage a GCP setup yet, but should be around the end of Q3 / beginning of Q4. I should drive that process myself, rewriting this project around that time, and open a bug against Data Platform and Tools :: Operations when we're ready for DataOps to manage it.
A significant amount of work has already been done on this. In the process of supporting Redash, I refactored some things which will make this work much easier.
Blake has asked me to rewrite this as a Google Cloud Function that writes to S3 (yes, S3—for now anyway) whenever I'm ready. The next step is for me to get that working locally. Once that's done, I can send him my branch and we can work together to get it hosted.
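Since a Cloud Function is just Python, writing to S3 from GCP is mostly a matter of supplying AWS credentials via environment variables. A hypothetical sketch of the entry-point shape (the function names, env var, bucket, and key are placeholders, and `transpose()` stands in for the real transposer logic):

```python
import json
import os

BUCKET = os.environ.get("S3_BUCKET", "ensemble-cache")  # placeholder
KEY = "data.json"                                       # placeholder

def transpose():
    """Stand-in for the real transposer output; returns a JSON-serializable result."""
    return {"status": "ok"}

def main(request, s3_client=None):
    """HTTP-triggered Cloud Function entry point; s3_client is injectable for local runs."""
    if s3_client is None:
        import boto3  # AWS creds come from AWS_ACCESS_KEY_ID etc. in the environment
        s3_client = boto3.client("s3")
    body = json.dumps(transpose()).encode("utf-8")
    s3_client.put_object(Bucket=BUCKET, Key=KEY, Body=body,
                         ContentType="application/json")
    return f"wrote {len(body)} bytes to s3://{BUCKET}/{KEY}"
```

Running it locally against a stub client should be enough to get the branch to a state worth sending to Blake.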
I spoke with Blake about this today. I'm planning to have this done by the end of February. The plan is still to have a Cloud Function that writes to S3.
I've pushed my progress here and I'm currently working on getting a GCP development setup.
Blake recommended using this code as a guide. This is the script that currently caches ensemble-transposer data on S3. It's a good example of using the AWS SDK.
https://gist.github.com/robotblake/04d3e0795a9c254af896ec317cf7a8dc
I've spoken with Frank and Blake about this approach. They both think it would be better than the current approach.
This project could be rewritten as an AWS Lambda function that writes data to S3 on a regular interval (say, every 24 hours). The advantages include:
The downsides include:
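Mechanically, the "every 24 hours" interval would come from a CloudWatch Events (EventBridge) schedule rule rather than from the function code. A hypothetical config sketch, with placeholder names (`transposer-cache`, `daily-cache`) and a truncated ARN:

```shell
# Create a rule that fires once every 24 hours.
aws events put-rule --name daily-cache --schedule-expression "rate(24 hours)"

# Allow CloudWatch Events to invoke the Lambda function.
aws lambda add-permission --function-name transposer-cache \
  --statement-id daily-cache --action lambda:InvokeFunction \
  --principal events.amazonaws.com

# Point the rule at the function (ARN elided).
aws events put-targets --rule daily-cache \
  --targets "Id"="1","Arn"="arn:aws:lambda:...:function:transposer-cache"
```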