openaq / openaq-fetch

A tool to collect data for the OpenAQ platform.
MIT License

Support regional fetchers #1065

Open caparker opened 1 year ago

caparker commented 1 year ago

In order to support regional fetchers we would need to make a few minor updates:

  1. Update the CDK to deploy the lambda to a different region and create a queue in that region (see the CDK sketch after this list).
  2. The scheduler is currently set up to get the QUEUE_NAME from the env variables. We would keep this as a backup but then allow the deployments to provide a preferred QUEUE_NAME.
  3. The deployments are deployed along with the rest of the stack, and since we would need information (QUEUE_NAME) from the deployment, we would either need to move the deployment config to the manager API (the long-term solution) or do that part independently of the CDK deployment.
  4. Transferring data from one region to another would also require some updates (sketched after the related links below). This would just require knowing the region our bucket is in and the region the lambda is currently running in, so it doesn't seem like it would be too much work.
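
A minimal sketch of items 1 and 2, assuming the aws-cdk-lib v2 API; the stack/construct names, asset path, and the ap-northeast-1 example are made up for illustration, not the actual openaq-fetch stack. The idea is simply the same stack instantiated per region, each with its own queue, with the queue name passed into the Lambda environment so the scheduler only falls back to the global QUEUE_NAME when no deployment-specific one is provided:

```typescript
import { App, Stack, StackProps, Duration } from 'aws-cdk-lib';
import * as sqs from 'aws-cdk-lib/aws-sqs';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import { Construct } from 'constructs';

class RegionalFetcherStack extends Stack {
  constructor(scope: Construct, id: string, props: StackProps) {
    super(scope, id, props);

    // One queue per region; its name is handed to the fetcher lambda so the
    // scheduler can prefer it over the global QUEUE_NAME env variable (item 2).
    const queue = new sqs.Queue(this, 'FetcherQueue');

    new lambda.Function(this, 'Fetcher', {
      runtime: lambda.Runtime.NODEJS_18_X,
      code: lambda.Code.fromAsset('dist'),   // assumed build output path
      handler: 'index.handler',              // assumed handler name
      timeout: Duration.minutes(5),
      environment: {
        // Preferred, deployment-specific queue; the scheduler would still
        // fall back to the process-level QUEUE_NAME when this is not set.
        QUEUE_NAME: queue.queueName,
      },
    });
  }
}

const app = new App();
// The same stack instantiated in two regions, as in the linked CDK question.
new RegionalFetcherStack(app, 'FetchersUsEast1', { env: { region: 'us-east-1' } });
new RegionalFetcherStack(app, 'FetchersApNortheast1', { env: { region: 'ap-northeast-1' } });
```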

Some related issues and info:
- https://stackoverflow.com/questions/73780913/how-to-deploy-the-same-stack-across-multiple-regions-using-aws-cdk
- https://docs.aws.amazon.com/sns/latest/dg/sns-cross-region-delivery.html
- https://stackoverflow.com/questions/49707489/how-to-upload-the-file-under-different-region-of-aws-s3-bucket-using-python
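
On item 4, a minimal sketch assuming the AWS SDK v3 S3 client; the bucket name, key, and helper are placeholders, not the fetcher's actual upload code. The only regional detail the lambda needs is to construct the client with the bucket's region rather than its own:

```typescript
import { S3Client, PutObjectCommand } from '@aws-sdk/client-s3';

const BUCKET_REGION = 'us-east-1';          // region the shared bucket lives in
const s3 = new S3Client({ region: BUCKET_REGION });

// Upload a fetch result to the shared bucket, regardless of which region
// this lambda is currently running in.
export async function putFetchResult(body: string): Promise<void> {
  await s3.send(new PutObjectCommand({
    Bucket: 'example-fetches-bucket',       // placeholder bucket name
    Key: `realtime/${new Date().toISOString()}.ndjson`,
    Body: body,
  }));
}
```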

russbiggs commented 1 year ago

In the proposal, do you envision that each regional deployment would place files in a regional bucket or always place them in a single bucket? I imagine there could be some significant cross-region costs for putting objects into another region's bucket, e.g. eu-west-1 -> us-east-1. It'll be key to figure out the most cost-effective way to get everything into the same region; it'll just be a matter of figuring out when that transfer should happen.

caparker commented 1 year ago

We would likely want to look deeper into the cost, but based on my quick look it comes down to this:

A typical file from Japan is about 350K (which could be reduced, but more on that later), which at the $0.02/GB transfer rate would cost about 0.0007 cents per file.

Creating each file typically takes 90 to 260 sec, usually around 150 sec, and at the typical rate that's about $0.0025 per file, or about 350 times the transfer cost.

So if we were to create the same-size file in Tokyo, do it in 15 sec instead of 150 sec, and then transfer it to us-east-1, we would be spending about $0.0002512 vs $0.0025, or roughly 10x less.
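
For reference, a quick back-of-envelope check of those per-file numbers; the $0.02/GB transfer price and the per-second Lambda rate at 1 GB memory are assumptions pulled from standard public pricing, not from our billing, so the last digits differ slightly from the figures above:

```typescript
const fileBytes = 350_000;            // ~350K file from Japan
const transferPerGB = 0.02;           // USD per GB, cross-region transfer
const lambdaPerSec = 0.0000166667;    // USD per second, assuming 1 GB memory

const transferCost = (fileBytes / 1e9) * transferPerGB;  // ≈ $0.000007 (0.0007 cents)
const computeNow = 150 * lambdaPerSec;                   // ≈ $0.0025 per file
const computeTokyo = 15 * lambdaPerSec + transferCost;   // ≈ $0.00026 per file

console.log(transferCost, computeNow, computeTokyo);     // roughly 10x cheaper
```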

Scale is important here though: improving this for just one fetcher would save us about $0.055/day, and therefore it would take a while to recoup our costs. But if we could trim seconds off of all the lambdas I could see this being a big deal. Or if we were tying up someone's connection because transfer rates were so slow.

And finally, we could also reduce the file size for Japan. Right now only about 10% of a given file is new data, so we could reduce costs if we optimized the file size a bit more.
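
One possible shape for that optimization, as a sketch only; the field names, key format, and helper are invented for illustration and are not the actual openaq-fetch data model. The idea is to filter out measurements already uploaded in a previous run so the file written to S3 carries only the ~10% of new data:

```typescript
// Hypothetical measurement shape for illustration.
interface Measurement {
  location: string;
  parameter: string;
  date: string;      // ISO timestamp
  value: number;
}

// Keep only measurements not seen in earlier runs, mutating the seen-set so
// it can be persisted for the next invocation.
export function onlyNewMeasurements(
  fetched: Measurement[],
  alreadySeen: Set<string>,   // keys carried over from the previous run
): Measurement[] {
  return fetched.filter((m) => {
    const key = `${m.location}|${m.parameter}|${m.date}`;
    if (alreadySeen.has(key)) return false;
    alreadySeen.add(key);
    return true;
  });
}
```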