Move grouping calculation to lambda-style backend processing

nikolajbaer commented 4 years ago

As we refine the NYCMODA algorithm, it is clear that more processing yields better results. Grouping calculations are infrequent, so the server resources for them should really be provisioned "Just in Time".

Thus I think the best structure would be a minimalistic web frontend (currently on Heroku) that queues up a grouping calculation in an on-demand backend (e.g. AWS Lambda), and puts the user on a landing page where they can await their results (or provide an email address to get notified when the calculation is complete).

Rough thought on this would be to add the following:

An AWS Lambda function or Dockerfile that encapsulates the calculation, taking in the district JSON currently submitted by the website
Add task queue code to server.py so that when a district is submitted, a new task is queued on Lambda and the user is redirected to a waiting page based on that task's ID
Poll on that waiting page until the task is complete and then show the results, or allow the user to submit an email address associated with that task, so that when the task completes it emails the user the resulting CSV output and a link to the "task" detail page of that district.

Probably this will require introducing a backend DB like Redis or Postgres to handle storing the incoming submitted JSON with its task id and email, and generating the landing page.

Some additional considerations:

Evaluate providers; we might be able to get some pro-bono resources for these tasks since this is charitable work, etc.
Probably best to track a user's IP Address and limit them to one task-in-process at a time, to limit overall volume of tasks and prevent any overrun in resources (and also maybe cap the overall tasks allowed in any given timeframe just to have an emergency brake on repeated submissions)
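A minimal sketch of the per-IP limit and the overall cap, just to make the idea concrete. The class name, the default limits, and the in-memory storage are all assumptions for illustration; with more than one web process the state would have to live in a shared store such as Redis or Postgres.

```python
import time


class TaskLimiter:
    """Allow one in-progress task per IP plus a global cap per time window."""

    def __init__(self, max_tasks_per_window=100, window_seconds=3600,
                 clock=time.monotonic):
        self.max_tasks_per_window = max_tasks_per_window
        self.window_seconds = window_seconds
        self.clock = clock
        self.active_by_ip = {}   # ip -> id of the task currently running
        self.recent_starts = []  # start times of accepted tasks

    def try_start(self, ip, task_id):
        """Return True and record the task if this IP may start one now."""
        now = self.clock()
        # Forget starts that have aged out of the window.
        self.recent_starts = [
            t for t in self.recent_starts if now - t < self.window_seconds
        ]
        if ip in self.active_by_ip:
            return False  # this IP already has a task in progress
        if len(self.recent_starts) >= self.max_tasks_per_window:
            return False  # emergency brake: too many tasks overall
        self.active_by_ip[ip] = task_id
        self.recent_starts.append(now)
        return True

    def finish(self, ip):
        """Mark the IP's task as done so it can submit again."""
        self.active_by_ip.pop(ip, None)
```

The server would call try_start before queueing a task and finish once the result has been written.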

nikolajbaer commented 4 years ago

An alternative is to have a secondary process in the Procfile on Heroku that runs the long-running jobs and reads/writes to Postgres or Redis. So far most of these calculations are pretty low on memory usage and just CPU bound, so this should be a workable alternative that might keep the infrastructure needs simple.

nikolajbaer commented 4 years ago

UPDATE: I have started work on this by creating a lambda_function.py

The approach should be to have a configurable lambda that the server.py can POST the incoming district JSON to, with a "key" that defines where the result will go. Then the client can poll that location on the S3 bucket (with the appropriate CORS in place) until it shows up, or someone decides to give up :P
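Roughly, the function side of this could look like the sketch below. This is not the actual lambda_function.py: the event field names ("district", "key"), the optimize placeholder, the handle function, and the default bucket name are assumptions. The storage call is passed in so the sketch runs without AWS credentials; in a real deployment it could be the put_object method of a boto3 S3 client.

```python
import json


def optimize(district):
    """Placeholder for the real grouping calculation (hypothetical)."""
    return {"district": district.get("name"), "groups": []}


def handle(event, put_object, bucket="results-bucket"):
    """Run the optimization and store the result under the caller-chosen key.

    `put_object` is any callable accepting Bucket, Key and Body keyword
    arguments, e.g. the put_object method of a boto3 S3 client.
    """
    key = event["key"]
    result = optimize(event["district"])
    body = json.dumps(result).encode("utf-8")
    put_object(Bucket=bucket, Key=key, Body=body)
    return {"key": key}
```

Because server.py picks the key up front, it can hand the client the final result location immediately and nothing has to be reported back from the function.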

Hopefully will push an update soon and get this on the server, so we can drastically improve the optimizations run, and also reduce the hosting needs for heroku on an ongoing basis.

nikolajbaer commented 4 years ago

This is now deployed and functional.

The lambda function is prepared / uploaded with Dockerfile.lambda

You need to set up a function (default name is mealscount-optimize) and the appropriate user credentials to "Invoke" the function, as well as a "results" bucket. The Vue.js client app then receives the "results_url" from the API optimize call and then polls that location (ignoring the 403s) until it gets a result.

NOTE: one possible improvement is to drop a placeholder file in that location so we aren't hitting 403s as we poll, since red is never nice in the console.
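The polling logic itself is simple. The real client is the Vue.js app, so the Python below is only a sketch of the loop; the function name, the interval, and the attempt limit are assumptions. The fetch callable is injected and returns a (status_code, body) tuple, with 403 treated as "not ready yet".

```python
import time


def poll_for_result(fetch, interval_seconds=5.0, max_attempts=120,
                    pending_statuses=(403,), sleep=time.sleep):
    """Call fetch() until it returns a 200 or the attempts run out.

    Returns the body on success, or None if we give up.
    Any status that is neither 200 nor pending raises an error.
    """
    for attempt in range(max_attempts):
        status, body = fetch()
        if status == 200:
            return body
        if status not in pending_statuses:
            raise RuntimeError(f"unexpected status {status}")
        # Wait before the next try, but not after the final one.
        if attempt < max_attempts - 1:
            sleep(interval_seconds)
    return None
```

With a placeholder file in place, the loop would check the body for a "pending" marker instead of relying on the 403.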

NOTE: to fit a large district like LA within the function "event" invocation limit (256 KB), I had to shoehorn a zipped, base64-encoded version of the district into the JSON data. This should be enough to handle anything (it is only really an issue with LA and NYC, I think), but a future improvement might be to use DynamoDB.
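For reference, the compress-then-encode trick can be sketched with the standard library alone. The function names are hypothetical, and zlib is an assumption about the compression used; the deployed code may use a different format.

```python
import base64
import json
import zlib


def pack_district(district):
    """Serialize, compress and base64-encode a district for the event payload."""
    raw = json.dumps(district).encode("utf-8")
    return base64.b64encode(zlib.compress(raw)).decode("ascii")


def unpack_district(packed):
    """Reverse pack_district on the function side."""
    raw = zlib.decompress(base64.b64decode(packed))
    return json.loads(raw.decode("utf-8"))
```

Base64 adds about a third to the compressed size, but school lists are repetitive enough that the packed string should still come out much smaller than the plain JSON.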

Current run times with a 512 MB lambda function range from ~2 min (San Diego) to ~5 min (LA), with NYCMODA at 50 fresh starts, 1000 iterations (no annealing). This can obviously be scaled up, but users should be warned.

opensandiego / mealscount-backend

Move grouping calculation to lambda-style backend processing #54