rivernews / media-literacy

Exploring possibilities of technology empowering media literacy

(3/3) Create one-time trigger all historical landing page - fetch all stories #25

Closed: rivernews closed this issue 1 year ago

rivernews commented 2 years ago

Better way to run them all

Reference

Proper Throttling

It'd be best to reuse the Sfn, but limit the number of concurrent Sfn executions; overall we should aim at 5~100 concurrent lambdas, but nothing more. Ideally we can throttle to less than 1 request per 2 seconds.

But to truly keep a low profile, it's best to spread the work across hours, if not days.
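
A minimal sketch of what such a throttled trigger could look like, assuming Python/boto3; the state machine ARN, input shape, and thresholds here are placeholders, not the project's actual values:

```python
import json
import time

import boto3

sfn = boto3.client("stepfunctions")

STATE_MACHINE_ARN = "arn:aws:states:..."  # hypothetical; fill in the real Sfn ARN
MAX_CONCURRENT = 5   # stay at the low end of the 5~100 concurrent-lambda budget
MIN_INTERVAL_S = 2   # <1 request per 2 seconds

def running_count() -> int:
    # Count currently-running executions so we never exceed the cap.
    resp = sfn.list_executions(
        stateMachineArn=STATE_MACHINE_ARN, statusFilter="RUNNING", maxResults=100
    )
    return len(resp["executions"])

def trigger_all(landing_page_keys):
    for key in landing_page_keys:
        # Back off while we're at the concurrency cap.
        while running_count() >= MAX_CONCURRENT:
            time.sleep(30)
        sfn.start_execution(
            stateMachineArn=STATE_MACHINE_ARN,
            input=json.dumps({"s3Key": key}),  # input shape is an assumption
        )
        # The per-start sleep naturally spreads thousands of pages across hours.
        time.sleep(MIN_INTERVAL_S)
```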

Moving forward

The daily cronjob should automatically trigger our new S3-driven pipeline. Any other concerns?
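
For reference, a daily cron trigger can be a plain EventBridge rule. A hedged sketch, with hypothetical rule name and target ARN (the real pipeline entry point may differ):

```python
import boto3

events = boto3.client("events")

RULE_NAME = "media-literacy-daily-landing-fetch"  # hypothetical name
PIPELINE_LAMBDA_ARN = "arn:aws:lambda:..."        # hypothetical entry-point lambda

# Fire once a day; the S3-driven pipeline takes over from there.
events.put_rule(
    Name=RULE_NAME,
    ScheduleExpression="cron(0 14 * * ? *)",  # 14:00 UTC daily
    State="ENABLED",
)
events.put_targets(
    Rule=RULE_NAME,
    Targets=[{"Id": "daily-fetch", "Arn": PIPELINE_LAMBDA_ARN}],
)
# Note: the target lambda also needs a resource policy allowing
# events.amazonaws.com to invoke it (lambda add_permission).
```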

rivernews commented 1 year ago

DynamoDB Modeling

Primary table: just UUID

Landing page table:

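A rough sketch of the primary table as described, keyed by the UUID alone; the table name and billing mode are assumptions:

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Primary table: the partition key is just the UUID;
# everything else lives in non-key attributes.
dynamodb.create_table(
    TableName="stories",  # hypothetical name
    AttributeDefinitions=[{"AttributeName": "uuid", "AttributeType": "S"}],
    KeySchema=[{"AttributeName": "uuid", "KeyType": "HASH"}],
    BillingMode="PAY_PER_REQUEST",  # on-demand avoids guessing capacity up front
)
```
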
Action items

rivernews commented 1 year ago

Test the entire pipeline

rivernews commented 1 year ago

One-time batch processing

Better to build a tool that will remain useful later on.

Basically: turn S3 object(s) into a brand new DDB item.
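
As a sketch under assumptions (boto3, hypothetical table and attribute names), the core of that tool is an S3-event handler that writes one DDB item per object:

```python
import uuid

import boto3

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table("stories")  # hypothetical table name

def handler(event, context):
    # Triggered by an S3 event: each new or copied object
    # becomes a brand new DDB item keyed by a fresh UUID.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        table.put_item(
            Item={
                "uuid": str(uuid.uuid4()),
                "s3Key": key,        # attribute names are assumptions
                "size": len(body),
            }
        )
```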

To kick start,

Simplest way to do it?

Avoid writing unnecessary code. This one-time tool is going to be used very rarely after the first trigger. Leverage the S3 trigger plus the "move/copy" feature in the S3 bucket. The flow could be like:

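A sketch of that copy-driven flow, with hypothetical bucket and prefix names: copying each historical object into the prefix the trigger watches re-fires the existing pipeline, so no new pipeline code is needed.

```python
import time

import boto3

s3 = boto3.client("s3")

BUCKET = "media-literacy-landing-pages"  # hypothetical bucket name
SOURCE_PREFIX = "archive/"               # where historical pages sit
TRIGGER_PREFIX = "inbox/"                # prefix the S3 event notification watches

def replay_all():
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=SOURCE_PREFIX):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            # Each copy lands in the watched prefix and re-fires
            # the existing S3 trigger for that object.
            s3.copy_object(
                Bucket=BUCKET,
                CopySource={"Bucket": BUCKET, "Key": key},
                Key=TRIGGER_PREFIX + key[len(SOURCE_PREFIX):],
            )
            time.sleep(2)  # same low-profile throttle as above
```
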
rivernews commented 1 year ago

There are quite big cost implications; however, we don't know the exact amount of $$ we need to pay yet. Moving forward, it's time to think about the fast-track and cost-saving issues. We should open another issue to address these, since they are out of scope and no longer about achieving one-time batch processing.

For now, we will disable the cronjob and pause the pipeline. Next time, we may copy the stories over to prod for reuse. Once we have the fast-track feature (https://github.com/rivernews/media-literacy/issues/41), those will be skipped and we won't lose the computation done over these days.