rivernews / media-literacy

Exploring possibilities of technology empowering media literacy

(3/3) Create one-time trigger all historical landing page - fetch all stories #25

Closed: rivernews closed this issue 1 year ago

rivernews commented 2 years ago

Better way to run them all

Reference

Proper Throttling

It'd be best to reuse the Sfn, but limit the number of concurrent Sfn executions; overall we should aim at 5~100 concurrent lambdas, but nothing more. Ideally we can throttle to less than 1 request per 2 seconds.

But to truly keep a low profile, it's best to spread the work across hours, if not days.
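
A minimal sketch of what such a throttled trigger could look like, assuming Python/boto3; the state machine ARN, input shape, and thresholds here are placeholders, not the project's actual values:

```python
import json
import time

import boto3

sfn = boto3.client("stepfunctions")

STATE_MACHINE_ARN = "arn:aws:states:..."  # hypothetical; fill in the real Sfn ARN
MAX_CONCURRENT = 5   # stay at the low end of the 5~100 concurrent-lambda budget
MIN_INTERVAL_S = 2   # <1 request per 2 seconds

def running_count() -> int:
    # Count currently-running executions so we never exceed the cap.
    resp = sfn.list_executions(
        stateMachineArn=STATE_MACHINE_ARN, statusFilter="RUNNING", maxResults=100
    )
    return len(resp["executions"])

def trigger_all(landing_page_keys):
    for key in landing_page_keys:
        # Back off while we're at the concurrency cap.
        while running_count() >= MAX_CONCURRENT:
            time.sleep(30)
        sfn.start_execution(
            stateMachineArn=STATE_MACHINE_ARN,
            input=json.dumps({"s3Key": key}),  # input shape is an assumption
        )
        # The per-start sleep naturally spreads thousands of pages across hours.
        time.sleep(MIN_INTERVAL_S)
```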

Moving forward

The daily cronjob should automatically trigger our new S3-driven pipeline. Any other concerns?
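
For reference, a daily cron trigger can be a plain EventBridge rule. A hedged sketch, with hypothetical rule name and target ARN (the real pipeline entry point may differ):

```python
import boto3

events = boto3.client("events")

RULE_NAME = "media-literacy-daily-landing-fetch"  # hypothetical name
PIPELINE_LAMBDA_ARN = "arn:aws:lambda:..."        # hypothetical entry-point lambda

# Fire once a day; the S3-driven pipeline takes over from there.
events.put_rule(
    Name=RULE_NAME,
    ScheduleExpression="cron(0 14 * * ? *)",  # 14:00 UTC daily
    State="ENABLED",
)
events.put_targets(
    Rule=RULE_NAME,
    Targets=[{"Id": "daily-fetch", "Arn": PIPELINE_LAMBDA_ARN}],
)
# Note: the target lambda also needs a resource policy allowing
# events.amazonaws.com to invoke it (lambda add_permission).
```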

rivernews commented 1 year ago

DynamoDB Modeling

Primary table: just UUID

Landing page table:

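A rough sketch of the primary table as described, keyed by the UUID alone; the table name and billing mode are assumptions:

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Primary table: the partition key is just the UUID;
# everything else lives in non-key attributes.
dynamodb.create_table(
    TableName="stories",  # hypothetical name
    AttributeDefinitions=[{"AttributeName": "uuid", "AttributeType": "S"}],
    KeySchema=[{"AttributeName": "uuid", "KeyType": "HASH"}],
    BillingMode="PAY_PER_REQUEST",  # on-demand avoids guessing capacity up front
)
```
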
Action items

rivernews commented 1 year ago

Test the entire pipeline

rivernews commented 1 year ago

One-time batch processing

Better to build a tool that will remain useful later on.

Basically: turn S3 object(s) into a brand new DDB item.
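
As a sketch under assumptions (boto3, hypothetical table and attribute names), the core of that tool is an S3-event handler that writes one DDB item per object:

```python
import uuid

import boto3

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table("stories")  # hypothetical table name

def handler(event, context):
    # Triggered by an S3 event: each new or copied object
    # becomes a brand new DDB item keyed by a fresh UUID.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        table.put_item(
            Item={
                "uuid": str(uuid.uuid4()),
                "s3Key": key,        # attribute names are assumptions
                "size": len(body),
            }
        )
```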

To kick start,

Simplest way to do it?

Avoid writing unnecessary code. This one-time tool is going to be used very rarely after the first trigger. Leverage the S3 trigger plus the "move/copy" feature in the S3 bucket. The flow could be like:

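A sketch of that copy-driven flow, with hypothetical bucket and prefix names: copying each historical object into the prefix the trigger watches re-fires the existing pipeline, so no new pipeline code is needed.

```python
import time

import boto3

s3 = boto3.client("s3")

BUCKET = "media-literacy-landing-pages"  # hypothetical bucket name
SOURCE_PREFIX = "archive/"               # where historical pages sit
TRIGGER_PREFIX = "inbox/"                # prefix the S3 event notification watches

def replay_all():
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=SOURCE_PREFIX):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            # Each copy lands in the watched prefix and re-fires
            # the existing S3 trigger for that object.
            s3.copy_object(
                Bucket=BUCKET,
                CopySource={"Bucket": BUCKET, "Key": key},
                Key=TRIGGER_PREFIX + key[len(SOURCE_PREFIX):],
            )
            time.sleep(2)  # same low-profile throttle as above
```
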
rivernews commented 1 year ago

There are quite big cost implications; however, we don't know the exact amount of $$ we need to pay yet. Moving forward, it's time to think about the fast-track and cost-saving issues. We should open another issue to address these, since they are out of scope and no longer about achieving one-time batch processing.

For now, we will disable the cronjob and pause the pipeline. Next time, we may copy the stories over to prod for reuse. Once we have the fast-track feature (https://github.com/rivernews/media-literacy/issues/41), those will be skipped and we won't lose the computation done over these days.