rivernews commented 2 years ago

S3 Notification

In #23 we already set up S3 notifications using EventBridge. We want to do the same now for the metadata JSON, but it matters a lot where we store this JSON.

Recall that landing pages are stored at paths like s3://media-literacy-archives/{redacted}/daily-headlines/2022-08-21T12:15:42Z/landing.html. We set a prefix filter that stops at daily-headlines.
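For reference, a prefix filter like this can be sketched as an EventBridge event pattern. This is a minimal sketch, not the rule from #23: the helper name `build_object_created_pattern` is hypothetical, and the `{redacted}` path segment is passed through as a placeholder exactly as it appears above.

```python
import json


def build_object_created_pattern(bucket: str, key_prefix: str) -> dict:
    """Build an EventBridge event pattern matching S3 "Object Created"
    events in `bucket` whose object key starts with `key_prefix`."""
    return {
        "source": ["aws.s3"],
        "detail-type": ["Object Created"],
        "detail": {
            "bucket": {"name": [bucket]},
            # Prefix matching on the object key; everything under
            # daily-headlines matches, whatever the timestamp or file name.
            "object": {"key": [{"prefix": key_prefix}]},
        },
    }


# "{redacted}" is a placeholder kept from the path above, not a real prefix.
pattern = build_object_created_pattern(
    "media-literacy-archives", "{redacted}/daily-headlines/"
)
print(json.dumps(pattern, indent=2))
```

Because the timestamp directory sits between the fixed prefix and the file name, this pattern matches both landing.html and any sibling file written next to it.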

If we store the JSON in the same directory as the landing page, like s3://media-literacy-archives/{redacted}/daily-headlines/2022-08-21T12:15:42Z/landing_meta.json, it makes sense for a human, but no fixed prefix uniquely selects it, because the timestamp directory sits between the prefix and the file name. Filtering by suffix is not possible either.

A workaround could be to share the same prefix as the landing page filter, ...daily-headlines, and do the finer filtering in the lambda. So yes, we can't have separate rules for the landing page and the metadata JSON, but the purpose is served: in either case, a notification is generated. We just need to do some routing in the lambda, and no resources are wasted.
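The routing described above could look roughly like the sketch below. It assumes the lambda receives the EventBridge S3 event, where the object key is at `detail.object.key`. The `handler` and `classify_key` functions and the action names are hypothetical, for illustration only.

```python
from typing import Optional

# File names taken from the paths discussed above.
LANDING_PAGE_SUFFIX = "/landing.html"
LANDING_METADATA_SUFFIX = "/landing_meta.json"


def classify_key(key: str) -> Optional[str]:
    """Return which kind of object the key points at, or None if unrelated."""
    if key.endswith(LANDING_PAGE_SUFFIX):
        return "landing_page"
    if key.endswith(LANDING_METADATA_SUFFIX):
        return "landing_metadata"
    return None


def handler(event: dict, context: object = None) -> dict:
    """Route one shared notification to the right branch based on the key."""
    key = event.get("detail", {}).get("object", {}).get("key", "")
    kind = classify_key(key)
    if kind == "landing_page":
        action = "handle_landing_page"  # hypothetical action name
    elif kind == "landing_metadata":
        action = "handle_landing_metadata"  # hypothetical action name
    else:
        action = "ignore"  # some other object under the shared prefix
    return {"key": key, "action": action}
```

A single rule then covers both objects, and the suffix check inside the lambda takes the place of the suffix filter that the rule itself cannot express.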

Filtering stories

Some stories are not meaningful and we want to exclude them. We considered filtering at the metadata.json generation phase, and this is now added to the metadata.json logic in commit https://github.com/rivernews/media-literacy/pull/28/commits/b6851c567d002c3c7fdf0e49462d8f450ecc75c0

Storing stories

We decided to store stories in their own "store", like s3://media-literacy-archives/{redacted}/stories/story-IDorTitle/title.html. Under that directory you can then store the story parsing metadata. It can include the story's update history, the date it first appeared on the landing page, etc.
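As a rough illustration of this layout, the sketch below builds the object keys for one story. The helper `build_story_keys`, the `StoryKeys` type, and the metadata file name `story_meta.json` are all assumptions, not names from the repository. The `{redacted}` segment stays a placeholder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoryKeys:
    html: str      # key of the archived story page
    metadata: str  # key of the story parsing metadata


def build_story_keys(
    base_prefix: str,
    story_id: str,
    html_name: str = "title.html",
    metadata_name: str = "story_meta.json",  # hypothetical file name
) -> StoryKeys:
    """Place a story's HTML and its metadata under the same story directory."""
    story_dir = f"{base_prefix.rstrip('/')}/stories/{story_id}"
    return StoryKeys(
        html=f"{story_dir}/{html_name}",
        metadata=f"{story_dir}/{metadata_name}",
    )


keys = build_story_keys("{redacted}", "story-IDorTitle")
print(keys.html)
print(keys.metadata)
```

Keeping both objects under one story directory means later updates to the same story can be recorded next to the original archive.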

[x] Metadata.json triggers skeleton computing env: completed by commit https://github.com/rivernews/media-literacy/pull/28/commits/a62c33d602ac2b02728ae9a8e80b09baf76bf3ef
[x] Fetch story: https://github.com/rivernews/media-literacy/pull/29 - https://github.com/rivernews/media-literacy/pull/29/commits/e29fdc8cf0b104c973cc1a0153d4323f2437ce0b
[x] Archive story in appropriate S3 dir
[x] Scraping courtesy
- [x] Parallelism. We could inspect IPs to make sure each request uses a different one, but parallelism probably already implies unique IPs, so there is no need to confirm. Completed by https://github.com/rivernews/media-literacy/pull/29/commits/d2b025e86d7353846850b395d662b388a4b4952c
- [x] Randomize time duration. Completed by https://github.com/rivernews/media-literacy/pull/29/commits/ba830cd65d7b0e39dab9a72314441563c0320ac5.
  - Note: we may consider a pre-computed request distribution, perhaps with a step before the Map state in Sfn that assigns each request a wait time. We would also have to factor in the batch size.
  - This will summarize PR https://github.com/rivernews/media-literacy/pull/29
[ ] (optional) Add a final step in Sfn for closure, signaling that all randomized work is done. It could provide some stats too.
[ ] (optional) Clean up: remove the Slack command for batch-fetching stories
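
The pre-computed request distribution mentioned in the checklist could be sketched as a step that runs before the Map state and assigns each story a random wait. This is only a sketch under assumptions: `assign_wait_seconds`, the `waitSeconds` field, and the 300-second default are hypothetical, not what PR #29 implements.

```python
import random
from typing import List, Optional


def assign_wait_seconds(
    story_urls: List[str],
    max_wait_seconds: int = 300,  # hypothetical upper bound
    seed: Optional[int] = None,
) -> List[dict]:
    """Attach a random wait (0..max_wait_seconds, inclusive) to each URL.

    The returned list is meant as the input array of a Map state, so that
    each iteration can wait for its own duration before fetching.
    """
    rng = random.Random(seed)  # seeded only to make runs reproducible
    return [
        {"url": url, "waitSeconds": rng.randint(0, max_wait_seconds)}
        for url in story_urls
    ]


items = assign_wait_seconds(
    ["https://example.com/a", "https://example.com/b"], max_wait_seconds=60, seed=1
)
print(items)
```

Each Map iteration could then read its own `waitSeconds` in a Wait state (for example via `SecondsPath`) before fetching. A larger batch would likely need a larger `max_wait_seconds` to keep the same request spacing.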

rivernews commented 2 years ago

We decided to use the Sfn Map state for scraping courtesy. It is a bit more expensive, but it gives us more IP options.

rivernews commented 2 years ago

https://github.com/rivernews/media-literacy/pull/29/commits/ba830cd65d7b0e39dab9a72314441563c0320ac5 marks the last requirement. The enhancements remain undone, but can be revisited later.

rivernews / media-literacy

(2/3) Create landing metadata JSON trigger - invoke lambda - fetch stories #24
