rivernews / media-literacy

Exploring possibilities of technology empowering media literacy

(2/3) Create landing metadata JSON trigger - invoke lambda - fetch stories #24

Closed rivernews closed 2 years ago

rivernews commented 2 years ago

S3 Notification

In #23 we already set up an S3 notification through EventBridge. We want to do the same now for the metadata JSON, but where we store this JSON matters a lot.

Recall that landing pages are stored at paths like s3://media-literacy-archives/{redacted}/daily-headlines/2022-08-21T12:15:42Z/landing.html. The prefix filter we set stops at daily-headlines.

If we store the JSON in the same dir as the landing page, like s3://media-literacy-archives/{redacted}/daily-headlines/2022-08-21T12:15:42Z/landing_meta.json, it makes sense to a human, but there is no fixed prefix that uniquely matches it, because the timestamp dir sits between daily-headlines and the file name. Filtering by suffix is not possible either.

A workaround is to share the landing page's prefix filter (...daily-headlines) and do the finer filtering inside the lambda. So we can't have separate rules for the landing page and the metadata JSON, but the purpose is served: in either case a notification is generated. The lambda just needs to do some routing, and resources are still not wasted.
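The routing inside the lambda could look roughly like the sketch below. It assumes the event arrives in the EventBridge "Object Created" shape, with the object key under detail.object.key. The function names `route_s3_event` and `handler` and the route labels are made up for illustration; they are not taken from the repo.

```python
from typing import Any, Dict

# Hypothetical route labels, used only in this sketch.
LANDING_PAGE = "landing_page"
LANDING_METADATA = "landing_metadata"
IGNORED = "ignored"


def route_s3_event(event: Dict[str, Any]) -> str:
    """Decide what to do with an S3 notification that matched the shared prefix rule."""
    # Assumption: EventBridge S3 event, object key at detail.object.key.
    key = event.get("detail", {}).get("object", {}).get("key", "")

    # The EventBridge rule already filters on the prefix; re-check defensively.
    if "/daily-headlines/" not in key:
        return IGNORED

    # The timestamp dir varies, so route on the final path segment instead.
    filename = key.rsplit("/", 1)[-1]
    if filename == "landing.html":
        return LANDING_PAGE
    if filename == "landing_meta.json":
        return LANDING_METADATA
    return IGNORED


def handler(event: Dict[str, Any], context: Any) -> Dict[str, str]:
    """Lambda entry point: report the chosen route; real work would be dispatched here."""
    return {"route": route_s3_event(event)}
```

With this shape, a single rule on ...daily-headlines feeds both cases, and any other object written under that prefix is ignored rather than processed.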

Filtering stories

Some stories are not meaningful and we want to exclude them. We considered filtering at the metadata.json generation phase, and this is now added to the metadata.json logic in commit https://github.com/rivernews/media-literacy/pull/28/commits/b6851c567d002c3c7fdf0e49462d8f450ecc75c0

Storing stories

We decided to store stories in their own "store", like s3://media-literacy-archives/{redacted}/stories/story-IDorTitle/title.html. Under that dir we can also store story parsing metadata, which can include the story's update history, the date it first showed up on the landing page, etc.
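As a rough sketch of that layout, the helper below builds the object keys for one story and a small metadata record. Everything named here is an assumption for illustration: the `StoryMetadata` fields, the `metadata.json` file name, and the `story_keys` helper are not from the repo, and whether the story dir is named by ID or by title is still open.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class StoryMetadata:
    """Hypothetical per-story metadata record."""
    story_id: str
    first_seen_on_landing: str  # timestamp of the landing page where the story first appeared
    update_history: List[str] = field(default_factory=list)  # timestamps of later fetches


def story_keys(base_prefix: str, story_id: str) -> Dict[str, str]:
    """Return S3 keys for a story's HTML and its metadata under the stories store."""
    story_dir = f"{base_prefix.rstrip('/')}/stories/{story_id}"
    return {
        "html": f"{story_dir}/title.html",         # file name mirrors the example path above
        "metadata": f"{story_dir}/metadata.json",  # hypothetical file name
    }


def serialize_metadata(meta: StoryMetadata) -> str:
    """Serialize the record to JSON, ready to be written next to the story HTML."""
    return json.dumps(asdict(meta), sort_keys=True)
```

Keeping the metadata next to the HTML means a story's history can be read with one prefix listing.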

rivernews commented 2 years ago

We decided to use a Step Functions (Sfn) Map state for scraping courtesy. It is a bit more expensive, but we'll have more IP options.
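One way such a Map state could be shaped is sketched below as a Python function that builds the Amazon States Language definition. This is an assumption about the design, not the actual state machine: the `$.stories` input path, the state names, the 5-second pause, and the `build_story_map_state` helper are all hypothetical. Courtesy here comes from capping `MaxConcurrency` and pausing after each fetch.

```python
import json
from typing import Any, Dict


def build_story_map_state(fetch_lambda_arn: str, max_concurrency: int = 1) -> Dict[str, Any]:
    """Build a Map state that fetches stories with limited concurrency."""
    return {
        "Type": "Map",
        "ItemsPath": "$.stories",           # hypothetical: list of story URLs in the input
        "MaxConcurrency": max_concurrency,  # cap on parallel fetches
        "Iterator": {
            "StartAt": "FetchStory",
            "States": {
                "FetchStory": {
                    "Type": "Task",
                    "Resource": fetch_lambda_arn,  # one lambda invocation per story
                    "Next": "Pause",
                },
                # Hypothetical pause so consecutive requests are spaced out.
                "Pause": {"Type": "Wait", "Seconds": 5, "End": True},
            },
        },
        "End": True,
    }


if __name__ == "__main__":
    # Placeholder ARN; substitute the real function ARN.
    arn = "arn:aws:lambda:REGION:ACCOUNT_ID:function:fetch-story"
    print(json.dumps(build_story_map_state(arn), indent=2))
```

Each story gets its own lambda invocation, which is where the extra cost comes from compared to looping inside a single lambda.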

rivernews commented 2 years ago

https://github.com/rivernews/media-literacy/pull/29/commits/ba830cd65d7b0e39dab9a72314441563c0320ac5 marks the last of the requirements. Enhancements remain undone, but can be revisited later.