Closed rivernews closed 2 years ago
We decided to use a Step Functions (Sfn) Map state for scraping courtesy. It is a bit more expensive, but gives us more IP options.
https://github.com/rivernews/media-literacy/pull/29/commits/ba830cd65d7b0e39dab9a72314441563c0320ac5 marks the last of the requirements. The enhancements remain undone, but can be revisited later.
S3 Notification
In #23 we already set up some S3 notifications using EventBridge. We want to do the same now for the metadata JSON, but it matters a lot where we store this JSON.
Recall that landing pages are stored like
s3://media-literacy-archives/{redacted}/daily-headlines/2022-08-21T12:15:42Z/landing.html
and that the prefix filter we set stops at daily-headlines.
If we store the JSON in the same directory as the landing page, like
s3://media-literacy-archives/{redacted}/daily-headlines/2022-08-21T12:15:42Z/landing_meta.json
it makes sense to a human, but no fixed prefix selects it uniquely, and filtering by suffix is not possible either.
A workaround is to share the landing page filter's prefix, ...daily-headlines, and do the finer filtering in the Lambda. You can't have separate rules for the landing page and the metadata JSON, but the purpose is served: in either case a notification is generated. You only need some routing in the Lambda, and no resources are wasted.
Filtering stories
Some stories are not meaningful and we want to exclude them. We considered filtering at the metadata.json generation phase; this is now added to the metadata.json logic, in commit https://github.com/rivernews/media-literacy/pull/28/commits/b6851c567d002c3c7fdf0e49462d8f450ecc75c0
Storing stories
We decided to store stories in their own "store", like
s3://media-literacy-archives/{redacted}/stories/story-IDorTitle/title.html
Then, under that directory, you can store story parsing metadata. It can include the story's update history, the date it first showed up on a landing page, etc.
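To make the key conventions in this note concrete, here is a minimal Python sketch of the Lambda-side routing from the S3 Notification section, plus the story-store layout. It is an illustration under assumptions, not the project's actual code: the function names (classify_key, handler, story_keys) and the metadata.json file name under the story directory are hypothetical, the redacted part of the prefix is passed in as a parameter, and the event shape assumed is the S3 "Object Created" event as delivered through EventBridge (key under detail.object.key).

```python
from typing import Optional

LANDING_PAGE = "landing.html"
LANDING_METADATA = "landing_meta.json"


def classify_key(key: str) -> Optional[str]:
    """Return 'landing_page', 'landing_metadata', or None for an S3 object key."""
    # The rule already filters on the shared ...daily-headlines prefix;
    # this check only guards against keys that come from elsewhere.
    if "/daily-headlines/" not in key:
        return None
    filename = key.rsplit("/", 1)[-1]
    if filename == LANDING_PAGE:
        return "landing_page"
    if filename == LANDING_METADATA:
        return "landing_metadata"
    return None


def handler(event: dict, context: object = None) -> str:
    """Route a single S3 event received through EventBridge (assumed shape)."""
    key = event["detail"]["object"]["key"]
    kind = classify_key(key)
    if kind == "landing_page":
        return f"process landing page: {key}"
    if kind == "landing_metadata":
        return f"process landing metadata: {key}"
    # Anything else under the shared prefix is ignored.
    return f"ignored: {key}"


def story_keys(base_prefix: str, story_id: str) -> dict:
    """Build the keys for one story in its own store.

    base_prefix stands in for the redacted part of the path;
    metadata.json is a hypothetical name for the parsing metadata.
    """
    story_dir = f"{base_prefix.rstrip('/')}/stories/{story_id}"
    return {
        "html": f"{story_dir}/title.html",
        "metadata": f"{story_dir}/metadata.json",
    }
```

Because both object types share one rule, the only extra cost is the file-name check at the top of the handler; every notification still lands on a branch that either does work or returns immediately.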