rivernews / media-literacy

Exploring possibilities of technology empowering media literacy

Fetch individual story pages #15

Closed rivernews closed 1 year ago

rivernews commented 3 years ago

Plan

(image attachment)

Trigger

Synchronous Logic as Sfn

Additional Lambdas for aggregation steps that need to join all the concurrent processes, e.g. the word cloud.

rivernews commented 2 years ago

Short term advancement

Modularizing scraping based on a "scrape pattern"

(image attachment)
rivernews commented 2 years ago

All of the above is valuable, but could be over-complicated at this point.

What we want to do is just download stories - just fetch the HTML. No parsing, so actually no scraping of story pages.

Now, where should the fetch logic be placed?

  1. The most optimized way is to integrate with the landing page scraper. We are already scraping links there and posting to Slack - why not fetch the stories there too?
  2. The modularized way is to only read the landing page HTML from S3, almost like working offline. You do have to repeat what's done in the landing page scraper, though - the extract-story-links part - a lot of duplicated logic indeed.

🍓 It seems that way 1 is better. We want to be (or at least previously wanted to be) careful here, because later we could end up duplicating the same logic once we decide to scrape story pages. But look, the landing page and story page processes could be quite different:

Yes, the input part is the same - both are URLs. But the scraping goals are very different.

We previously explored using a single SQS pipeline to handle both processes. That's where we started thinking through the practical details, and started 🍓 feeling that way 2 may actually be better:

rivernews commented 2 years ago

Break it down more?

  1. Fetch landing page
  2. Parse landing page, store metadata (including story urls) in JSON
  3. Fetch story pages

In this way, you can reuse the "fetching" logic for SQS.
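The three-step breakdown above could be factored so that steps 1 and 3 share one fetch function. A hypothetical Python sketch - function names and the regex-based parsing are illustrative, not the repo's actual code:

```python
import re
import urllib.request

def fetch(url: str) -> str:
    """Shared fetch step (landing and story pages alike): raw HTML only, no parsing."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")

def parse_landing(html: str) -> dict:
    """Step 2: extract story titles/links into a JSON-serializable metadata dict.
    A real implementation would use a proper HTML parser instead of a regex."""
    links = re.findall(r'<a[^>]+href="([^"]+)"[^>]*>([^<]+)</a>', html)
    return {"stories": [{"url": u, "title": t} for u, t in links]}

def run_pipeline(landing_url: str) -> list[str]:
    metadata = parse_landing(fetch(landing_url))            # steps 1 + 2
    return [fetch(s["url"]) for s in metadata["stories"]]   # step 3 reuses fetch()
```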

  1. Cronjob: send the landing page URL to SQS
  2. SQS-Lambda: fetch the landing page, store it in S3
    • S3 dir s3://media-literacy-archives/{redacted}/daily-headlines/2022-08-21T12:15:42Z/landing.html
  3. S3 event bridge: new files (HTML) created in the landing page S3 directory trigger the landing page scraper
    • How do we determine the S3 dir?
  4. Lambda: scrape the landing page:
    • extract story titles and links, store them on S3 as JSON
    • the landing page metadata JSON created in the S3 directory (later used for the word cloud, grouping stories by day / landing page) triggers the fetch-story logic
    • (read the JSON) send story URLs to SQS

In the future, we can create an S3 event bridge for scraping stories.

rivernews commented 2 years ago

Now that we have come up with a cloud-component plan, we need to implement it.

After rethinking it, instead of trying to land on an optimized solution, let's leave some redundancy. As for the Terraform code, we want to leave it there for POC as well, so let's not change that. Just start adding on top of the existing stuff.

  1. Let's kick-start by reading through the S3 dir of landing pages {then set up an event for when S3 creates a new landing file}
    • #25

  2. Parse the landing page (yes, even though we did this before, we only posted to Slack and did not preserve the outcome, so we have to do it over again)
    • Generate the landing page metadata JSON. Where should we put it in the S3 dir?
      • #23

    • (Should fetching stories be part of landing page parsing, or not? But there are lots of stories - like >100 - so it's not great to do it all at once. Better to just store the story URLs in the metadata JSON; then we can take the time to decide what to do. Also better to use different IPs, via either Sfn mapping or SQS. Sfn Map can control concurrency out of the box, but SQS cannot.)
      • #24

      • Let's keep it simple: Sfn mapping or SQS, either works, just choose one first.
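On the concurrency point above: a Step Functions Map state caps fan-out with its built-in `MaxConcurrency` field, while a plain SQS trigger offers no equivalent per-pipeline cap. A sketch of such a Map state as an Amazon States Language fragment, expressed as a Python dict - the state names, paths, and Lambda ARN are placeholders:

```python
# Hypothetical ASL Map state for the fetch-story fan-out; only MaxConcurrency
# is the point being illustrated, everything else is a placeholder.
fetch_stories_map = {
    "Type": "Map",
    "ItemsPath": "$.storyUrls",   # story URLs read from the metadata JSON
    "MaxConcurrency": 5,          # built-in concurrency cap; SQS has no direct equivalent
    "Iterator": {
        "StartAt": "FetchStory",
        "States": {
            "FetchStory": {
                "Type": "Task",
                "Resource": "arn:aws:lambda:<region>:<account>:function:fetch-story",  # placeholder
                "End": True,
            }
        },
    },
}
```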
rivernews commented 1 year ago

Root pull request (now actually dev): https://github.com/rivernews/media-literacy/pull/28/files We should probably create a separate PR for each specific issue.

Next steps

rivernews commented 1 year ago

Closing, since the items above are all done.