rivernews / media-literacy

Exploring possibilities of technology empowering media literacy

Fetch individual story pages #15

Closed rivernews closed 1 year ago

rivernews commented 3 years ago

Plan

(image attachment)

Trigger

Synchronous Logic as Sfn

Additional Lambdas for aggregation steps that need to join all the concurrent processes, e.g. the word cloud.

rivernews commented 2 years ago

Short term advancement

Modularizing scraping based on a "scrape pattern"

(image attachment)
rivernews commented 2 years ago

All of the above is valuable, but could be over-complicated at this point.

What we want to do is just download stories - just fetch the HTML. No parsing, so actually no scraping of story pages.

Now, where should the fetch logic be placed?

  1. The most optimized way is to integrate with the landing page scraper. We are already scraping links there and posting to Slack - why not fetch the stories there too?
  2. The modularized way is to only read the landing page HTML from S3, almost like working offline. You do have to repeat what's done in the landing page scraper, though - the extract-story-links part - a lot of duplicated logic indeed.

🍓 It seems that way 1 is better. We want to be (or at least previously wanted to be) careful here, because later we could end up duplicating the same logic once we decide to scrape story pages. But look, the landing page and story page processes could be quite different:

Yes, the input part is the same - both are URLs. But the scraping goals are very different.

We previously explored using a single SQS pipeline to handle both processes. That's where we started thinking through the practical details, and started 🍓 feeling that way 2 may actually be better:

rivernews commented 2 years ago

Break it down more?

  1. Fetch landing page
  2. Parse landing page, store metadata (including story urls) in JSON
  3. Fetch story pages

In this way, you can reuse the "fetching" logic for SQS.
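The three-step breakdown above could be factored so that steps 1 and 3 share one fetch function. A hypothetical Python sketch - function names and the regex-based parsing are illustrative, not the repo's actual code:

```python
import re
import urllib.request

def fetch(url: str) -> str:
    """Shared fetch step (landing and story pages alike): raw HTML only, no parsing."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")

def parse_landing(html: str) -> dict:
    """Step 2: extract story titles/links into a JSON-serializable metadata dict.
    A real implementation would use a proper HTML parser instead of a regex."""
    links = re.findall(r'<a[^>]+href="([^"]+)"[^>]*>([^<]+)</a>', html)
    return {"stories": [{"url": u, "title": t} for u, t in links]}

def run_pipeline(landing_url: str) -> list[str]:
    metadata = parse_landing(fetch(landing_url))            # steps 1 + 2
    return [fetch(s["url"]) for s in metadata["stories"]]   # step 3 reuses fetch()
```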

  1. Cronjob: send the landing page URL to SQS
  2. SQS-Lambda: fetch the landing page, store it in S3
    • S3 dir s3://media-literacy-archives/{redacted}/daily-headlines/2022-08-21T12:15:42Z/landing.html
  3. S3 event bridge: new files (HTML) created in the landing page S3 directory trigger the landing page scraper
    • How do we determine the S3 dir?
  4. Lambda: scrape the landing page:
    • extract story titles and links, store them on S3 as JSON
    • the landing page metadata JSON created in the S3 directory (later used for the word cloud, grouping stories by day / landing page) triggers the fetch-story logic
    • (read the JSON) send story URLs to SQS

In the future, we can create an S3 event bridge for scraping stories.

rivernews commented 2 years ago

Now that we have come up with a cloud-component plan, we need to implement it.

After rethinking it, instead of trying to land on an optimized solution, let's leave some redundancy. As for the Terraform code, we want to leave it there for POC as well, so let's not change that. Just start adding on top of the existing stuff.

  1. Let's kick-start by reading through the S3 dir of landing pages {then set up an event for when S3 creates a new landing file}
    • #25

  2. Parse the landing page (yes, even though we did this before, we only posted to Slack and did not preserve the outcome, so we have to do it over again)
    • Generate the landing page metadata JSON. Where should we put it in the S3 dir?
      • #23

    • (Should fetching stories be part of landing page parsing, or not? But there are lots of stories - like >100 - so it's not great to do it all at once. Better to just store the story URLs in the metadata JSON; then we can take the time to decide what to do. Also better to use different IPs, via either Sfn mapping or SQS. Sfn Map can control concurrency out of the box, but SQS cannot.)
      • #24

      • Let's keep it simple: Sfn mapping or SQS, either works, just choose one first.
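On the concurrency point above: a Step Functions Map state caps fan-out with its built-in `MaxConcurrency` field, while a plain SQS trigger offers no equivalent per-pipeline cap. A sketch of such a Map state as an Amazon States Language fragment, expressed as a Python dict - the state names, paths, and Lambda ARN are placeholders:

```python
# Hypothetical ASL Map state for the fetch-story fan-out; only MaxConcurrency
# is the point being illustrated, everything else is a placeholder.
fetch_stories_map = {
    "Type": "Map",
    "ItemsPath": "$.storyUrls",   # story URLs read from the metadata JSON
    "MaxConcurrency": 5,          # built-in concurrency cap; SQS has no direct equivalent
    "Iterator": {
        "StartAt": "FetchStory",
        "States": {
            "FetchStory": {
                "Type": "Task",
                "Resource": "arn:aws:lambda:<region>:<account>:function:fetch-story",  # placeholder
                "End": True,
            }
        },
    },
}
```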
rivernews commented 1 year ago

Root pull request (now actually dev): https://github.com/rivernews/media-literacy/pull/28/files We should probably create a separate PR for each specific issue.

Next steps

rivernews commented 1 year ago

Closing, since the items above are all done.