rivernews / media-literacy

Exploring possibilities of technology empowering media literacy

Produce to SNS w/ . If it's from the homepage, then the date is that date's "top news"; otherwise the URL already includes the story and a date. #13

Closed rivernews closed 3 years ago

rivernews commented 3 years ago

If we're fetching from the home page, we don't actually need to fetch each story page at this point, just the home page. SNS will cover "storing" the to-be-fetched pages. But this requires configuring a longer retention period, and our development has to race against it: before the messages expire (14 days max, like SQS), we must finish implementing the part that subscribes to and processes the messages.
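As a rough sketch of that producing step (not the actual implementation), publishing a to-be-fetched page to SNS with the AWS SDK for Go v2 could look like the following. The topic ARN, the `pendingFetch` message shape, and its field names are placeholders for illustration.

```go
package main

import (
	"context"
	"encoding/json"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/sns"
)

// pendingFetch is a hypothetical message shape for a page we plan to fetch later.
type pendingFetch struct {
	NewsSite string `json:"newsSite"`
	StoryURL string `json:"storyUrl"`
	Date     string `json:"date"` // landing-page date for "top news", or the date embedded in the story URL
}

func main() {
	ctx := context.Background()

	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	client := sns.NewFromConfig(cfg)

	body, err := json.Marshal(pendingFetch{
		NewsSite: "example-news",
		StoryURL: "https://example.com/2021/01/01/some-story",
		Date:     "2021-01-01",
	})
	if err != nil {
		log.Fatal(err)
	}

	// Publish the to-be-fetched page; a subscriber (still to be implemented) processes it later.
	_, err = client.Publish(ctx, &sns.PublishInput{
		TopicArn: aws.String("arn:aws:sns:us-east-1:123456789012:pending-fetch"), // placeholder ARN
		Message:  aws.String(string(body)),
	})
	if err != nil {
		log.Fatal(err)
	}
}
```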

Additional notes

rivernews commented 3 years ago

A new proposal

Overall flow

  1. API request(newsSiteLandingPage) --> parse landing page, archive on S3
  2. Pull landing page from S3 --> extract all story links
  3. Given the links --> archive all stories on S3 --> extract all stories [rate control] --> store in DynamoDB w/ de-duplication
    • Can be concurrency=1; paginated, linked processing; expected to be a very long process and can't be done in parallel.
  4. Generate the word cloud after all stories from the landing page are finished.
    • Pull from DynamoDB.query(newsSite=newsSite, date=date) (see the query sketch after this list)
  5. Optional - post to social media
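
For step 4, a minimal sketch of that DynamoDB query, assuming `newsSite` is the partition key and a sort key (called `storyKey` here, a placeholder) begins with the date; the table name and key names are assumptions, not the actual schema.

```go
package main

import (
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb/types"
)

func main() {
	ctx := context.Background()

	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	client := dynamodb.NewFromConfig(cfg)

	// Query all stories for one news site and one landing-page date,
	// as input to the word-cloud generation step.
	out, err := client.Query(ctx, &dynamodb.QueryInput{
		TableName:              aws.String("stories"), // placeholder table name
		KeyConditionExpression: aws.String("newsSite = :site AND begins_with(storyKey, :date)"),
		ExpressionAttributeValues: map[string]types.AttributeValue{
			":site": &types.AttributeValueMemberS{Value: "example-news"},
			":date": &types.AttributeValueMemberS{Value: "2021-01-01"},
		},
	})
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("stories for word cloud: %d", len(out.Items))
}
```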

Why split 1. and 2.?

Why split 2. and 3.?

Splitting 2. and 3. is going to make 4. harder

Still, can some steps be combined rather than split while still fitting our goal?

rivernews commented 3 years ago

Focus: fetch and send to SQS for later processing
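
A minimal sketch of the "send to SQS for later processing" half, using the AWS SDK for Go v2; the queue URL and the message body format (a plain story URL) are assumptions for illustration, not the actual setup.

```go
package main

import (
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/sqs"
)

// enqueueStory pushes one story URL onto an SQS queue so a separate
// consumer can fetch and parse it later.
func enqueueStory(ctx context.Context, client *sqs.Client, queueURL, storyURL string) error {
	_, err := client.SendMessage(ctx, &sqs.SendMessageInput{
		QueueUrl:    aws.String(queueURL),
		MessageBody: aws.String(storyURL),
	})
	return err
}

func main() {
	ctx := context.Background()

	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	client := sqs.NewFromConfig(cfg)

	// Placeholder queue URL; the real queue name is not in this issue.
	queueURL := "https://sqs.us-east-1.amazonaws.com/123456789012/pending-stories"
	if err := enqueueStory(ctx, client, queueURL, "https://example.com/2021/01/01/some-story"); err != nil {
		log.Fatal(err)
	}
}
```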


With that, we need to both fetch the body AND parse it. Ferret doesn't seem to handle this easily out of the box... its DOCUMENT function only accepts a URL, not a binary body. Maybe one last try, looking through its examples... nope, nothing related to archiving there.

Let's shop around for another scraper that can match our needs...

Now we've switched to GoQuery (which still doesn't have great docs: only one example, and little explanation of how to use most of the functions). Next:
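
For reference, a small sketch of the fetch-then-parse flow with GoQuery: since goquery.NewDocumentFromReader accepts any io.Reader, the same bytes we archive to S3 can also be parsed, which is exactly what Ferret's DOCUMENT couldn't give us. The URL and the bare `a` selector are placeholders, not the real site or extraction logic.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"log"
	"net/http"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	// Fetch the landing page once; the raw body can be archived to S3 as-is.
	resp, err := http.Get("https://example.com") // placeholder news site URL
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}

	// Unlike Ferret's DOCUMENT (URL only), goquery parses an in-memory body,
	// so archiving and parsing can share the same bytes.
	doc, err := goquery.NewDocumentFromReader(bytes.NewReader(body))
	if err != nil {
		log.Fatal(err)
	}

	// Extract candidate story links from the landing page.
	doc.Find("a").Each(func(i int, s *goquery.Selection) {
		if href, ok := s.Attr("href"); ok {
			fmt.Println(href)
		}
	})
}
```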

rivernews commented 3 years ago

S3 Archiving Improvement

Scraping Parallelism

Options