time-less-ness / trust-assembly

For SomeGuy's Trust Assembly Project

Scraping News Sites by Date and Article Level #4

Open MelvinSninkle opened 1 month ago

MelvinSninkle commented 1 month ago

Title: Implement Scraping for Fox, CNN, and MSNBC at Article Level

Description: Develop a web scraping solution to extract headlines from Fox News, CNN, and MSNBC. Data should be collected by date and at the article level.

Leads To: #6

Tasks:

  1. Select a web scraping tool (e.g., BeautifulSoup, Scrapy, Puppeteer) for efficient extraction.
  2. Set up pipelines to collect article URLs, headlines, authors, and publication dates.
  3. Ensure data collection is logged by date and stored at the article level.
  4. Implement error handling and retries for failed attempts (up to three retries).
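
A minimal sketch of task 4, assuming "up to three retries" means one initial attempt plus at most three more. The names `fetch_with_retries`, `max_retries`, and `base_delay` are hypothetical, and the exponential backoff between attempts is an assumption rather than something this issue specifies. The `fetch` callable stands in for whatever the chosen scraping tool uses to download a page.

```python
import logging
import time
from typing import Callable, TypeVar

T = TypeVar("T")
logger = logging.getLogger("scraper.retry")


def fetch_with_retries(
    fetch: Callable[[], T],
    max_retries: int = 3,          # assumption: retries on top of the first attempt
    base_delay: float = 1.0,       # assumption: backoff is not specified in the issue
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(); on failure, retry up to max_retries more times."""
    for attempt in range(max_retries + 1):
        try:
            return fetch()
        except Exception as exc:  # real code should catch network errors only
            if attempt == max_retries:
                logger.error("giving up after %d retries: %s", max_retries, exc)
                raise
            delay = base_delay * 2 ** attempt  # 1s, 2s, 4s with the defaults
            logger.warning(
                "attempt %d failed (%s); retrying in %.1fs", attempt + 1, exc, delay
            )
            sleep(delay)
    raise ValueError("max_retries must be >= 0")
```

Passing `sleep` as a parameter keeps the function testable without real waiting; a caller would normally leave it at the default.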

Acceptance Criteria:

  • Scraping functions correctly for Fox News, CNN, and MSNBC.
  • Extracted data includes URLs, headlines, authors, and publication dates.
  • Data is stored in a structured format with clear logging.

Priority: High
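
One possible shape for "stored in a structured format with clear logging", offered as a sketch rather than a decided design. It assumes one JSON Lines file per source per publication date; the `Article` fields mirror the four items in the acceptance criteria, while the `source` field, the `store_article` helper, and the directory layout are hypothetical.

```python
import json
import logging
from dataclasses import asdict, dataclass
from datetime import date
from pathlib import Path

logger = logging.getLogger("scraper.store")


@dataclass(frozen=True)
class Article:
    source: str          # hypothetical key, e.g. "cnn"
    url: str
    headline: str
    authors: list[str]
    published: date


def store_article(article: Article, root: Path) -> Path:
    """Append one article as a JSON line to <root>/<source>/<date>.jsonl."""
    out_dir = root / article.source
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{article.published.isoformat()}.jsonl"

    record = asdict(article)
    record["published"] = article.published.isoformat()  # dates are not JSON-native

    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")

    logger.info("stored %s article %s in %s", article.source, article.url, path)
    return path
```

Appending one line per article keeps the data at article level, and the file name gives the by-date grouping. This sketch does not deduplicate, so storing the same article twice writes two lines.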

Labels: Backend, Scraping, Data Collection, MVP

chalcolith commented 1 month ago

I could start taking a look at this. I think I would start by writing a technical spec. What tech stack would be appropriate? Most of my backend experience is with .NET in C#, so that would be my first impulse, but if there is a more preferred stack I could get up to speed on that.

MelvinSninkle commented 1 month ago

I do not have a stack preference for this piece. My only concern would be thinking about how we scale this. It seems like this should be a durable pipeline that will work even as we start to manipulate the data differently. Please challenge me on that if I seem to be wrong.