knox-academy / webscraping


Create a webscraper to read RSS feed from bleepingcomputer.com #10

Closed knox-academy closed 1 year ago

knox-academy commented 1 year ago

We need a web scraper / feed reader for these two feeds: https://www.bleepingcomputer.com/feed/ and https://www.bleepingcomputer.com/virus-removal/feed/. It should store the 5 most recent articles, along with the article feed data, in a JSON file. The JSON files should be stored in an S3 bucket named knox-academy-rssfeed-data.
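
Both feeds are standard RSS 2.0, so each article arrives as an `<item>` element with `title`, `link`, and `pubDate` children. As a rough sketch of the extraction — using the standard library's `xml.etree.ElementTree` here so it runs dependency-free; the same lookups translate directly to whatever scraping tool the team picks — the sample XML below is illustrative, not a real feed excerpt:

```python
import xml.etree.ElementTree as ET

# Illustrative sample of the RSS 2.0 structure both feeds use.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>BleepingComputer</title>
  <item>
    <title>Example article</title>
    <link>https://www.bleepingcomputer.com/news/example/</link>
    <pubDate>Mon, 01 May 2023 12:00:00 -0400</pubDate>
  </item>
</channel></rss>"""

def extract_items(rss_xml: str) -> list[dict]:
    """Return one dict per <item>, keyed by child tag name."""
    root = ET.fromstring(rss_xml)
    return [
        {child.tag: (child.text or "").strip() for child in item}
        for item in root.iter("item")
    ]

articles = extract_items(SAMPLE_RSS)
# articles[0]["title"] == "Example article"
```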

knox-academy commented 1 year ago

Mike McConnelly:

  1. Research and select a web scraping tool that can read RSS feeds.
  2. Create a script to scrape the two RSS feeds from bleepingcomputer.com.
  3. Test the script to ensure it can scrape the feeds and extract the necessary data.
  4. Set up an S3 bucket named knox-academy-rssfeed-data to store the JSON files.
  5. Write a script to convert the scraped data into JSON format.
  6. Test the JSON conversion script to ensure it is working correctly.
  7. Implement a system to store the 5 most recent articles from each feed.
  8. Test the system to ensure it is correctly storing the articles.
  9. Set up a schedule for the script to run and scrape the feeds regularly.
  10. Monitor the system to ensure it is running smoothly and making the necessary updates to the JSON files.
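
Steps 5 and 6 boil down to serializing the extracted items; a minimal sketch of the conversion (the `feed`/`articles` field names are illustrative, not a settled schema):

```python
import json

def to_json(feed_url: str, items: list[dict]) -> str:
    """Wrap scraped items with their source feed URL and serialize to JSON."""
    payload = {"feed": feed_url, "articles": items}
    return json.dumps(payload, indent=2)

doc = to_json(
    "https://www.bleepingcomputer.com/feed/",
    [{"title": "Example article", "link": "https://example.com/a"}],
)
```

Round-tripping the output through `json.loads` is the cheapest form of the test called for in step 6.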
knox-academy commented 1 year ago

Dan Carter: I have an objection to task #1. We should not spend time researching and selecting a web scraping tool when we can use a tool that we are already familiar with. Let's use Beautiful Soup, which our dev team is already trained in.

Mike McConnelly: That's a good point, Dan. Let's update task #1 to read "Use Beautiful Soup to scrape the RSS feeds."

Dan Carter: I also have an objection to task #7. Instead of storing the 5 most recent articles from each feed, let's store the 10 most recent articles from both feeds combined. This will give us a more comprehensive view of the latest articles.

Mike McConnelly: That makes sense, Dan. Let's update task #7 to read "Implement a system to store the 10 most recent articles from both feeds combined."

Dan Carter: Lastly, I have an objection to task #9. Instead of setting up a schedule for the script to run, let's use a continuous integration tool like Jenkins to automate the scraping and updating process.

Mike McConnelly: That's a great idea, Dan. Let's update task #9 to read "Automate the scraping and updating process using a continuous integration tool like Jenkins."
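
The revised task #7 means merging both feeds before truncating. Sorting on the RFC 822 `pubDate` — parsed with the standard library's `email.utils` — is one way to sketch it; the item shapes here are illustrative:

```python
from email.utils import parsedate_to_datetime

def ten_most_recent(*feeds: list[dict]) -> list[dict]:
    """Merge any number of feeds and keep the 10 newest items by pubDate."""
    merged = [item for feed in feeds for item in feed]
    # RSS dates are RFC 822 strings; parse them so sorting is chronological.
    merged.sort(key=lambda i: parsedate_to_datetime(i["pubDate"]), reverse=True)
    return merged[:10]

feed_a = [{"title": "older", "pubDate": "Mon, 01 May 2023 12:00:00 -0400"}]
feed_b = [{"title": "newer", "pubDate": "Tue, 02 May 2023 12:00:00 -0400"}]
latest = ten_most_recent(feed_a, feed_b)
# latest[0]["title"] == "newer"
```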

knox-academy commented 1 year ago

Mike McConnelly:

  1. Use Beautiful Soup as the scraping tool for the RSS feeds.
  2. Create a script that uses it to scrape both feeds from bleepingcomputer.com.
  3. Test the script to ensure it is scraping the correct data.
  4. Parse the scraped data to extract the article information.
  5. Create a JSON file to store the article information.
  6. Test the JSON file to ensure it is storing the correct data.
  7. Implement a system to store the 10 most recent articles from both feeds combined.
  8. Set up an S3 bucket named knox-academy-rssfeed-data to store the JSON files.
  9. Automate the scraping and updating process using a continuous integration tool like Jenkins.
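
For step 8, a hedged sketch of the upload side: `boto3` is assumed to be installed with AWS credentials already configured, and the object key is illustrative — only the bucket name comes from the ticket.

```python
import json

def build_payload(articles: list[dict]) -> bytes:
    """Serialize the retained articles for upload."""
    return json.dumps({"articles": articles}, indent=2).encode("utf-8")

def upload_to_s3(articles: list[dict], key: str = "latest-articles.json") -> None:
    """Upload the JSON document to the team's bucket.

    Assumes boto3 is installed and AWS credentials are configured.
    """
    import boto3  # imported here so the rest of the module runs without it
    s3 = boto3.client("s3")
    s3.put_object(
        Bucket="knox-academy-rssfeed-data",
        Key=key,
        Body=build_payload(articles),
        ContentType="application/json",
    )
```

A Jenkins job (step 9) would then simply run this script on its trigger; no scraping logic needs to live in the CI configuration itself.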