knox-academy / webscraping


Need to create a Python script to scrape Hacker News daily #21

Closed knox-academy closed 1 year ago

knox-academy commented 1 year ago

Need to create a Python script to scrape Hacker News daily. It should save the data in JSON format, skip duplicates, run on a schedule via GitHub Actions, and save the output to an S3 bucket.

knox-academy commented 1 year ago

Mike McConnelly:

  1. Research and select the appropriate Python libraries for web scraping and JSON formatting.
  2. Develop a script that can scrape Hacker News daily and save the data in JSON format.
  3. Implement a function to check for and avoid saving duplicate data.
  4. Set up GitHub Actions to run the script on a schedule.
  5. Configure the script to save the scraped data to an S3 bucket.
  6. Test the script thoroughly to ensure it is functioning as intended.
  7. Document the script and its functionality for future reference and maintenance.
  8. Create a user guide for the script to assist with onboarding new team members.
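
As a starting point for steps 1 and 2, here is a minimal sketch that uses only the standard library. It assumes we read from the public Hacker News API (hosted on Firebase) rather than parsing the HTML front page; that choice, the function names, and the default limit of 30 are illustrative and not something decided in this thread.

```python
import json
import urllib.request
from typing import Callable

# Public Hacker News API base URL. Using the API instead of HTML scraping is an assumption.
HN_API = "https://hacker-news.firebaseio.com/v0"


def fetch_json(url: str, timeout: float = 10.0):
    """Download a URL and decode its body as JSON."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def fetch_top_stories(limit: int = 30, fetch: Callable[[str], object] = fetch_json) -> list:
    """Return up to `limit` story dicts from the current top-stories list.

    `fetch` is injectable so the function can be exercised without network access.
    """
    story_ids = fetch(f"{HN_API}/topstories.json")[:limit]
    stories = []
    for story_id in story_ids:
        item = fetch(f"{HN_API}/item/{story_id}.json")
        if item:  # deleted or missing items come back as null
            stories.append(item)
    return stories


def save_stories(stories: list, path: str) -> None:
    """Write the stories to `path` as indented UTF-8 JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(stories, handle, indent=2, ensure_ascii=False)
```

A daily run could then call `save_stories(fetch_top_stories(), "hn_top.json")`, where the file name is a placeholder.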

knox-academy commented 1 year ago

Dan Carter:

For issue 1, do we have any specific criteria for selecting the libraries? Should we consider factors such as popularity, ease of use, or compatibility with other tools we are using?

For issue 2, do we have any specific requirements for the format of the JSON data? Should we include all available information from the Hacker News website, or only select fields?
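
If we go with "only select fields", one possible shape is sketched below. The field list is hypothetical (the thread has not settled on one); the names are taken from the item objects returned by the public Hacker News API.

```python
import json

# Hypothetical selection of fields; adjust once the requirements are agreed.
FIELDS = ("id", "title", "url", "by", "score", "time", "descendants")


def select_fields(item: dict, fields=FIELDS) -> dict:
    """Keep only the chosen keys; absent keys become None so every record has the same shape."""
    return {name: item.get(name) for name in fields}


raw_item = {"id": 1, "title": "Example story", "by": "someone", "kids": [2, 3], "type": "story"}
record = select_fields(raw_item)
print(json.dumps(record, indent=2))
```

Filling missing keys with `None` (serialized as `null`) is one design choice; dropping absent keys would work too, at the cost of records with varying shapes.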

For issue 3, do we have any specific criteria for determining what constitutes duplicate data? Should we compare based on the entire article or only certain fields?
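
One simple option, assuming each record carries a stable unique `id` field, is to compare on that single key and ignore the rest of the record. The helper below is a sketch of that idea; the function name and the choice of key are illustrative.

```python
def merge_without_duplicates(existing: list, incoming: list, key: str = "id") -> list:
    """Return `existing` plus any `incoming` records whose key has not been seen yet.

    Order is preserved and the first record seen for a given key wins.
    """
    seen = {record[key] for record in existing}
    merged = list(existing)
    for record in incoming:
        if record[key] not in seen:
            seen.add(record[key])
            merged.append(record)
    return merged


saved = [{"id": 1, "title": "first"}]
scraped = [{"id": 1, "title": "first (rescraped)"}, {"id": 2, "title": "second"}]
combined = merge_without_duplicates(saved, scraped)
```

Comparing on `url` instead (`key="url"`) would also catch the same link submitted twice, but text-only posts have no URL, so that variant would need extra handling.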

For issue 4, do we have any specific schedule in mind for running the script? Should we consider factors such as server load or peak usage times?

For issue 5, do we have any specific requirements for the S3 bucket? Should we consider factors such as security, accessibility, or cost?
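
For reference, an upload could look roughly like the sketch below, assuming we use `boto3` (a third-party package). The bucket name, key layout, and the choice of SSE-S3 encryption are placeholders, not agreed requirements. `boto3` reads credentials from the environment, which in a GitHub Actions run would typically be populated from repository secrets.

```python
import json
from datetime import date


def build_key(day: date, prefix: str = "hackernews") -> str:
    """Build a date-partitioned object key, e.g. hackernews/2024/01/05.json."""
    return f"{prefix}/{day:%Y/%m/%d}.json"


def upload_stories(stories: list, bucket: str, key: str, client=None) -> None:
    """Serialize `stories` to JSON and store it as a single S3 object.

    `client` is injectable for testing; by default a boto3 S3 client is created.
    """
    if client is None:
        import boto3  # third-party; imported lazily so build_key works without it

        client = boto3.client("s3")
    client.put_object(
        Bucket=bucket,
        Key=key,
        Body=json.dumps(stories).encode("utf-8"),
        ContentType="application/json",
        ServerSideEncryption="AES256",  # assumption: SSE-S3 is acceptable
    )
```

One object per day keeps each run's output separate, which makes reruns and lifecycle rules easier to reason about; a single growing file would be the alternative.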

For issue 6, do we have any specific testing criteria in mind? Should we consider factors such as edge cases, error handling, or performance?

For issue 7, do we have any specific documentation standards in place? Should we consider factors such as readability, completeness, or version control?

For issue 8, do we have any specific user guide requirements? Should we consider factors such as audience, language, or format?

knox-academy commented 1 year ago

Mike McConnelly:

  1. Issue 1: Research and select appropriate libraries for the Python script based on factors such as popularity, ease of use, and compatibility with other tools we are using.
  2. Issue 2: Determine the specific requirements for the format of the JSON data, including which fields to include from the Hacker News website.
  3. Issue 3: Establish criteria for determining what constitutes duplicate data and implement a method for identifying and removing duplicates.
  4. Issue 4: Determine a schedule for running the script, taking into account factors such as server load and peak usage times.
  5. Issue 5: Establish specific requirements for the S3 bucket, including factors such as security, accessibility, and cost.
  6. Issue 6: Develop testing criteria for the Python script, including edge cases, error handling, and performance.
  7. Issue 7: Establish documentation standards for the Python script, including factors such as readability, completeness, and version control.
  8. Issue 8: Develop a user guide for the Python script, taking into account factors such as audience, language, and format.