rimidalvk opened 1 year ago
Project Update and Task Decomposition:
Connecting to Google Sheets. - Done
Reading Data (links and Publication time) from the main tab. - Done
Reading data from the Config Table and developing the decision-making algorithm - In Progress. Period: 1 day
Module for filtering data based on the main rules (from the config). Period: 1 day
Module for scraper error logging (reading, writing, managing data). Period: 2 days
Module for writing the scraper run results (start time, end time, errors, etc.). Period: 0.5 days
Module for writing the scraper's result data to Google Sheets. Period: 0.5 days
Debugging and refactoring how the modules work together. Period: 1 day
Updates will be provided as tasks are completed.
Update:
Connecting to Google Sheets. - Done
Reading Data (links and Publication time) from the main tab. - Done
Reading data from the Config Table and developing the decision-making algorithm - Done
Module for filtering data based on the main rules (from the config) - Done (testing)
Module for writing the scraper's result data to Google Sheets. - In Progress
Proposed issues for time estimation and planning:
Issue #2: Develop Logger Module
Description: Develop logger.py to handle all logging needs of the application, including cycle start/end times, next cycle plan, IP address, computer name, etc. The logs should be structured as outlined in this Google Sheet: Link.
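A minimal sketch of the record logger.py might build for each cycle, using only the standard library. The field names, the `CycleLog` class, and the rule that the next cycle starts one interval after the previous one ends are assumptions for illustration; the real layout should follow the linked Google Sheet.

```python
import socket
from dataclasses import asdict, dataclass
from datetime import datetime, timedelta


@dataclass
class CycleLog:
    """One log record per scraping cycle (field names are hypothetical)."""
    cycle_start: str
    cycle_end: str
    next_cycle: str
    computer_name: str
    ip_address: str


def local_ip(hostname: str) -> str:
    """Resolve the host's IP address, falling back to 'unknown' on failure."""
    try:
        return socket.gethostbyname(hostname)
    except OSError:
        return "unknown"


def build_cycle_log(start: datetime, end: datetime, interval: timedelta) -> CycleLog:
    """Collect the cycle times and machine identity into one record."""
    hostname = socket.gethostname()
    return CycleLog(
        cycle_start=start.isoformat(timespec="seconds"),
        cycle_end=end.isoformat(timespec="seconds"),
        # Assumption: the next cycle is planned one interval after this one ends.
        next_cycle=(end + interval).isoformat(timespec="seconds"),
        computer_name=hostname,
        ip_address=local_ip(hostname),
    )
```

`list(asdict(log).values())` turns a record into a flat row that could be appended to a log tab.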
Issue #3: Develop Basic Scraper Module
Description: Develop a basic scraper module that will be used as a template for the social media scrapers. The scraper should be able to take a URL and return data based on the provided configuration. Make sure to implement measures that mimic human browsing behavior to prevent being blocked by the social media platforms.
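One possible shape for the template, sketched with the standard library only. The class name `BaseScraper`, its method names, the delay range, and the User-Agent string are all hypothetical. Randomized pauses and a browser-like header are one common way to look less like a bot, and they do not guarantee that a platform will not block the scraper.

```python
import random
import time
import urllib.request
from abc import ABC, abstractmethod


class BaseScraper(ABC):
    """Template for platform scrapers (hypothetical design, not the project's API)."""

    # Placeholder header value; a real scraper would likely rotate realistic ones.
    USER_AGENT = "Mozilla/5.0 (compatible; example-scraper)"

    def __init__(self, min_delay=2.0, max_delay=6.0, sleep=time.sleep):
        self.min_delay = min_delay
        self.max_delay = max_delay
        self._sleep = sleep  # injectable so tests do not actually wait

    def pause(self) -> float:
        """Wait a random interval so requests are not evenly spaced."""
        delay = random.uniform(self.min_delay, self.max_delay)
        self._sleep(delay)
        return delay

    def fetch(self, url: str) -> str:
        """Download the page body as text."""
        request = urllib.request.Request(url, headers={"User-Agent": self.USER_AGENT})
        with urllib.request.urlopen(request, timeout=30) as response:
            return response.read().decode("utf-8", errors="replace")

    @abstractmethod
    def parse(self, html: str, config: dict) -> dict:
        """Extract the fields requested by the configuration."""

    def scrape(self, url: str, config: dict) -> dict:
        """Pause, fetch the URL, parse it, and tag the result with its source."""
        self.pause()
        data = self.parse(self.fetch(url), config)
        data["url"] = url
        return data
```

A platform scraper would subclass this and implement only `parse` (and override `fetch` if the site needs a real browser session).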
Issue #4: Develop LinkedIn Scraper
Description: Using the basic scraper module as a template, develop the linkedin_scraper.py module to handle scraping from LinkedIn.
Issue #5: Develop Reddit Scraper
Description: Using the basic scraper module as a template, develop the reddit_scraper.py module to handle scraping from Reddit.
Issue #6: Develop Medium Scraper
Description: Using the basic scraper module as a template, develop the medium_scraper.py module to handle scraping from Medium.
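With three platform modules, something has to decide which one handles a given link. A small sketch of that dispatch by hostname follows; the `SCRAPER_MODULES` mapping and the `scraper_for` helper are hypothetical, and only the module names come from the issues above.

```python
from urllib.parse import urlparse

# Hypothetical mapping from a platform's domain to the module named in the issues.
SCRAPER_MODULES = {
    "linkedin.com": "linkedin_scraper",
    "reddit.com": "reddit_scraper",
    "medium.com": "medium_scraper",
}


def scraper_for(url: str):
    """Return the module name that should handle the URL, or None if unsupported."""
    host = (urlparse(url).hostname or "").lower()
    for domain, module in SCRAPER_MODULES.items():
        # Match the bare domain and any subdomain (www., old., user blogs, ...).
        if host == domain or host.endswith("." + domain):
            return module
    return None
```

Matching on the dot-prefixed suffix avoids treating an unrelated host such as `notreddit.com` as Reddit.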
Issue #7: Develop Google Sheets Integration
Description: Develop google_sheets.py module that will handle all interactions with Google Sheets, including data storage and retrieval.
Issue #8: Develop Configuration File Parser with Google Sheets Integration
Description: Update config.py to fetch and update configuration settings from Google Sheets at the beginning of each cycle.
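As a rough illustration of the parsing half of this issue, the sketch below turns rows already read from a config tab into a typed dictionary. It assumes a two-column key/value layout, which the thread does not confirm; `parse_config_rows`, `coerce`, and the setting names in the test data are hypothetical.

```python
def coerce(text: str):
    """Convert a cell's text to bool, int, or float when it looks like one."""
    lowered = text.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    return text


def parse_config_rows(rows):
    """Build a settings dict from [key, value] rows, skipping blank or short rows."""
    config = {}
    for row in rows:
        if len(row) < 2 or not str(row[0]).strip():
            continue
        config[str(row[0]).strip()] = coerce(str(row[1]).strip())
    return config
```

Calling this at the start of every cycle with freshly fetched rows would pick up any edits made to the sheet since the previous run.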
Issue #9: Link Scrapers to Google Sheets Integration
Description: Connect the social media scrapers to the Google Sheets integration to ensure data flows correctly from the scraper to Google Sheets.
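The glue between the two sides can be as small as flattening each scraper result into a row in a fixed column order. A sketch under that assumption follows; the `results_to_rows` helper and the column names in the usage line are hypothetical.

```python
def results_to_rows(results, columns):
    """Flatten result dicts into rows of strings, one cell per column.

    Missing fields become empty cells so every row has the same width.
    """
    return [[str(item.get(column, "")) for column in columns] for item in results]
```

For example, `results_to_rows(scraped, ["url", "likes", "comments"])` would produce rows ready to hand to whatever write function the Google Sheets module exposes.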
Issue #10: Implement Cycle Management in main.py
Description: Implement the cycle management described in the project workflow within main.py. This includes running scrapers based on the configuration file, planning the next cycle, and logging.
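A sketch of what one cycle in main.py could look like, with every collaborator passed in as a callable so the loop stays independent of the other modules. The function name `run_cycle`, its parameters, and the `cycle_interval_minutes` setting are assumptions, not part of the described workflow.

```python
from datetime import datetime, timedelta


def run_cycle(load_config, load_links, scrape, write_results, log, now=datetime.now):
    """Run one scraping cycle and return the planned start of the next one."""
    started = now()
    config = load_config()
    results, errors = [], []
    for url in load_links(config):
        try:
            results.append(scrape(url, config))
        except Exception as exc:  # one failing link should not stop the cycle
            errors.append(f"{url}: {exc}")
    write_results(results)
    finished = now()
    # Assumption: the interval lives in the config under this hypothetical key.
    interval = timedelta(minutes=config.get("cycle_interval_minutes", 60))
    next_run = finished + interval
    log({"start": started, "end": finished, "next_cycle": next_run, "errors": errors})
    return next_run
```

The caller can sleep until the returned time and then invoke `run_cycle` again, which also re-reads the configuration each time.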
Issue #11: Implement Post Link Fetching and Filtering
Description: Implement the functionality to fetch all links to posts and comments from Google Sheets and filter them according to the rules defined in the configuration to determine which links will need to be scraped. This should be integrated with the cycle management in main.py.
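The actual filtering rules live in the config and are not spelled out here, so the sketch below uses a single made-up rule, a maximum post age, to show the shape of the step. The `links_to_scrape` function, the `max_age_days` key, and the ISO 8601 timestamp format are all assumptions.

```python
from datetime import datetime, timedelta


def links_to_scrape(rows, config, now):
    """Return URLs whose publication time falls within the configured age limit.

    rows: (url, published) pairs, with published as an ISO 8601 string.
    """
    cutoff = now - timedelta(days=config["max_age_days"])  # hypothetical rule
    selected = []
    for url, published in rows:
        if not url:
            continue  # skip empty rows from the sheet
        if datetime.fromisoformat(published) >= cutoff:
            selected.append(url)
    return selected
```

Further rules from the config could be added as extra conditions in the loop, each deciding whether a link is kept.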