vfedotovs / sslv_web_scraper

ss.lv web scraping app that helps automate scraping and filtering of classifieds, emails the results, and stores the scraped data in a database
GNU General Public License v3.0

BUG(WS)- Report email did not arrive #313

Open vfedotovs opened 1 month ago

vfedotovs commented 1 month ago
# DB container backup dump sizes in bytes
130286 Oct  4 06:05 pg_backup_2024_10_04.sql - ok
130421 Oct  5 06:05 pg_backup_2024_10_05.sql - ok
130421 Oct  6 06:05 pg_backup_2024_10_06.sql - NOT OK
130477 Oct  7 06:05 pg_backup_2024_10_07.sql - ok
130477 Oct  8 06:05 pg_backup_2024_10_08.sql - NOT OK  <<< Email did not arrive at 1.40 AM London time
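
The dump size staying flat on the NOT OK days suggests no new rows were inserted. As a rough monitoring idea (the pg_backup_*.sql naming is taken from the listing above; the directory and the check itself are assumptions, not existing code), something like this could flag the condition before anyone notices the missing email:

import glob
import os

def warn_if_backup_size_unchanged(backup_dir: str = "/backups") -> None:
    """Hypothetical check: warn when the newest dump is the same size as the previous one."""
    dumps = sorted(glob.glob(os.path.join(backup_dir, "pg_backup_*.sql")))
    if len(dumps) < 2:
        return
    prev_size = os.path.getsize(dumps[-2])
    curr_size = os.path.getsize(dumps[-1])
    if curr_size == prev_size:
        print(f"WARNING: {os.path.basename(dumps[-1])} is {curr_size} bytes, "
              f"same as the previous dump; the scrape/email run may have failed")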

The scrape job is triggered at London time minus 1 hour

2024-10-07 00:40:01,370 [INFO ] : run_long_task: 82: Running sendgrid_mailer task: using cloud ws file
2024-10-08 00:40:04,386 [INFO ] : run_long_task: 63: Recieved GET request to start scraping job for ogre city
2024-10-08 00:40:04,752 [INFO ] : check_today_cloud_data_file_exist: 122: Searching for cloud files with todays date: 2024-10-08

<< The Lambda-scraped data file was not available
2024-10-08 00:40:04,752 [INFO ] : check_today_cloud_data_file_exist: 132: File containing today date 2024-10-08 was not found, will try to find local craper file
2024-10-08 00:40:04,771 [INFO ] : check_lst_run_state: 172: File Ogre-raw-data-report-2024-10-08.txt was not found, running  scrape task for Ogre city
2024-10-08 00:40:04,771 [INFO ] : run_long_task: 97: Running scrape_website task will create local ws file
2024-10-08 00:50:10,243 [INFO ] : run_long_task: 99: Running data_formater_main task: using locally scraped file
2024-10-08 00:50:10,250 [INFO ] : run_long_task: 101: Running df_cleaner_main task: using locally scraped file
2024-10-08 00:50:10,254 [INFO ] : run_long_task: 103: Running db_worker_main task: using locally scraped file

The local scrape job completed successfully after 10 minutes

2024-10-08 00:50:02,663 [INFO ] web_scraper : extract_data_from_url: 124: Started scraping data from message URL 62
2024-10-08 00:50:03,076 [INFO ] web_scraper : extract_data_from_url: 127: Successfully retrieved data from https://ss.lv/msg/lv/real-estate/flats/ogre-and-reg/ogre/gblkm.html ads_opt_name table
2024-10-08 00:50:04,463 [INFO ] web_scraper : extract_data_from_url: 133: Successfully retrieved data from https://ss.lv/msg/lv/real-estate/flats/ogre-and-reg/ogre/gblkm.html ads_opt table
2024-10-08 00:50:05,849 [INFO ] web_scraper : extract_data_from_url: 139: Successfully retrieved data from https://ss.lv/msg/lv/real-estate/flats/ogre-and-reg/ogre/gblkm.html ads_price table
2024-10-08 00:50:07,235 [INFO ] web_scraper : extract_data_from_url: 168: Successfully retrieved data from https://ss.lv/msg/lv/real-estate/flats/ogre-and-reg/ogre/gblkm.html msg_footer table
2024-10-08 00:50:10,239 [INFO ] web_scraper : scrape_website: 89: Creating file Ogre-raw-data-report.txt copy in data folder
2024-10-08 00:50:10,239 [INFO ] web_scraper : scrape_website: 91: --- Finished web_scraper module ---
vfedotovs commented 1 month ago

More triage is needed

2024-10-08 00:50:10,247 [WARNI] : cloud_data_formater_main: 125: Lambda scraped raw-data file does not exist, failing back to local scraper source file
WARNING:data_format_changer:Lambda scraped raw-data file does not exist, failing back to local scraper source file
2024-10-08 00:50:10,247 [INFO ] : cloud_data_formater_main: 127: Converting to csv format from local scraped raw-data file: data/Ogre-raw-data-report-2024-10-08.txt format
INFO:data_format_changer:Converting to csv format from local scraped raw-data file: data/Ogre-raw-data-report-2024-10-08.txt format
2024-10-08 00:50:10,248 [INFO ] : create_oneline_report: 192: Converting raw-text 12 lines per entry fromat into  1 line per entry csv file format
INFO:data_format_changer:Converting raw-text 12 lines per entry fromat into  1 line per entry csv file format
2024-10-08 00:50:10,248 [INFO ] : create_oneline_report: 194: Reading data from file : data/Ogre-raw-data-report-2024-10-08.txt
INFO:data_format_changer:Reading data from file : data/Ogre-raw-data-report-2024-10-08.txt
2024-10-08 00:50:10,249 [ERROR] : create_oneline_report: 264: Source raw-data text file: data/Ogre-raw-data-report-2024-10-08.txt does not exist

Root cause: ogre_city_data_frame is None, so the file write/creation fails
2024-10-08 00:50:10,249 [ERROR] : cloud_data_formater_main: 137: ogre_city_data_frame is None
2024-10-08 00:50:10,250 [ERROR] : cloud_data_formater_main: 138: Saving csv format data file pandas_df.csv has failed
2024-10-08 00:50:10,250 [INFO ] : cloud_data_formater_main: 139:  --- Finished data_format_changer module ---
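
Since ogre_city_data_frame comes back as None whenever the raw-data file is missing, the CSV step could refuse to continue instead of logging two errors and moving on. A minimal sketch, assuming names from the log messages above (the real cloud_data_formater_main layout may differ):

import logging

log = logging.getLogger("data_format_changer")

def save_city_frame(ogre_city_data_frame) -> bool:
    """Hypothetical guard: only write pandas_df.csv when a data frame was actually built."""
    if ogre_city_data_frame is None:
        log.error("ogre_city_data_frame is None, skipping pandas_df.csv write")
        return False
    ogre_city_data_frame.to_csv("pandas_df.csv", index=False)
    return True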

2024-10-08 00:50:10,250 [INFO ] : run_long_task: 101: Running df_cleaner_main task: using locally scraped file
2024-10-08 00:50:10,251 [INFO ] : df_cleaner_main: 323:  --- Started df_cleaner module ---
2024-10-08 00:50:10,252 [INFO ] : df_cleaner_main: 329: Loading pandas_df.csv file.

pandas_df.csv was not created by the previous module
2024-10-08 00:50:10,252 [ERROR] : df_cleaner_main: 356: File pandas_df.csv not found

2024-10-08 00:50:10,252 [INFO ] : df_cleaner_main: 358: Loading pandas_df_default.csv file.

Default template file is missing
2024-10-08 00:50:10,253 [ERROR] : df_cleaner_main: 368: pandas_df_default.csv does not exist.
2024-10-08 00:50:10,253 [INFO ] : df_cleaner_main: 371:  --- Completed df_cleaner module ---

2024-10-08 00:50:10,254 [INFO ] : run_long_task: 103: Running db_worker_main task: using locally scraped file
INFO:fastapi:Running db_worker_main task: using locally scraped file
INFO:db_worker: --- Satrting db_worker module ---
INFO:db_worker:Checking if required module file cleaned-sorted-df.csv exits in /
ERROR:db_worker:There was an error opening the file cleaned-sorted-df.csv or file does not exist!
ERROR:    Exception in ASGI application
Traceback (most recent call last):

Crash because db_worker.py does not gracefully handle the missing file
ERROR:db_worker:There was an error opening the file cleaned-sorted-df.csv or file does not exist!
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/app/wsmodules/db_worker.py", line 109, in check_files
    file = open(file_name, 'r')
FileNotFoundError: [Errno 2] No such file or directory: 'cleaned-sorted-df.csv'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
--- cut ---
  File "/usr/local/lib/python3.8/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
  File "/app/main.py", line 104, in run_long_task
    db_worker_main()
  File "/app/wsmodules/db_worker.py", line 59, in db_worker_main
    check_files(requred_files)
  File "/app/wsmodules/db_worker.py", line 115, in check_files
    sys.exit()
SystemExit
INFO:     192.168.144.4:36690 - "GET /run-task/ogre HTTP/1.1" 500 Internal Server Error
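
Rather than calling sys.exit() inside check_files (which FastAPI reports as an unhandled SystemExit and a 500), the missing-file case could be returned to db_worker_main so the request ends cleanly. A rough sketch based on the call chain in the traceback; everything beyond the function names shown there is an assumption:

import logging
import os

log = logging.getLogger("db_worker")

def check_files(required_files):
    """Hypothetical rework: report missing files instead of exiting the process."""
    missing = [name for name in required_files if not os.path.isfile(name)]
    for name in missing:
        log.error("Required file %s does not exist", name)
    return missing

def db_worker_main() -> None:
    missing = check_files(["cleaned-sorted-df.csv"])
    if missing:
        log.error("Aborting db_worker, missing files: %s", missing)
        return  # run_long_task can finish without a 500
    # ... continue with database inserts ...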
vfedotovs commented 1 month ago
<< it seems the file copy failed
2024-10-08 00:50:10,239 [INFO ] web_scraper : scrape_website: 89: Creating file Ogre-raw-data-report.txt copy in data folder
<< no check that the copy succeeded
2024-10-08 00:50:10,239 [INFO ] web_scraper : scrape_website: 91: --- Finished web_scraper module ---

<< the data formatter could not create pandas_df.csv because the previous module did not copy the file to the data folder
2024-10-08 00:50:10,248 [INFO ] : create_oneline_report: 194: Reading data from file : data/Ogre-raw-data-report-2024-10-08.txt
2024-10-08 00:50:10,249 [ERROR] : create_oneline_report: 264: Source raw-data text file: data/Ogre-raw-data-report-2024-10-08.txt does not exist
2024-10-08 00:50:10,249 [ERROR] : cloud_data_formater_main: 137: ogre_city_data_frame is None
2024-10-08 00:50:10,250 [ERROR] : cloud_data_formater_main: 138: Saving csv format data file pandas_df.csv has failed
vfedotovs commented 1 month ago

Affected function:

def create_file_copy() -> None:
    """Creates report file copy in data folder"""
    todays_date = datetime.today().strftime('%Y-%m-%d')
    dest_file = 'Ogre-raw-data-report-' + todays_date + '.txt'
    copy_cmd = 'cp Ogre-raw-data-report.txt local_lambda_raw_scraped_data/' + dest_file
    if not os.path.exists('local_lambda_raw_scraped_data'):
        os.makedirs('local_lambda_raw_scraped_data')
    os.system(copy_cmd)
vfedotovs commented 1 month ago

Proposed improvements

To debug why the file copy might not have completed for the create_file_copy function, you can enhance the function with better error handling, logging, and system call verification. The os.system() call is not the best way to execute shell commands since it doesn't provide clear error messages. Instead, you can use Python's shutil.copy for file operations, or subprocess for executing shell commands, as it provides better control and error reporting.

Steps to improve and debug the function:

  1. Use shutil.copy instead of os.system() for copying the file; it is more reliable for Python file operations.
  2. Add exception handling to capture and log any errors during the file copy.
  3. Check whether the source file exists before attempting the copy.
  4. Add logging to track each step and any errors.

Here's an updated version of the function with these improvements:

import os
import shutil
import logging
from datetime import datetime

# Set up basic logging
logging.basicConfig(level=logging.INFO)
log = logging.getLogger(__name__)

def create_file_copy() -> None:
    """Creates a report file copy in the data folder with error handling and logging."""
    try:
        # Set up date and destination file path
        todays_date = datetime.today().strftime('%Y-%m-%d')
        src_file = 'Ogre-raw-data-report.txt'
        dest_file = f'local_lambda_raw_scraped_data/Ogre-raw-data-report-{todays_date}.txt'

        # Check if source file exists
        if not os.path.exists(src_file):
            log.error(f"Source file '{src_file}' does not exist.")
            return

        # Ensure the destination directory exists
        if not os.path.exists('local_lambda_raw_scraped_data'):
            os.makedirs('local_lambda_raw_scraped_data')
            log.info(f"Created directory 'local_lambda_raw_scraped_data'.")

        # Perform the file copy
        shutil.copy(src_file, dest_file)
        log.info(f"Copied '{src_file}' to '{dest_file}' successfully.")

    except Exception as e:
        log.error(f"An error occurred during the file copy: {e}")
vfedotovs commented 1 month ago

Using shutil.copy() is more reliable than os.system() for file operations in Python for several reasons:

  1. Native Python Functionality: shutil.copy() is a high-level function in Python designed specifically for copying files. Since it’s part of Python’s standard library, it operates within Python’s environment, making it cross-platform, safe, and easy to use. os.system(), on the other hand, simply runs shell commands (like cp in Linux). It doesn’t provide much feedback on errors or success and depends on the underlying operating system's shell commands, making it less portable and harder to debug.
  2. Error Handling and Exceptions: shutil.copy() raises Python exceptions (like FileNotFoundError, PermissionError, etc.) when something goes wrong. You can easily catch these exceptions with try/except blocks and log or handle the errors appropriately. os.system() returns only the command’s exit status (0 for success, non-zero for failure). It doesn’t raise exceptions, so you don’t get detailed feedback on what went wrong unless you capture and parse the output yourself. If the copy fails, you may not even know what the specific issue was.
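
For reference, even the original os.system() call could have surfaced the failure simply by checking its exit status (copy_cmd as defined in the affected function above):

# The original function ignored the exit status; inspecting it would at least log the failure.
status = os.system(copy_cmd)
if status != 0:
    log.error(f"Copy command '{copy_cmd}' failed with exit status {status}")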