vfedotovs / sslv_web_scraper

A web scraping app for ss.lv that automates scraping and filtering of classifieds, emails the results, and stores the scraped data in a database
GNU General Public License v3.0

BUG(ws): scrape job did not complete and report email did not arrive #312

Open vfedotovs opened 1 month ago

vfedotovs commented 1 month ago

Affected version: 1.5.6

root@ff3f174badf5:/# ls -l local_lambda_raw_scraped_data
total 176
-- cut --
-rw-r--r-- 1 root root 15755 Sep 29 00:38 Ogre-raw-data-report-2024-09-27T00-29-39.txt
-rw-r--r-- 1 root root 15749 Sep 28 00:45 Ogre-raw-data-report-2024-09-28.txt
-rw-r--r-- 1 root root 15751 Sep 29 00:45 Ogre-raw-data-report-2024-09-29.txt

<<< missing report for 2024-10-01

-rw-r--r-- 1 root root 16412 Oct  2 00:38 Ogre-raw-data-report-2024-09-30T00-29-48.txt 
-rw-r--r-- 1 root root 16401 Oct  2 00:46 Ogre-raw-data-report-2024-10-02.txt
root@ff3f174badf5:/#
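
The gap in the listing above was spotted by eye. A minimal sketch of an automated check is shown here; it assumes the report date can be read from the file name (both the `YYYY-MM-DD` and `YYYY-MM-DDTHH-MM-SS` forms seen above) rather than from the modification time. The function name `find_missing_report_dates` is hypothetical and not part of the project.

```python
import re
from datetime import date, timedelta
from typing import Iterable, List

# Matches the date part of names such as "Ogre-raw-data-report-2024-09-28.txt"
# and "Ogre-raw-data-report-2024-09-27T00-29-39.txt".
REPORT_DATE_RE = re.compile(r"-raw-data-report-(\d{4})-(\d{2})-(\d{2})")


def find_missing_report_dates(filenames: Iterable[str]) -> List[date]:
    """Return dates between the oldest and newest report that have no file."""
    seen = set()
    for name in filenames:
        match = REPORT_DATE_RE.search(name)
        if match:
            year, month, day = map(int, match.groups())
            seen.add(date(year, month, day))
    if not seen:
        return []

    missing = []
    current, newest = min(seen), max(seen)
    while current < newest:
        if current not in seen:
            missing.append(current)
        current += timedelta(days=1)
    return missing


# File names copied from the directory listing in this issue.
listing = [
    "Ogre-raw-data-report-2024-09-27T00-29-39.txt",
    "Ogre-raw-data-report-2024-09-28.txt",
    "Ogre-raw-data-report-2024-09-29.txt",
    "Ogre-raw-data-report-2024-09-30T00-29-48.txt",
    "Ogre-raw-data-report-2024-10-02.txt",
]
print(find_missing_report_dates(listing))  # [datetime.date(2024, 10, 1)]
```
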

2024-10-01 00:38:56,386 [INFO ] web_scraper : remove_old_file: 115: The file Ogre-raw-data-report.txt does not exist in the current directory.
INFO:web_scraper:The file Ogre-raw-data-report.txt does not exist in the current directory.
INFO:     192.168.144.4:59736 - "GET /run-task/ogre HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/urllib3/connection.py", line 174, in _new_conn
    conn = connection.create_connection(
  File "/usr/local/lib/python3.8/site-packages/urllib3/util/connection.py", line 95, in create_connection
    raise err
  File "/usr/local/lib/python3.8/site-packages/urllib3/util/connection.py", line 85, in create_connection
    sock.connect(sa)
OSError: [Errno 113] No route to host

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 715, in urlopen
    httplib_response = self._make_request(
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 404, in _make_request
    self._validate_conn(conn)
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 1058, in _validate_conn
    conn.connect()
  File "/usr/local/lib/python3.8/site-packages/urllib3/connection.py", line 363, in connect
    self.sock = conn = self._new_conn()
  File "/usr/local/lib/python3.8/site-packages/urllib3/connection.py", line 186, in _new_conn
    raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x7f2b636bc7c0>: Failed to establish a new connection: [Errno 113] No route to host

During handling of the above exception, another exception occurred:

    return session.request(method=method, url=url, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.8/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/requests/adapters.py", line 519, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='www.ss.lv', port=443): Max retries exceeded with url: /lv/real-estate/flats/ogre-and-reg/ogre/sell/ (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f2b636bc7c0>: Failed to establish a new connection: [Errno 113] No route to host'))

vfedotovs commented 1 month ago

A possible fix (still needs triage):

import logging
from typing import Optional

import requests
from requests.exceptions import ConnectionError, Timeout, RequestException

log = logging.getLogger(__name__)


def fetch_data_from_url(url: str) -> Optional[bytes]:
    """Return the response body, or None if the request failed."""
    try:
        response = requests.get(url, timeout=10)  # Fail fast instead of hanging
        response.raise_for_status()  # Raises HTTPError for 4xx/5xx responses
        return response.content
    except Timeout:
        # Checked first: ConnectTimeout subclasses both ConnectionError and Timeout
        log.error("Request to %s timed out.", url)
    except ConnectionError as e:
        # Host unreachable, DNS failure, connection refused, etc.
        log.error("Connection error for %s: %s", url, e)
    except RequestException as e:
        # Any other requests failure, including HTTPError from raise_for_status()
        log.error("Request to %s failed: %s", url, e)
    # Return None, not an error string, so callers never parse a message as page content
    return None
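
The proposal above only reports a failure after it has happened. Since "No route to host" is often transient, a complementary option is to retry with backoff before giving up. The sketch below is an assumption about how that could look, not the project's code: `build_retrying_session` is a hypothetical helper, and the retry count, backoff factor, and status codes are illustrative values that would need tuning. It relies on `requests.adapters.HTTPAdapter` and `urllib3.util.retry.Retry`; by default, `requests` does not retry failed connections.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def build_retrying_session(total_retries: int = 3,
                           backoff_factor: float = 2.0) -> requests.Session:
    """Return a Session that retries failed connections with exponential backoff."""
    retry = Retry(
        total=total_retries,            # overall cap on retries per request
        connect=total_retries,          # covers connection failures such as "No route to host"
        backoff_factor=backoff_factor,  # sleep grows exponentially between attempts
        status_forcelist=(500, 502, 503, 504),  # also retry on these server errors
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session
```

Inside `fetch_data_from_url`, the call `requests.get(url, timeout=10)` could then become `session.get(url, timeout=10)` with a session built once at startup. The existing `except` clauses still apply, because a request that exhausts its retries raises the same `requests` exceptions.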