openzim / wikihow

WikiHow scraper
https://download.kiwix.org/zim/wikihow/
GNU General Public License v3.0
15 stars 2 forks source link

wikihow is not retrying to fetch illustrations #164

Open benoit74 opened 1 month ago

benoit74 commented 1 month ago

Task : https://farm.openzim.org/pipeline/dafaba4f-3b4c-4fa2-a541-e031be49402e/debug (wikihow_fr_maxi)

Logs:

[MainThread::2024-05-06 15:02:06,711] DEBUG:Adding illustrations
[MainThread::2024-05-06 15:17:50,755] ERROR:Interrupting process due to error: ('Connection aborted.', TimeoutError(110, 'Connection timed out'))
[MainThread::2024-05-06 15:17:50,755] ERROR:('Connection aborted.', TimeoutError(110, 'Connection timed out'))
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 715, in urlopen
    httplib_response = self._make_request(
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 404, in _make_request
    self._validate_conn(conn)
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 1058, in _validate_conn
    conn.connect()
  File "/usr/local/lib/python3.8/site-packages/urllib3/connection.py", line 419, in connect
    self.sock = ssl_wrap_socket(
  File "/usr/local/lib/python3.8/site-packages/urllib3/util/ssl_.py", line 449, in ssl_wrap_socket
    ssl_sock = _ssl_wrap_socket_impl(
  File "/usr/local/lib/python3.8/site-packages/urllib3/util/ssl_.py", line 493, in _ssl_wrap_socket_impl
    return ssl_context.wrap_socket(sock, server_hostname=server_hostname)
  File "/usr/local/lib/python3.8/ssl.py", line 500, in wrap_socket
    return self.sslsocket_class._create(
  File "/usr/local/lib/python3.8/ssl.py", line 1073, in _create
    self.do_handshake()
  File "/usr/local/lib/python3.8/ssl.py", line 1342, in do_handshake
    self._sslobj.do_handshake()
TimeoutError: [Errno 110] Connection timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/requests/adapters.py", line 486, in send
    resp = conn.urlopen(
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 799, in urlopen
    retries = retries.increment(
  File "/usr/local/lib/python3.8/site-packages/urllib3/util/retry.py", line 550, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/local/lib/python3.8/site-packages/urllib3/packages/six.py", line 769, in reraise
    raise value.with_traceback(tb)
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 715, in urlopen
    httplib_response = self._make_request(
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 404, in _make_request
    self._validate_conn(conn)
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 1058, in _validate_conn
    conn.connect()
  File "/usr/local/lib/python3.8/site-packages/urllib3/connection.py", line 419, in connect
    self.sock = ssl_wrap_socket(
  File "/usr/local/lib/python3.8/site-packages/urllib3/util/ssl_.py", line 449, in ssl_wrap_socket
    ssl_sock = _ssl_wrap_socket_impl(
  File "/usr/local/lib/python3.8/site-packages/urllib3/util/ssl_.py", line 493, in _ssl_wrap_socket_impl
    return ssl_context.wrap_socket(sock, server_hostname=server_hostname)
  File "/usr/local/lib/python3.8/ssl.py", line 500, in wrap_socket
    return self.sslsocket_class._create(
  File "/usr/local/lib/python3.8/ssl.py", line 1073, in _create
    self.do_handshake()
  File "/usr/local/lib/python3.8/ssl.py", line 1342, in do_handshake
    self._sslobj.do_handshake()
urllib3.exceptions.ProtocolError: ('Connection aborted.', TimeoutError(110, 'Connection timed out'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/wikihow2zim-1.2.3-py3.8.egg/wikihow2zim/scraper.py", line 948, in run
    self.add_illustrations()
  File "/usr/local/lib/python3.8/site-packages/wikihow2zim-1.2.3-py3.8.egg/wikihow2zim/scraper.py", line 170, in add_illustrations
    handle_user_provided_file(source=self.conf.icon, dest=src_illus_fpath)
  File "/usr/local/lib/python3.8/site-packages/zimscraperlib/inputs.py", line 40, in handle_user_provided_file
    stream_file(url=str(source), fpath=dest)
  File "/usr/local/lib/python3.8/site-packages/zimscraperlib/download.py", line 197, in stream_file
    resp = session.get(
  File "/usr/local/lib/python3.8/site-packages/requests/sessions.py", line 602, in get
    return self.request("GET", url, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.8/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/requests/adapters.py", line 501, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', TimeoutError(110, 'Connection timed out'))

Here it clearly looks like the illustration is fetched from an online source (i.e. it is not specified by the user in the recipe configuration) and the fetch through handle_user_provided_file fails. There is no retry mechanism so the thing just stops the scrape.