openzim / ted

Provide the best of TED.com for offline usage!
https://download.kiwix.org/zim/ted/
GNU General Public License v3.0

Timeouts and bad exception handling #132

Closed kelson42 closed 3 months ago

kelson42 commented 2 years ago

https://farm.openzim.org/pipeline/68e01011b365858ae1ec8326/debug

[ted2zim::2022-03-21 19:24:28,682] DEBUG:Using h264 resource link for bitrate=1200
[ted2zim::2022-03-21 19:24:28,684] DEBUG:Successfully inserted video 267 into video list
[ted2zim::2022-03-21 19:24:28,684] DEBUG:Seen /talks/arthur_ganson_moving_sculpture?language=en
[ted2zim::2022-03-21 19:24:28,684] DEBUG:extract_info_from_video_page: https://ted.com/talks/alisa_miller_how_the_news_distorts_our_worldview?language=en
[ted2zim::2022-03-21 19:24:30,752] DEBUG:Using h264 resource link for bitrate=1200
[ted2zim::2022-03-21 19:24:30,760] DEBUG:Successfully inserted video 248 into video list
[ted2zim::2022-03-21 19:24:30,760] DEBUG:Seen /talks/alisa_miller_how_the_news_distorts_our_worldview?language=en
[ted2zim::2022-03-21 19:24:30,760] DEBUG:extract_info_from_video_page: https://ted.com/talks/michael_moschen_juggling_as_art_and_science?language=en
[ted2zim::2022-03-21 19:40:58,154] ERROR:FAILED. An error occurred: ('Connection aborted.', TimeoutError(110, 'Connection timed out'))
[ted2zim::2022-03-21 19:40:58,154] ERROR:('Connection aborted.', TimeoutError(110, 'Connection timed out'))
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 386, in _make_request
    self._validate_conn(conn)
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 1040, in _validate_conn
    conn.connect()
  File "/usr/local/lib/python3.8/site-packages/urllib3/connection.py", line 414, in connect
    self.sock = ssl_wrap_socket(
  File "/usr/local/lib/python3.8/site-packages/urllib3/util/ssl.py", line 449, in ssl_wrap_socket
    ssl_sock = _ssl_wrap_socket_impl(
  File "/usr/local/lib/python3.8/site-packages/urllib3/util/ssl.py", line 493, in _ssl_wrap_socket_impl
    return ssl_context.wrap_socket(sock, server_hostname=server_hostname)
  File "/usr/local/lib/python3.8/ssl.py", line 500, in wrap_socket
    return self.sslsocket_class._create(
  File "/usr/local/lib/python3.8/ssl.py", line 1040, in _create
    self.do_handshake()
  File "/usr/local/lib/python3.8/ssl.py", line 1309, in do_handshake
    self._sslobj.do_handshake()
TimeoutError: [Errno 110] Connection timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/requests/adapters.py", line 440, in send
    resp = conn.urlopen(
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 785, in urlopen
    retries = retries.increment(
  File "/usr/local/lib/python3.8/site-packages/urllib3/util/retry.py", line 550, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/local/lib/python3.8/site-packages/urllib3/packages/six.py", line 769, in reraise
    raise value.with_traceback(tb)
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 386, in _make_request
    self._validate_conn(conn)
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 1040, in _validate_conn
    conn.connect()
  File "/usr/local/lib/python3.8/site-packages/urllib3/connection.py", line 414, in connect
    self.sock = ssl_wrap_socket(
  File "/usr/local/lib/python3.8/site-packages/urllib3/util/ssl.py", line 449, in ssl_wrap_socket
    ssl_sock = _ssl_wrap_socket_impl(
  File "/usr/local/lib/python3.8/site-packages/urllib3/util/ssl.py", line 493, in _ssl_wrap_socket_impl
    return ssl_context.wrap_socket(sock, server_hostname=server_hostname)
  File "/usr/local/lib/python3.8/ssl.py", line 500, in wrap_socket
    return self.sslsocket_class._create(
  File "/usr/local/lib/python3.8/ssl.py", line 1040, in _create
    self.do_handshake()
  File "/usr/local/lib/python3.8/ssl.py", line 1309, in do_handshake
    self._sslobj.do_handshake()
urllib3.exceptions.ProtocolError: ('Connection aborted.', TimeoutError(110, 'Connection timed out'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/ted2zim-2.0.10-py3.8.egg/ted2zim/entrypoint.py", line 190, in main
    scraper.run()
  File "/usr/local/lib/python3.8/site-packages/ted2zim-2.0.10-py3.8.egg/ted2zim/scraper.py", line 1041, in run
    if not self.extract_videos_from_topics(topic):
  File "/usr/local/lib/python3.8/site-packages/ted2zim-2.0.10-py3.8.egg/ted2zim/scraper.py", line 281, in extract_videos_from_topics
    total_videos_scraped = self.generate_search_result_and_scrape(
  File "/usr/local/lib/python3.8/site-packages/ted2zim-2.0.10-py3.8.egg/ted2zim/scraper.py", line 262, in generate_search_result_and_scrape
    nb_videos_extracted, nb_videos_on_page = self.extract_videos_on_topic_page(
  File "/usr/local/lib/python3.8/site-packages/ted2zim-2.0.10-py3.8.egg/ted2zim/scraper.py", line 419, in extract_videos_on_topic_page
    if self.extract_info_from_video_page(url):
  File "/usr/local/lib/python3.8/site-packages/ted2zim-2.0.10-py3.8.egg/ted2zim/scraper.py", line 621, in extract_info_from_video_page
    soup = BeautifulSoup(download_link(url).text, features="html.parser")
  File "/usr/local/lib/python3.8/site-packages/ted2zim-2.0.10-py3.8.egg/ted2zim/utils.py", line 37, in download_link
    req = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
  File "/usr/local/lib/python3.8/site-packages/requests/api.py", line 75, in get
    return request('get', url, params=params, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/requests/api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/requests/sessions.py", line 529, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.8/site-packages/requests/sessions.py", line 645, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/requests/adapters.py", line 501, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', TimeoutError(110, 'Connection timed out'))

benoit74 commented 3 months ago

HTTP timeouts happen, that's life. Bad exception handling should no longer be a problem now that backoff is properly implemented. I don't see what's left to be done in this issue. Could someone give me a hint, or should we close this?
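
For readers wondering what "timeouts plus backoff" can look like in practice, here is a minimal sketch. It is not ted2zim's actual implementation. `retry_with_backoff` and its parameters are hypothetical names invented for illustration, and the `download_link` below only borrows its name from the traceback. The traceback shows `requests.get` being called without a `timeout`, and the log shows roughly 16 minutes between the last DEBUG line and the error, so an explicit timeout would probably have surfaced the failure much sooner. The timeout values are arbitrary.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    func: Callable[[], T],
    *,
    attempts: int = 5,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func(), retrying on `retry_on` with exponentially growing delays.

    Hypothetical helper for illustration; `sleep` is injectable so the
    behaviour can be checked without actually waiting.
    """
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the last error
            sleep(base_delay * 2 ** attempt)  # base_delay, then 2x, 4x, ...
    raise ValueError("attempts must be at least 1")


def download_link(url: str):
    """Hypothetical variant of the helper named in the traceback."""
    import requests  # imported lazily so retry_with_backoff stays stdlib-only

    def fetch():
        # (connect, read) timeouts in seconds; illustrative values only
        return requests.get(
            url, headers={"User-Agent": "Mozilla/5.0"}, timeout=(10, 60)
        )

    return retry_with_backoff(
        fetch,
        retry_on=(
            requests.exceptions.ConnectionError,
            requests.exceptions.Timeout,
        ),
    )
```

With a wrapper like this, a single slow handshake costs a bounded wait and a retry instead of aborting the whole run. Once the attempts are exhausted, the last exception still propagates so the caller can decide whether to skip the video or fail.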