openzim / wikihow

WikiHow scraper
https://download.kiwix.org/zim/wikihow/
GNU General Public License v3.0

TypeError: 'NoneType' object is not callable #139

Closed by kelson42 1 year ago

kelson42 commented 1 year ago

Exception ignored in: <function MagicDetect.__del__ at 0x7fe40facdb80>
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/magic.py", line 308, in __del__
  File "/usr/local/lib/python3.8/site-packages/magic.py", line 135, in close
TypeError: 'NoneType' object is not callable

From https://farm.openzim.org/pipeline/7868c92ab725f3774a725236/debug

rgaudin commented 1 year ago

Actually, this is not the cause of the error but a side effect. The run was interrupted by a 503 from wikihow.com:

[MainThread::2022-09-20 05:40:51,479] INFO:>> Article:Pay-for-Plastic-Surgery
[MainThread::2022-09-20 05:41:21,872] ERROR:Interrupting process due to error: 503 Server Error: first byte timeout for url: https://www.wikihow.com/Pay-for-Plastic-Surgery
[MainThread::2022-09-20 05:41:21,873] ERROR:503 Server Error: first byte timeout for url: https://www.wikihow.com/Pay-for-Plastic-Surgery
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/wikihow2zim-1.2.1-py3.8.egg/wikihow2zim/scraper.py", line 991, in run
    self.scrape_articles()
  File "/usr/local/lib/python3.8/site-packages/wikihow2zim-1.2.1-py3.8.egg/wikihow2zim/scraper.py", line 513, in scrape_articles
    if not self.scrape_article(article):
  File "/usr/local/lib/python3.8/site-packages/wikihow2zim-1.2.1-py3.8.egg/wikihow2zim/scraper.py", line 623, in scrape_article
    raise exc
  File "/usr/local/lib/python3.8/site-packages/wikihow2zim-1.2.1-py3.8.egg/wikihow2zim/scraper.py", line 615, in scrape_article
    soup, _ = get_soup(f"/{article}")
  File "/usr/local/lib/python3.8/site-packages/wikihow2zim-1.2.1-py3.8.egg/wikihow2zim/utils.py", line 148, in get_soup
    content, paths = fetch(path, **params)
  File "/usr/local/lib/python3.8/site-packages/wikihow2zim-1.2.1-py3.8.egg/wikihow2zim/utils.py", line 77, in fetch
    resp.raise_for_status()
  File "/usr/local/lib/python3.8/site-packages/requests/models.py", line 953, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 503 Server Error: first byte timeout for url: https://www.wikihow.com/Pay-for-Plastic-Surgery
[MainThread::2022-09-20 05:41:21,876] DEBUG:shutting down executor IMG-T- with wait=False
[MainThread::2022-09-20 05:41:21,876] DEBUG:shutting down executor VID-T- with wait=False
[MainThread::2022-09-20 05:41:21,876] DEBUG:Removing /output/www.wikihow.com_dii94lnc
[MainThread::2022-09-20 05:41:21,961] DEBUG:Images 126417/126419
Exception ignored in: <function MagicDetect.__del__ at 0x7fe40facdb80>
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/magic.py", line 308, in __del__
  File "/usr/local/lib/python3.8/site-packages/magic.py", line 135, in close
TypeError: 'NoneType' object is not callable

503 errors happen from time to time, especially on long-lasting runs. Reopening #122 to at least add a pause+retry on 503 errors instead of just giving up.
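
As a rough illustration of the pause+retry idea, here is a minimal sketch. It is not the scraper's actual code: `retry_on` and `is_503` are hypothetical helper names, and the attempt count, pause and backoff values are arbitrary assumptions. It only relies on the fact that `requests.exceptions.HTTPError` exposes the failed response as `exc.response`, whose `status_code` can be checked.

```python
import time


def is_503(exc):
    """Return True if exc looks like an HTTP error carrying a 503 response."""
    # requests' HTTPError stores the response on `.response`;
    # other exceptions simply have no such attribute.
    response = getattr(exc, "response", None)
    return getattr(response, "status_code", None) == 503


def retry_on(is_retryable, attempts=5, pause=30.0, backoff=2.0, sleep=time.sleep):
    """Build a caller that retries a function on retryable errors (hypothetical helper).

    attempts: total number of tries, including the first one.
    pause:    seconds to wait before the second try.
    backoff:  multiplier applied to the wait after each failed try.
    """

    def call(func, *args, **kwargs):
        delay = pause
        for attempt in range(1, attempts + 1):
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                # Re-raise non-retryable errors, or the last retryable one.
                if attempt == attempts or not is_retryable(exc):
                    raise
                sleep(delay)
                delay *= backoff

    return call
```

Assuming the failing call is the `fetch(path, **params)` seen in the traceback, it could be wrapped as `retry_on(is_503)(fetch, path, **params)`, so a 503 pauses and retries while any other error still interrupts the run as before.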