openzim / gutenberg

Scraper for downloading the entire ebook repository of Project Gutenberg
https://download.kiwix.org/zim/gutenberg
GNU General Public License v3.0

Gutenberg run failed on zimfarm #126

Closed. satyamtg closed this issue 4 years ago.

satyamtg commented 4 years ago

The last gutenberg run (https://farm.openzim.org/pipeline/5ef4da41443d22424a730e05/debug) failed, possibly due to a connection error while downloading a book. The scraper shouldn't crash on such errors, but it should definitely report them.

This is the last log line from the scraper, showing the error:

requests.exceptions.ConnectTimeout: HTTPConnectionPool(host='aleph.gutenberg.org', port=80): Max retries exceeded with url: /5/2/5/5254/5254-pdf.pdf (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7fc4d3b67390>, 'Connection to aleph.gutenberg.org timed out. (connect timeout=20)'))
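For illustration, a minimal sketch of the kind of handling being asked for, assuming a plain requests-based download helper (the function name, parameters, and retry behaviour are hypothetical, not the scraper's actual API):

```python
# Illustrative sketch only: wrap a single book download so a connection
# failure is logged and skipped instead of crashing the whole run.
import logging

import requests

logger = logging.getLogger(__name__)


def download_book(url, dest_path, timeout=20):
    """Try to download one book; return True on success, False on failure."""
    try:
        resp = requests.get(url, timeout=timeout, stream=True)
        resp.raise_for_status()
    except requests.exceptions.RequestException as exc:
        # ConnectTimeout, ConnectionError, HTTPError, ... all inherit from
        # RequestException, so one handler covers the failure seen in the log.
        logger.error("Failed to download %s: %s", url, exc)
        return False
    with open(dest_path, "wb") as fh:
        for chunk in resp.iter_content(chunk_size=1024 * 64):
            fh.write(chunk)
    return True
```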
kelson42 commented 4 years ago

@satyamtg If the backend does not respond anymore, then the scraper should stop. Seems OK to me. The question is more why the backend does not respond?!

satyamtg commented 4 years ago

@kelson42 The backend server didn't respond for one particular URL (which can happen for multiple reasons), but the scraper didn't even try to move on. This happened in the step where the scraper checks whether the resource exists.

Also, the URL it was trying to access didn't even exist, which made me look at the URL-generation part. I found that if the filtered URL list (from the sync data) was empty, it fell back to trying every generated combination. That fallback was probably only meant for development (I didn't touch it when I refactored). It also makes the resource-existence check unnecessary outside of a development scenario, since we always have consistent data from the rsync step. I have commented those parts out; a rough sketch of the intended behaviour follows below.
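A rough sketch of that guard, under the assumption that the helpers shown (build_url_list, resource_exists, and the inputs) are hypothetical names rather than the repository's actual code: the existence check only runs as a development fallback when the filtered list from the sync data is empty.

```python
# Illustrative sketch only: names and signatures are assumptions, not the
# scraper's real API.
import logging

import requests

logger = logging.getLogger(__name__)


def resource_exists(url, timeout=20):
    """HEAD-check a URL; treat any connection problem as 'does not exist'."""
    try:
        return requests.head(url, timeout=timeout).status_code == 200
    except requests.exceptions.RequestException as exc:
        logger.warning("Could not check %s: %s", url, exc)
        return False


def build_url_list(filtered_urls, all_combinations):
    """Prefer the filtered URLs from the sync data; only probe every generated
    combination (development scenario) when that list is empty."""
    if filtered_urls:
        # Data from the rsync step is consistent, so no existence check is needed.
        return list(filtered_urls)
    logger.warning("Filtered URL list is empty; probing all generated combinations")
    return [url for url in all_combinations if resource_exists(url)]
```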