kelson42 closed this 4 years ago
@rgaudin Should we close this ticket? Do you think it is outdated after the few fixes you have made?
No. I looked at it after the fixes, but I doubt it's fixed. It looks like a latin-1 encoded string being decoded as UTF-8, probably because the encoding is reported incorrectly.
We'll have to wait for the first run to decide.
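To illustrate the suspected failure mode, here is a minimal sketch (not the gutenberg2zim code). Byte 0xe9 is "é" in latin-1, and when it is followed by a plain ASCII byte, a UTF-8 decoder rejects it as an invalid continuation byte. The helper name `decode_with_fallback` and the sample bytes are assumptions made up for this example.

```python
def decode_with_fallback(raw, declared="utf-8", fallback="latin-1"):
    """Decode bytes with the declared encoding, falling back if that fails.

    Hypothetical helper: latin-1 maps every byte to a code point, so the
    fallback never raises, although it may produce the wrong characters
    if the data was in yet another encoding.
    """
    try:
        return raw.decode(declared)
    except UnicodeDecodeError:
        return raw.decode(fallback)


# "café au lait" encoded as latin-1: the "é" is the single byte 0xe9.
latin1_bytes = b"caf\xe9 au lait"

try:
    latin1_bytes.decode("utf-8")
except UnicodeDecodeError as exc:
    # Same class of error as the one reported on the zimfarm.
    print("strict decode failed:", exc)

print(decode_with_fallback(latin1_bytes))
```

Falling back silently hides the incorrect encoding report rather than fixing it, so logging each fallback would probably be wise in a real scraper.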
We hit this bug on the zimfarm.
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 18: invalid continuation byte
Also, since this happens in a multiprocessing Pool, it doesn't return properly and the task stays idle forever.
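One possible way to keep a single failing book from taking the whole run down is to catch exceptions inside the worker and return them as data, then shut the pool down explicitly. This is only a sketch under assumptions: `download_book`, `safe_download` and `download_all` are hypothetical names rather than the functions in `download.py`, and whether this would cure the hang seen on the zimfarm is not confirmed.

```python
from multiprocessing import Pool


def download_book(book_id):
    # Hypothetical stand-in for the real per-book download.
    # One made-up ID fails with the same class of error as in this ticket.
    if book_id == 20244:
        b"caf\xe9 au lait".decode("utf-8")
    return "ok"


def safe_download(book_id):
    """Run download_book and report a failure as data instead of raising."""
    try:
        return book_id, download_book(book_id), None
    except Exception as exc:
        # Return a plain string so the result always pickles cleanly.
        return book_id, None, "{}: {}".format(type(exc).__name__, exc)


def download_all(book_ids, concurrency=2):
    """Map safe_download over book_ids and always tear the pool down."""
    pool = Pool(concurrency)
    try:
        return pool.map(safe_download, book_ids)
    finally:
        # Stop the workers whether map() returned or raised.
        pool.terminate()
        pool.join()
```

Calling `download_all([30336, 20244, 18381])` from a script's main module would return one `(book_id, result, error)` tuple per book, so the caller can log the failures and still exit with a proper status.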
Starting new HTTP connection (1): aleph.gutenberg.org
curl --fail --insecure --location --silent --show-error -C - --url http://aleph.gutenberg.org/cache/epub/26781/pg26781.epub --output dl-cache/26781.epub
(u'audio/mpeg', False, u'{id}-13.mp3'),
(u'audio/ogg', False, u'{id}-08.ogg'),
(u'audio/mpeg', False, u'{id}-06.mp3'),
(u'audio/ogg', False, u'{id}-10.spx'),
(u'audio/mpeg', False, u'{id}-08.mp3'),
(u'audio/ogg', False, u'{id}-03.ogg'),
(u'audio/ogg', False, u'{id}-06.spx'),
(u'audio/ogg', False, u'{id}-02.ogg'),
(u'audio/ogg', False, u'{id}-12.spx'),
(u'audio/ogg', False, u'{id}-09.spx'),
(u'audio/mp4', False, u'{id}-07.m4b'),
(u'audio/mpeg', False, u'{id}-07.mp3'),
(u'audio/mpeg', False, u'{id}-11.mp3'),
(u'audio/ogg', False, u'{id}-07.ogg'),
(u'audio/ogg', False, u'{id}-11.spx'),
(u'audio/mp4', False, u'{id}-02.m4b'),
(u'application/rdf+xml', False, u'{id}.rdf'),
(u'audio/ogg', False, u'{id}-05.spx'),
(u'audio/ogg', False, u'{id}-12.ogg'),
(u'audio/mpeg', False, u'{id}-10.mp3'),
(u'audio/mpeg', False, u'{id}-12.mp3'),
(u'audio/mp4', False, u'{id}-10.m4b'),
(u'audio/mp4', False, u'{id}-12.m4b'),
(u'audio/mpeg', False, u'{id}-01.mp3'),
(u'audio/ogg', False, u'{id}-13.spx'),
(u'audio/mpeg', False, u'{id}-04.mp3'),
(u'audio/ogg', False, u'{id}-03.spx'),
(u'audio/mp4', False, u'{id}-04.m4b'),
(u'audio/ogg', False, u'{id}-01.spx'),
(u'audio/ogg', False, u'{id}-13.ogg'),
(u'audio/ogg', False, u'{id}-10.ogg'),
(u'audio/ogg', False, u'{id}-11.ogg'),
(u'audio/ogg', False, u'{id}-07.spx'),
(u'audio/mp4', False, u'{id}-08.m4b'),
(u'audio/mp4', False, u'{id}-09.m4b'),
(u'audio/ogg', False, u'{id}-04.spx'),
(u'audio/ogg', False, u'{id}-08.spx'),
(u'audio/mpeg', False, u'{id}-02.mp3'),
(u'audio/mp4', False, u'{id}-01.m4b'),
(u'audio/ogg', False, u'{id}-02.spx'),
(u'audio/ogg', False, u'{id}-06.ogg'),
(u'audio/mp4', False, u'{id}-06.m4b'),
(u'text/html', False, u'{id}-index.html')]
[u'http://aleph.gutenberg.org/etext05/19293-h.htm',
u'http://aleph.gutenberg.org/etext92/19293-h.htm',
u'http://aleph.gutenberg.org/etext99/19293-h.htm',
u'http://aleph.gutenberg.org/etext03/19293-h.htm',
u'http://aleph.gutenberg.org/1/9/2/9/19293/19293-h.htm',
u'http://aleph.gutenberg.org/etext00/19293-h.htm',
u'http://aleph.gutenberg.org/etext01/19293-h.htm',
u'http://aleph.gutenberg.org/etext90/19293-h.htm',
u'http://aleph.gutenberg.org/1/9/2/9/19293/19293-h.html',
u'http://aleph.gutenberg.org/etext96/19293-h.htm',
u'http://aleph.gutenberg.org/etext93/19293-h.htm',
u'http://aleph.gutenberg.org/etext02/19293-h.htm',
u'http://aleph.gutenberg.org/etext98/19293-h.htm',
u'http://aleph.gutenberg.org/cache/epub/19293/pg19293.html.utf8',
u'http://aleph.gutenberg.org/etext95/19293-h.htm',
u'http://aleph.gutenberg.org/etext94/19293-h.htm',
u'http://aleph.gutenberg.org/1/9/2/9/19293/19293-h.zip',
u'http://aleph.gutenberg.org/etext91/19293-h.htm',
u'http://aleph.gutenberg.org/etext04/19293-h.htm',
u'http://aleph.gutenberg.org/etext97/19293-h.htm']
Traceback (most recent call last):
File "/usr/local/bin/gutenberg2zim", line 4, in <module>
Downloading content files for Book #30336
__import__('pkg_resources').run_script('gutenberg2zim==1.1.3.0', 'gutenberg2zim')
File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 658, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 1438, in run_script
exec(code, namespace, namespace)
File "/usr/local/lib/python2.7/dist-packages/gutenberg2zim-1.1.3.0-py2.7.egg/EGG-INFO/scripts/gutenberg2zim", line 219, in <module>
main(docopt(help, version=VERSION))
File "/usr/local/lib/python2.7/dist-packages/gutenberg2zim-1.1.3.0-py2.7.egg/EGG-INFO/scripts/gutenberg2zim", line 168, in main
force=FORCE)
File "/usr/local/lib/python2.7/dist-packages/gutenberg2zim-1.1.3.0-py2.7.egg/gutenbergtozim/download.py", line 228, in download_all_books
Pool(concurrency).map(dlb, available_books)
File "/usr/lib/python2.7/multiprocessing/pool.py", line 253, in map
return self.map_async(func, iterable, chunksize).get()
File "/usr/lib/python2.7/multiprocessing/pool.py", line 572, in get
raise self._value
http://aleph.gutenberg.org:80 "GET /etext90/18380-h.htm HTTP/1.1" 404 None
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 18: invalid continuation byte
http://aleph.gutenberg.org:80 "GET /etext04/19298-h.html HTTP/1.1" 404 None
Downloading content files for Book #18381
http://aleph.gutenberg.org:80 "GET /etext90/23105.html.noimages HTTP/1.1" 404 None
Starting new HTTP connection (1): aleph.gutenberg.org
Starting new HTTP connection (1): aleph.gutenberg.org
[pdf] not avail. for #20244# Le Voluptueux Voyage
[epub] Requesting URLs for #30336# Hours in a Library, Volume 2
[epub] Requesting URLs for #18381# De Lotgevallen van Tom Sawyer
Starting new HTTP connection (1): aleph.gutenberg.org
http://aleph.gutenberg.org:80 "GET /etext99/27711-h.zip HTTP/1.1" 404 None
http://aleph.gutenberg.org:80 "GET /etext01/15769-h.zip HTTP/1.1" 404 None
Think this is gone now 👍
@rgaudin @satyamtg Not sure this has the same root cause, but the symptom looks really similar. Look at the last scrape log https://farm.openzim.org/pipeline/5eff0ba0a96db4d3a374d0e2/debug.
No, it's different. I've opened #132
Running
but the process somehow stops at: