openzim / gutenberg

Scraper for downloading the entire ebooks repository of project Gutenberg
https://download.kiwix.org/zim/gutenberg
GNU General Public License v3.0

UnicodeDecodeError: invalid continuation byte #71

Closed kelson42 closed 4 years ago

kelson42 commented 5 years ago

I ran

./gutenberg2zim

but the process stops at:

[epub] Requesting URLs for #24010# The Gods are Athirst
curl --fail --insecure --location --silent --show-error -C - --url http://aleph.gutenberg.org/cache/epub/24010/pg24010.epub --output dl-cache/24010.epub
    Downloading content files for Book #24810
[epub] Requesting URLs for #24810# The Better Germany in War Time: Being Some Facts Towards Fellowship
curl --fail --insecure --location --silent --show-error -C - --url http://aleph.gutenberg.org/cache/epub/24810/pg24810.epub --output dl-cache/24810.epub
http://aleph.gutenberg.org:80 "GET /etext00/27595-h.htm HTTP/1.1" 404 None
Starting new HTTP connection (1): aleph.gutenberg.org
    Downloading content files for Book #26657
[epub] Requesting URLs for #26657# The Motor Pirate
curl --fail --insecure --location --silent --show-error -C - --url http://aleph.gutenberg.org/cache/epub/26657/pg26657.epub --output dl-cache/26657.epub
http://aleph.gutenberg.org:80 "GET /cache/epub/22089/risefreeman.pdf HTTP/1.1" 404 None
Starting new HTTP connection (1): aleph.gutenberg.org
    Downloading content files for Book #15765
[epub] Requesting URLs for #15765# Kaukonäkijä: eli kuvauksia Ruijasta
curl --fail --insecure --location --silent --show-error -C - --url http://aleph.gutenberg.org/cache/epub/15765/pg15765.epub --output dl-cache/15765.epub
http://aleph.gutenberg.org:80 "GET /etext94/27595-h.htm HTTP/1.1" 404 None
Starting new HTTP connection (1): aleph.gutenberg.org
http://aleph.gutenberg.org:80 "GET /cache/epub/22089/libertyball.pdf HTTP/1.1" 404 None
Starting new HTTP connection (1): aleph.gutenberg.org
[pdf] not avail. for #16627# Angelic Wisdom Concerning the Divine Love and the Divine Wisdom
[html] Requesting URLs for #16627# Angelic Wisdom Concerning the Divine Love and the Divine Wisdom
curl --fail --insecure --location --silent --show-error -C - --url http://aleph.gutenberg.org/cache/epub/16627/pg16627.html.utf8 --output dl-cache/16627.html
http://aleph.gutenberg.org:80 "GET /etext01/27595-h.htm HTTP/1.1" 404 None
Starting new HTTP connection (1): aleph.gutenberg.org
http://aleph.gutenberg.org:80 "GET /cache/epub/22089/weareallchildren.pdf HTTP/1.1" 404 None
Starting new HTTP connection (1): aleph.gutenberg.org
    Downloading content files for Book #25662
[epub] Requesting URLs for #25662# A report on the feasibility and advisability of some policy to inaugurate a system of rifle practice throughout the public schools of the country
curl --fail --insecure --location --silent --show-error -C - --url http://aleph.gutenberg.org/cache/epub/25662/pg25662.epub --output dl-cache/25662.epub
http://aleph.gutenberg.org:80 "GET /etext97/27595-h.htm HTTP/1.1" 404 None
NO FILE FOR #27595/html
[u'http://aleph.gutenberg.org/etext97/27595-h.htm',
 u'http://aleph.gutenberg.org/etext01/27595-h.htm',
 u'http://aleph.gutenberg.org/etext94/27595-h.htm',
 u'http://aleph.gutenberg.org/etext00/27595-h.htm',
 u'http://aleph.gutenberg.org/etext02/27595-h.htm',
 u'http://aleph.gutenberg.org/2/7/5/9/27595/27595-h.zip',
 u'http://aleph.gutenberg.org/etext99/27595-h.htm',
 u'http://aleph.gutenberg.org/etext90/27595-h.htm',
 u'http://aleph.gutenberg.org/etext93/27595-h.htm',
 u'http://aleph.gutenberg.org/etext96/27595-h.htm',
 u'http://aleph.gutenberg.org/etext95/27595-h.htm',
 u'http://aleph.gutenberg.org/etext03/27595-h.htm',
 u'http://aleph.gutenberg.org/etext04/27595-h.htm',
 u'http://aleph.gutenberg.org/etext05/27595-h.htm',
 u'http://aleph.gutenberg.org/etext98/27595-h.htm',
 u'http://aleph.gutenberg.org/etext92/27595-h.htm',
 u'http://aleph.gutenberg.org/2/7/5/9/27595/27595-h.html',
 u'http://aleph.gutenberg.org/cache/epub/27595/pg27595.html.utf8',
 u'http://aleph.gutenberg.org/etext91/27595-h.htm',
 u'http://aleph.gutenberg.org/2/7/5/9/27595/27595-h.htm']
    Downloading content files for Book #27596
http://aleph.gutenberg.org:80 "GET /cache/epub/22089/comfort.pdf HTTP/1.1" 404 None
Starting new HTTP connection (1): aleph.gutenberg.org
http://aleph.gutenberg.org:80 "GET /cache/epub/22089/freedomsgathering.pdf HTTP/1.1" 404 None
Starting new HTTP connection (1): aleph.gutenberg.org
http://aleph.gutenberg.org:80 "GET /cache/epub/22089/monarch.pdf HTTP/1.1" 404 None
Starting new HTTP connection (1): aleph.gutenberg.org
http://aleph.gutenberg.org:80 "GET /cache/epub/22089/stranger.pdf HTTP/1.1" 404 None
Starting new HTTP connection (1): aleph.gutenberg.org
http://aleph.gutenberg.org:80 "GET /cache/epub/22089/break.pdf HTTP/1.1" 404 None
Starting new HTTP connection (1): aleph.gutenberg.org
http://aleph.gutenberg.org:80 "GET /cache/epub/22089/clarion.pdf HTTP/1.1" 404 None
Starting new HTTP connection (1): aleph.gutenberg.org
http://aleph.gutenberg.org:80 "GET /cache/epub/22089/slaveswrongs.pdf HTTP/1.1" 404 None
Starting new HTTP connection (1): aleph.gutenberg.org
[epub] Requesting URLs for #27596# 隋唐嘉話
curl --fail --insecure --location --silent --show-error -C - --url http://aleph.gutenberg.org/cache/epub/27596/pg27596.epub --output dl-cache/27596.epub
[pdf] not avail. for #24010# The Gods are Athirst
[html] Requesting URLs for #24010# The Gods are Athirst
curl --fail --insecure --location --silent --show-error -C - --url http://aleph.gutenberg.org/2/4/0/1/24010/24010-h.zip --output dl-cache/24010.html.zip
[pdf] not avail. for #26657# The Motor Pirate
[html] Requesting URLs for #26657# The Motor Pirate
curl --fail --insecure --location --silent --show-error -C - --url http://aleph.gutenberg.org/2/6/6/5/26657/26657-h.zip --output dl-cache/26657.html.zip
[pdf] not avail. for #28370# Nouvelle géographie universelle (1/19)
[html] Requesting URLs for #28370# Nouvelle géographie universelle (1/19)
curl --fail --insecure --location --silent --show-error -C - --url http://aleph.gutenberg.org/2/8/3/7/28370/28370-h.zip --output dl-cache/28370.html.zip
    Downloading content files for Book #29033
[epub] Requesting URLs for #29033# Critical Miscellanies (Vol. 3 of 3), Essay 10: Auguste Comte
curl --fail --insecure --location --silent --show-error -C - --url http://aleph.gutenberg.org/cache/epub/29033/pg29033.epub --output dl-cache/29033.epub
http://aleph.gutenberg.org:80 "GET /cache/epub/22089/yespirits.pdf HTTP/1.1" 404 None
    Downloading content files for Book #19163
Starting new HTTP connection (1): aleph.gutenberg.org
[epub] Requesting URLs for #19163# Märchen für Kinder
curl --fail --insecure --location --silent --show-error -C - --url http://aleph.gutenberg.org/cache/epub/19163/pg19163.epub --output dl-cache/19163.epub
[pdf] not avail. for #25662# A report on the feasibility and advisability of some policy to inaugurate a system of rifle practice throughout the public schools of the country
[html] Requesting URLs for #25662# A report on the feasibility and advisability of some policy to inaugurate a system of rifle practice throughout the public schools of the country
[pdf] not avail. for #24810# The Better Germany in War Time: Being Some Facts Towards Fellowship
[html] Requesting URLs for #24810# The Better Germany in War Time: Being Some Facts Towards Fellowship
curl --fail --insecure --location --silent --show-error -C - --url http://aleph.gutenberg.org/2/5/6/6/25662/25662-h.zip --output dl-cache/25662.html.zip
curl --fail --insecure --location --silent --show-error -C - --url http://aleph.gutenberg.org/2/4/8/1/24810/24810-h.zip --output dl-cache/24810.html.zip
[pdf] not avail. for #15765# Kaukonäkijä: eli kuvauksia Ruijasta
[html] Requesting URLs for #15765# Kaukonäkijä: eli kuvauksia Ruijasta
curl --fail --insecure --location --silent --show-error -C - --url http://aleph.gutenberg.org/cache/epub/15765/pg15765.html.utf8 --output dl-cache/15765.html
    Downloading content files for Book #16628
[epub] Requesting URLs for #16628# Punch, or the London Charivari, Volume 159, August 4th, 1920
curl --fail --insecure --location --silent --show-error -C - --url http://aleph.gutenberg.org/cache/epub/16628/pg16628.epub --output dl-cache/16628.epub
[pdf] not avail. for #17418# The Black Pearl
[html] Requesting URLs for #17418# The Black Pearl
curl --fail --insecure --location --silent --show-error -C - --url http://aleph.gutenberg.org/1/7/4/1/17418/17418-h.zip --output dl-cache/17418.html.zip
[pdf] not avail. for #20115# A Short History of the 6th Division: Aug. 1914-March 1919
[html] Requesting URLs for #20115# A Short History of the 6th Division: Aug. 1914-March 1919
curl --fail --insecure --location --silent --show-error -C - --url http://aleph.gutenberg.org/2/0/1/1/20115/20115-h.zip --output dl-cache/20115.html.zip
http://aleph.gutenberg.org:80 "GET /cache/epub/22089/lawoflove.pdf HTTP/1.1" 404 None
Starting new HTTP connection (1): aleph.gutenberg.org
[pdf] not avail. for #27596# 隋唐嘉話
[html] Requesting URLs for #27596# 隋唐嘉話
curl --fail --insecure --location --silent --show-error -C - --url http://aleph.gutenberg.org/cache/epub/27596/pg27596.html.utf8 --output dl-cache/27596.html
    Downloading content files for Book #25663
[epub] Requesting URLs for #25663# Printers' Marks
curl --fail --insecure --location --silent --show-error -C - --url http://aleph.gutenberg.org/cache/epub/25663/pg25663.epub --output dl-cache/25663.epub
http://aleph.gutenberg.org:80 "GET /cache/epub/22089/yesonsoffreemen.pdf HTTP/1.1" 404 None
Starting new HTTP connection (1): aleph.gutenberg.org
[pdf] not avail. for #29033# Critical Miscellanies (Vol. 3 of 3), Essay 10: Auguste Comte
[html] Requesting URLs for #29033# Critical Miscellanies (Vol. 3 of 3), Essay 10: Auguste Comte
curl --fail --insecure --location --silent --show-error -C - --url http://aleph.gutenberg.org/2/9/0/3/29033/29033-h.zip --output dl-cache/29033.html.zip
[pdf] not avail. for #16628# Punch, or the London Charivari, Volume 159, August 4th, 1920
[html] Requesting URLs for #16628# Punch, or the London Charivari, Volume 159, August 4th, 1920
curl --fail --insecure --location --silent --show-error -C - --url http://aleph.gutenberg.org/1/6/6/2/16628/16628-h.zip --output dl-cache/16628.html.zip
    Downloading content files for Book #26658
[epub] Requesting URLs for #26658# Celebrated Travels and Travellers, Part 3.
http://aleph.gutenberg.org:80 "GET /cache/epub/22089/childisgone.pdf HTTP/1.1" 404 None
Starting new HTTP connection (1): aleph.gutenberg.org
curl --fail --insecure --location --silent --show-error -C - --url http://aleph.gutenberg.org/cache/epub/26658/pg26658.epub --output dl-cache/26658.epub
    Downloading content files for Book #20116
[epub] Requesting URLs for #20116# The Belief in Immortality and the Worship of the Dead, Volume 1 (of 3)
curl --fail --insecure --location --silent --show-error -C - --url http://aleph.gutenberg.org/cache/epub/20116/pg20116.epub --output dl-cache/20116.epub
[pdf] not avail. for #19163# Märchen für Kinder
[html] Requesting URLs for #19163# Märchen für Kinder
curl --fail --insecure --location --silent --show-error -C - --url http://aleph.gutenberg.org/1/9/1/6/19163/19163-h.zip --output dl-cache/19163.html.zip
    Downloading content files for Book #24811
[epub] Requesting URLs for #24811# Viking Tales
curl --fail --insecure --location --silent --show-error -C - --url http://aleph.gutenberg.org/cache/epub/24811/pg24811.epub --output dl-cache/24811.epub
    Downloading content files for Book #29725
Traceback (most recent call last):
  File "./gutenberg2zim", line 214, in <module>
    main(docopt(help, version=0.1))
  File "./gutenberg2zim", line 167, in main
    force=FORCE)
    Downloading content files for Book #15766
  File "/media/kelson/SOTOKI/gutenberg/gutenbergtozim/download.py", line 226, in download_all_books
    Pool(concurrency).map(dlb, available_books)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 253, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 572, in get
    raise self._value
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 18: invalid continuation byte
[epub] Requesting URLs for #15766# The Claverings
[epub] Requesting URLs for #29725# The Fairchild Family
    Downloading content files for Book #27597
[epub] Requesting URLs for #27597# The English Utilitarians, Volume 1 (of 3)
    Downloading content files for Book #29034
curl --fail --insecure --location --silent --show-error -C - --url http://aleph.gutenberg.org/cache/epub/15766/pg15766.epub --output dl-cache/15766.epub
curl --fail --insecure --location --silent --show-error -C - --url http://aleph.gutenberg.org/cache/epub/29725/pg29725.epub --output dl-cache/29725.epub
[epub] Requesting URLs for #29034# Harper's Young People, July 13, 1880
curl --fail --insecure --location --silent --show-error -C - --url http://aleph.gutenberg.org/cache/epub/27597/pg27597.epub --output dl-cache/27597.epub
    Downloading content files for Book #17419
[epub] Requesting URLs for #17419# Bouddha
curl --fail --insecure --location --silent --show-error -C - --url http://aleph.gutenberg.org/cache/epub/29034/pg29034.epub --output dl-cache/29034.epub
curl --fail --insecure --location --silent --show-error -C - --url http://aleph.gutenberg.org/cache/epub/17419/pg17419.epub --output dl-cache/17419.epub
http://aleph.gutenberg.org:80 "GET /cache/epub/22089/heardye.pdf HTTP/1.1" 404 None
Starting new HTTP connection (1): aleph.gutenberg.org
kelson42 commented 4 years ago

@rgaudin Should we close this ticket? Do you think it is outdated after the few fixes you have made?

rgaudin commented 4 years ago

No, I looked at it after the fixes, but I doubt it's fixed. It looks like a latin-1 encoded string is being decoded as UTF-8, probably because the encoding is reported incorrectly.
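The suspected failure mode can be reproduced in isolation. This is a minimal sketch in Python 3 (the traceback in this ticket comes from Python 2.7), and it does not claim to locate the actual string that broke the scraper: the sample text and the `decode_with_fallback` helper are hypothetical and not part of gutenberg2zim.

```python
def decode_with_fallback(raw, declared="utf-8", fallback="latin-1"):
    """Decode raw bytes with the declared encoding, falling back if that fails.

    Hypothetical helper for illustration only.
    """
    try:
        return raw.decode(declared)
    except UnicodeDecodeError:
        # latin-1 maps every byte to a code point, so this decode never raises.
        return raw.decode(fallback)


# In latin-1, "é" is the single byte 0xe9.
raw = "café au lait".encode("latin-1")

# Decoding those bytes as UTF-8 fails: 0xe9 announces a multi-byte sequence,
# but the next byte (a space) is not a continuation byte.
try:
    raw.decode("utf-8")
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = exc

# With the fallback, the original text is recovered.
text = decode_with_fallback(raw)
```

A latin-1 fallback never fails, which also means it silently produces wrong characters when the real encoding is something else. It is a stopgap for misreported encodings, not a substitute for detecting the encoding properly.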

We'll have to wait for the first run to decide.

rgaudin commented 4 years ago

We hit this bug on the zimfarm.

UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 18: invalid continuation byte

Also, since this happens in a multiprocessing Pool, it doesn't return properly and the task stays idle forever.
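One way to keep a single bad book from aborting the whole `Pool.map` call is to catch exceptions inside the worker and return them as data. The following is a sketch in Python 3 under stated assumptions, not the fix applied in this project: `fake_download`, `safe_download` and `download_all` are hypothetical names, with `fake_download` standing in for the per-book function passed to `Pool(concurrency).map(...)` in the traceback.

```python
from multiprocessing import Pool


def fake_download(book_id):
    """Hypothetical stand-in for the scraper's per-book download function."""
    if book_id == 3:
        # Simulate the reported failure: latin-1 bytes decoded as UTF-8.
        b"caf\xe9 au lait".decode("utf-8")
    return book_id


def safe_download(book_id):
    """Run fake_download and return (book_id, error) instead of raising."""
    try:
        fake_download(book_id)
        return book_id, None
    except Exception as exc:
        return book_id, "{}: {}".format(type(exc).__name__, exc)


def download_all(book_ids, concurrency=2):
    """Download every book and return the list of (book_id, error) failures."""
    # Leaving the with-block terminates the workers, even if map() raises.
    with Pool(concurrency) as pool:
        results = pool.map(safe_download, book_ids)
    return [(book_id, err) for book_id, err in results if err is not None]


if __name__ == "__main__":
    for book_id, error in download_all([1, 2, 3, 4]):
        print("book #{} failed: {}".format(book_id, error))
```

Returning errors as values lets the parent process decide whether to retry, skip, or exit with a non-zero status once all books have been attempted, rather than leaving the task idle.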

Starting new HTTP connection (1): aleph.gutenberg.org
curl --fail --insecure --location --silent --show-error -C - --url http://aleph.gutenberg.org/cache/epub/26781/pg26781.epub --output dl-cache/26781.epub
 (u'audio/mpeg', False, u'{id}-13.mp3'),
 (u'audio/ogg', False, u'{id}-08.ogg'),
 (u'audio/mpeg', False, u'{id}-06.mp3'),
 (u'audio/ogg', False, u'{id}-10.spx'),
 (u'audio/mpeg', False, u'{id}-08.mp3'),
 (u'audio/ogg', False, u'{id}-03.ogg'),
 (u'audio/ogg', False, u'{id}-06.spx'),
 (u'audio/ogg', False, u'{id}-02.ogg'),
 (u'audio/ogg', False, u'{id}-12.spx'),
 (u'audio/ogg', False, u'{id}-09.spx'),
 (u'audio/mp4', False, u'{id}-07.m4b'),
 (u'audio/mpeg', False, u'{id}-07.mp3'),
 (u'audio/mpeg', False, u'{id}-11.mp3'),
 (u'audio/ogg', False, u'{id}-07.ogg'),
 (u'audio/ogg', False, u'{id}-11.spx'),
 (u'audio/mp4', False, u'{id}-02.m4b'),
 (u'application/rdf+xml', False, u'{id}.rdf'),
 (u'audio/ogg', False, u'{id}-05.spx'),
 (u'audio/ogg', False, u'{id}-12.ogg'),
 (u'audio/mpeg', False, u'{id}-10.mp3'),
 (u'audio/mpeg', False, u'{id}-12.mp3'),
 (u'audio/mp4', False, u'{id}-10.m4b'),
 (u'audio/mp4', False, u'{id}-12.m4b'),
 (u'audio/mpeg', False, u'{id}-01.mp3'),
 (u'audio/ogg', False, u'{id}-13.spx'),
 (u'audio/mpeg', False, u'{id}-04.mp3'),
 (u'audio/ogg', False, u'{id}-03.spx'),
 (u'audio/mp4', False, u'{id}-04.m4b'),
 (u'audio/ogg', False, u'{id}-01.spx'),
 (u'audio/ogg', False, u'{id}-13.ogg'),
 (u'audio/ogg', False, u'{id}-10.ogg'),
 (u'audio/ogg', False, u'{id}-11.ogg'),
 (u'audio/ogg', False, u'{id}-07.spx'),
 (u'audio/mp4', False, u'{id}-08.m4b'),
 (u'audio/mp4', False, u'{id}-09.m4b'),
 (u'audio/ogg', False, u'{id}-04.spx'),
 (u'audio/ogg', False, u'{id}-08.spx'),
 (u'audio/mpeg', False, u'{id}-02.mp3'),
 (u'audio/mp4', False, u'{id}-01.m4b'),
 (u'audio/ogg', False, u'{id}-02.spx'),
 (u'audio/ogg', False, u'{id}-06.ogg'),
 (u'audio/mp4', False, u'{id}-06.m4b'),
 (u'text/html', False, u'{id}-index.html')]
[u'http://aleph.gutenberg.org/etext05/19293-h.htm',
 u'http://aleph.gutenberg.org/etext92/19293-h.htm',
 u'http://aleph.gutenberg.org/etext99/19293-h.htm',
 u'http://aleph.gutenberg.org/etext03/19293-h.htm',
 u'http://aleph.gutenberg.org/1/9/2/9/19293/19293-h.htm',
 u'http://aleph.gutenberg.org/etext00/19293-h.htm',
 u'http://aleph.gutenberg.org/etext01/19293-h.htm',
 u'http://aleph.gutenberg.org/etext90/19293-h.htm',
 u'http://aleph.gutenberg.org/1/9/2/9/19293/19293-h.html',
 u'http://aleph.gutenberg.org/etext96/19293-h.htm',
 u'http://aleph.gutenberg.org/etext93/19293-h.htm',
 u'http://aleph.gutenberg.org/etext02/19293-h.htm',
 u'http://aleph.gutenberg.org/etext98/19293-h.htm',
 u'http://aleph.gutenberg.org/cache/epub/19293/pg19293.html.utf8',
 u'http://aleph.gutenberg.org/etext95/19293-h.htm',
 u'http://aleph.gutenberg.org/etext94/19293-h.htm',
 u'http://aleph.gutenberg.org/1/9/2/9/19293/19293-h.zip',
 u'http://aleph.gutenberg.org/etext91/19293-h.htm',
 u'http://aleph.gutenberg.org/etext04/19293-h.htm',
 u'http://aleph.gutenberg.org/etext97/19293-h.htm']
Traceback (most recent call last):
  File "/usr/local/bin/gutenberg2zim", line 4, in <module>
        Downloading content files for Book #30336
__import__('pkg_resources').run_script('gutenberg2zim==1.1.3.0', 'gutenberg2zim')
  File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 658, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 1438, in run_script
    exec(code, namespace, namespace)
  File "/usr/local/lib/python2.7/dist-packages/gutenberg2zim-1.1.3.0-py2.7.egg/EGG-INFO/scripts/gutenberg2zim", line 219, in <module>
    main(docopt(help, version=VERSION))
  File "/usr/local/lib/python2.7/dist-packages/gutenberg2zim-1.1.3.0-py2.7.egg/EGG-INFO/scripts/gutenberg2zim", line 168, in main
    force=FORCE)
  File "/usr/local/lib/python2.7/dist-packages/gutenberg2zim-1.1.3.0-py2.7.egg/gutenbergtozim/download.py", line 228, in download_all_books
    Pool(concurrency).map(dlb, available_books)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 253, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 572, in get
    raise self._value
http://aleph.gutenberg.org:80 "GET /etext90/18380-h.htm HTTP/1.1" 404 None
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 18: invalid continuation byte
http://aleph.gutenberg.org:80 "GET /etext04/19298-h.html HTTP/1.1" 404 None
    Downloading content files for Book #18381
http://aleph.gutenberg.org:80 "GET /etext90/23105.html.noimages HTTP/1.1" 404 None
Starting new HTTP connection (1): aleph.gutenberg.org
Starting new HTTP connection (1): aleph.gutenberg.org
[pdf] not avail. for #20244# Le Voluptueux Voyage
[epub] Requesting URLs for #30336# Hours in a Library, Volume 2
[epub] Requesting URLs for #18381# De Lotgevallen van Tom Sawyer
Starting new HTTP connection (1): aleph.gutenberg.org
http://aleph.gutenberg.org:80 "GET /etext99/27711-h.zip HTTP/1.1" 404 None
http://aleph.gutenberg.org:80 "GET /etext01/15769-h.zip HTTP/1.1" 404 None
rgaudin commented 4 years ago

Think this is gone now 👍

kelson42 commented 4 years ago

@rgaudin @satyamtg Not sure this has the same root cause, but the symptom looks really similar. Look at the last scrape log: https://farm.openzim.org/pipeline/5eff0ba0a96db4d3a374d0e2/debug.

rgaudin commented 4 years ago

No, it's different. I've opened #132.