openzim / gutenberg

Scraper for downloading the entire ebooks repository of project Gutenberg
https://download.kiwix.org/zim/gutenberg
GNU General Public License v3.0

Decoding error on rdf parsing #132

Closed rgaudin closed 3 years ago

rgaudin commented 4 years ago

The latest run failed due to an encoding error when parsing RDF files. This is unexpected, as the previous run did not have this problem and this file doesn't seem to have been updated. Also, its content looks safe (it should be a book between 11850 and 11966).

Traceback (most recent call last):
  File "/usr/local/bin/gutenberg2zim", line 4, in <module>
    __import__('pkg_resources').run_script('gutenberg2zim==1.1.4', 'gutenberg2zim')
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 658, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 1438, in run_script
    exec(code, namespace, namespace)
  File "/usr/local/lib/python3.6/dist-packages/gutenberg2zim-1.1.4-py3.6.egg/EGG-INFO/scripts/gutenberg2zim", line 274, in <module>
    main(docopt(help, version=VERSION))
  File "/usr/local/lib/python3.6/dist-packages/gutenberg2zim-1.1.4-py3.6.egg/EGG-INFO/scripts/gutenberg2zim", line 188, in main
    rdf_path=RDF_FOLDER, only_books=BOOKS, concurrency=CONCURRENCY, force=FORCE
  File "/usr/local/lib/python3.6/dist-packages/gutenberg2zim-1.1.4-py3.6.egg/gutenbergtozim/rdf.py", line 87, in parse_and_fill
    Pool(concurrency).map(ppf, fpaths)
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 266, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/usr/local/lib/python3.6/dist-packages/gutenberg2zim-1.1.4-py3.6.egg/gutenbergtozim/rdf.py", line 85, in ppf
    return parse_and_process_file(x, force)
  File "/usr/local/lib/python3.6/dist-packages/gutenberg2zim-1.1.4-py3.6.egg/gutenbergtozim/rdf.py", line 102, in parse_and_process_file
    parser = RdfParser(f.read(), gid).parse()
  File "/usr/lib/python3.6/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 5806: ordinal not in range(128)
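
For readers following the traceback: the failure is raised inside `f.read()`, which suggests the file was opened in text mode without an explicit encoding, so Python fell back to the locale's preferred encoding (ASCII here). Byte `0xe2` is the lead byte of a three-byte UTF-8 sequence, such as the one for U+2019 (right single quotation mark). The sketch below reproduces that kind of failure on a stand-in file and shows one way to avoid it. It is not the fix that was actually applied in the repository; `read_rdf_text` and the sample content are hypothetical names used only for illustration.

```python
import os
import tempfile


def read_rdf_text(path):
    """Hypothetical helper: read a file as UTF-8, independent of the locale."""
    with open(path, "r", encoding="utf-8") as f:
        return f.read()


# Stand-in content (an assumption, not a real Gutenberg RDF file).
# U+2019 is encoded in UTF-8 as the bytes e2 80 99.
sample = "<dcterms:title>Alice\u2019s Adventures</dcterms:title>"
fd, sample_path = tempfile.mkstemp(suffix=".rdf")
with os.fdopen(fd, "wb") as f:
    f.write(sample.encode("utf-8"))

# Forcing ASCII mimics what open() does by default under an ASCII locale.
try:
    with open(sample_path, "r", encoding="ascii") as f:
        f.read()
    ascii_failed = False
    failing_byte = None
except UnicodeDecodeError as exc:
    ascii_failed = True
    failing_byte = exc.object[exc.start]  # the byte the codec rejected

# Passing the encoding explicitly decodes the same bytes without error.
utf8_text = read_rdf_text(sample_path)
os.remove(sample_path)
```

Passing `encoding="utf-8"` explicitly makes the result independent of the environment the scraper runs in, which may explain why the same files could parse on one run and fail on another.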

kelson42 commented 4 years ago

Can probably be closed. Thx for the fix.

kelson42 commented 3 years ago

Sorry, still wrong https://farm.openzim.org/pipeline/5f00b76aa96db4d3a376f5ab/debug

satyamtg commented 3 years ago

This seems to be a different error though. Last time it was while parsing RDF. This time parsing seems to have gone well, but the path name wasn't good enough. Investigating. Also, it's most probably due to book #135.

kelson42 commented 3 years ago

@satyamtg Thank you for checking, we should probably open a new ticket then.

satyamtg commented 3 years ago

> @satyamtg Thank you for checking, we should probably open a new ticket then.

I don't think so. The error is in a different place but seems related to the fix that was made, though it wasn't reproduced with a subset of books numbered 1 to 150.

rgaudin commented 3 years ago

It's a different error. Fixed it in master.