openzim / gutenberg

Scraper for downloading the entire ebooks repository of project Gutenberg
https://download.kiwix.org/zim/gutenberg
GNU General Public License v3.0

Missing unoptimized folder #124

Closed rgaudin closed 4 years ago

rgaudin commented 4 years ago

The latest zimfarm run failed due to a missing file:

Traceback (most recent call last):
  File "/usr/local/bin/gutenberg2zim", line 4, in <module>
    __import__('pkg_resources').run_script('gutenberg2zim==1.1.4', 'gutenberg2zim')
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 658, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 1438, in run_script
    exec(code, namespace, namespace)
  File "/usr/local/lib/python3.6/dist-packages/gutenberg2zim-1.1.4-py3.6.egg/EGG-INFO/scripts/gutenberg2zim", line 274, in <module>
    main(docopt(help, version=VERSION))
  File "/usr/local/lib/python3.6/dist-packages/gutenberg2zim-1.1.4-py3.6.egg/EGG-INFO/scripts/gutenberg2zim", line 235, in main
    optimizer_version=OPTIMIZER_VERSION,
  File "/usr/local/lib/python3.6/dist-packages/gutenberg2zim-1.1.4-py3.6.egg/gutenbergtozim/export.py", line 281, in export_all_books
    Pool(concurrency).map(dlb, books)
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 266, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/usr/local/lib/python3.6/dist-packages/gutenberg2zim-1.1.4-py3.6.egg/gutenbergtozim/export.py", line 278, in dlb
    optimizer_version=optimizer_version,
  File "/usr/local/lib/python3.6/dist-packages/gutenberg2zim-1.1.4-py3.6.egg/gutenbergtozim/export.py", line 573, in export_book
    optimizer_version=optimizer_version,
  File "/usr/local/lib/python3.6/dist-packages/gutenberg2zim-1.1.4-py3.6.egg/gutenbergtozim/export.py", line 808, in handle_unoptimized_files
    for fpath in src_dir.iterdir():
  File "/usr/lib/python3.6/pathlib.py", line 1081, in iterdir
    for name in self._accessor.listdir(self):
  File "/usr/lib/python3.6/pathlib.py", line 387, in wrapped
    return strfunc(str(pathobj), *args)
FileNotFoundError: [Errno 2] No such file or directory: 'dl-cache/5031/unoptimized'

satyamtg commented 4 years ago

Interesting. It goes into handle_unoptimized_files() only if unoptimized_dir is present. But then it fails inside that function as unoptimized_dir isn't present. Looking into it.

eshellman commented 4 years ago

Possibly related: There had been a 5031 directory in 5030. It was removed in March.

satyamtg commented 4 years ago

It happens when unoptimized_dir contains only the HTML version of the book, since we process that before any other files. Its processing goes well. However, as the folder then contains no files, it gets deleted, yet the scraper goes on to execute the remaining code in handle_unoptimized_files(). (Fixed that with a simple return.)
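
A minimal sketch of that failure mode and the early return, not the actual export.py code. The function name `handle_unoptimized_files_sketch`, the `process_html` callback, and the file naming are assumptions made only for illustration.

```python
from pathlib import Path


def handle_unoptimized_files_sketch(src_dir, process_html):
    """Process the HTML book first, then list whatever is left in src_dir.

    Hypothetical stand-in for the real function; returns the names of the
    remaining (non-HTML) files.
    """
    src_dir = Path(src_dir)

    # Step 1: the HTML book is handled before any other file, then removed.
    for html in sorted(src_dir.glob("*.html")):
        process_html(html)
        html.unlink()

    # Step 2: a folder left empty by step 1 is deleted...
    if not any(src_dir.iterdir()):
        src_dir.rmdir()
        # ...so stop here. Without this return, the iterdir() call below
        # raises FileNotFoundError, as in the traceback above.
        return []

    # Step 3: the remaining formats.
    return sorted(p.name for p in src_dir.iterdir())
```

With a folder that holds only an HTML file, the function returns an empty list and the folder is gone afterwards; with an extra EPUB next to it, the EPUB name is returned and the folder is kept.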

A related problem arises for the other formats: if the optimized HTML file is already present in static, the scraper goes on to process the other formats even though their files are not there (as unoptimized_dir only contains HTML). I fixed this by first checking that the source file of each other format exists and only then processing it. (This also prevents a failure if, for some reason, a specific format file is not available in unoptimized_dir, either because it was downloaded from cache or because the download failed.)
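
The check described here could look roughly like the following sketch. It is not the code from the fix: the helper name `existing_sources`, the default format list, and the `<book_id>.<format>` file naming are all assumptions.

```python
from pathlib import Path


def existing_sources(src_dir, book_id, formats=("epub", "pdf")):
    """Return {format: path} for the formats whose source file is present.

    Hypothetical helper: formats with no source file are skipped instead
    of failing later, and a missing src_dir simply yields an empty dict.
    """
    src_dir = Path(src_dir)
    found = {}
    for fmt in formats:
        candidate = src_dir / f"{book_id}.{fmt}"  # assumed naming scheme
        # is_file() is False for a missing file and for a missing folder.
        if candidate.is_file():
            found[fmt] = candidate
    return found
```

A caller would then loop over the returned mapping only, so a format that is absent from unoptimized_dir is never opened.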