openzim / gutenberg

Scraper for downloading the entire ebooks repository of project Gutenberg
https://download.kiwix.org/zim/gutenberg
GNU General Public License v3.0

New crash by exporting a book #23

kelson42 closed this issue 9 years ago

kelson42 commented 10 years ago
    Exporting Book #12018.
            Exporting to static/Notes and Queries, Number 17, February 23, 1850.12018.html
            Copying format file to Notes and Queries, Number 17, February 23, 1850.12018.epub
            Creating ePUB at /tmp/tmp9J0uzM.epub
            Exporting to static/Notes and Queries, Number 17, February 23, 1850_cover.12018.html
    Exporting Book #12019.
            Exporting to static/Queen Hortense: A Life Picture of the Napoleonic Era.12019.html

    Traceback (most recent call last):
      File "./dump-gutenberg.py", line 150, in <module>
        main(docopt(help, version=0.1))
      File "./dump-gutenberg.py", line 137, in main
        only_books=BOOKS)
      File "/media/data/gutenberg/gutenberg/export.py", line 155, in export_all_books
        books=books)
      File "/media/data/gutenberg/gutenberg/export.py", line 378, in export_book_to
        new_html = update_html_for_static(book=book, html_content=html)
      File "/media/data/gutenberg/gutenberg/export.py", line 275, in update_html_for_static
        [1 for e in body.children
    AttributeError: 'NoneType' object has no attribute 'children'

    -rw-rw-r-- 1 kelson kelson 428404 Sep 29 12:20 static/authors.js
    -rw-rw-r-- 1 kelson kelson 428404 Sep 29 12:15 static/authors_lang_en.js

kelson42 commented 10 years ago

Thx, I have restarted an export, feedback probably not before tomorrow.

kelson42 commented 10 years ago

It still crashes at the same book, but with a slightly different error.

kelson42 commented 10 years ago

I am reopening this ticket with a new, similar crash:

    Exporting Book #13103.
            Exporting to static/Great Britain and Her Queen.13103.html
            Copying companion file to 13103_062 William Whewell, DD.jpg
            Copying /media/data/gutenberg/static/13103_062 William Whewell, DD.jpg
            Copying companion file to 13103_037 Lord Palmerston.jpg
            Copying /media/data/gutenberg/static/13103_037 Lord Palmerston.jpg
            Copying companion file to 13103_004 Kensington Palace.jpg
            Copying /media/data/gutenberg/static/13103_004 Kensington Palace.jpg
            Copying companion file to 13103_001 Queen Victoria.jpg
            Copying /media/data/gutenberg/static/13103_001 Queen Victoria.jpg
            Copying companion file to 13103_061 Thomas Carlyle.jpg
            Copying /media/data/gutenberg/static/13103_061 Thomas Carlyle.jpg
            Copying companion file to 13103_032 Sir John Lawrence.jpg
            Copying /media/data/gutenberg/static/13103_032 Sir John Lawrence.jpg
            Copying companion file to 13103_053 Robert Southey.jpg
            Copying /media/data/gutenberg/static/13103_053 Robert Southey.jpg
            Copying companion file to 13103_040 The Mausoleum.jpg
            Copying /media/data/gutenberg/static/13103_040 The Mausoleum.jpg
            Copying companion file to 13103_092 Wesley preaching on his father's tomb.jpg
            Copying /media/data/gutenberg/static/13103_092 Wesley preaching on his father's tomb.jpg

    Traceback (most recent call last):
      File "./dump-gutenberg.py", line 150, in <module>
        main(docopt(help, version=0.1))
      File "./dump-gutenberg.py", line 137, in main
        only_books=BOOKS)
      File "/media/data/gutenberg/gutenberg/export.py", line 155, in export_all_books
        books=books)
      File "/media/data/gutenberg/gutenberg/export.py", line 547, in export_book_to
        handle_companion_file(fname)
      File "/media/data/gutenberg/gutenberg/export.py", line 519, in handle_companion_file
        optimize_image(dst)
      File "/media/data/gutenberg/gutenberg/export.py", line 418, in optimize_image
        return optimize_jpeg(fpath)
      File "/media/data/gutenberg/gutenberg/export.py", line 434, in optimize_jpeg
        .format(path=fpath))
      File "/media/data/gutenberg/gutenberg/utils.py", line 46, in exec_cmd
        return envoy.run(str(cmd.encode('utf-8')))
      File "/home/kelson/.virtualenvs/gut/local/lib/python2.7/site-packages/envoy/core.py", line 157, in run
        command = expand_args(command)
      File "/home/kelson/.virtualenvs/gut/local/lib/python2.7/site-packages/envoy/core.py", line 146, in expand_args
        command = map(shlex.split, command)
      File "/usr/lib/python2.7/shlex.py", line 279, in split
        return list(lex)
      File "/usr/lib/python2.7/shlex.py", line 269, in next
        token = self.get_token()
      File "/usr/lib/python2.7/shlex.py", line 96, in get_token
        raw = self.read_token()
      File "/usr/lib/python2.7/shlex.py", line 172, in read_token
        raise ValueError, "No closing quotation"
    ValueError: No closing quotation

kelson42 commented 10 years ago

Thanks, new test run started.

kelson42 commented 10 years ago

It seems your last fix has introduced a regression (it now crashes earlier, at book #2810):

    Exporting Book #2809.
            Exporting to static/Main-Travelled Roads.2809.html
            Copying format file to Main-Travelled Roads.2809.epub
            Creating ePUB at /tmp/tmpHetUAn.epub
            Exporting to static/Main-Travelled Roads_cover.2809.html
    Exporting Book #2810.
            Exporting to static/Plunkitt of Tammany Hall: a series of very plain talks on very practical politics, delivered by ex-Senator George Washington Plunkitt, the Tammany philosopher, from his rostrum—the New York County court house bootblack stand; Reco.2810.html
            Copying format file to Plunkitt of Tammany Hall: a series of very plain talks on very practical politics, delivered by ex-Senator George Washington Plunkitt, the Tammany philosopher, from his rostrum—the New York County court house bootblack stand; Reco.2810.epub
            Creating ePUB at /tmp/tmpT0QPPl.epub

    Traceback (most recent call last):
      File "./dump-gutenberg.py", line 150, in <module>
        main(docopt(help, version=0.1))
      File "./dump-gutenberg.py", line 137, in main
        only_books=BOOKS)
      File "/media/data/gutenberg/gutenberg/export.py", line 155, in export_all_books
        books=books)
      File "/media/data/gutenberg/gutenberg/export.py", line 557, in export_book_to
        archive_name_for(book, format))
      File "/media/data/gutenberg/gutenberg/export.py", line 525, in handle_companion_file
        path(tmp_epub.name).move(dst)
      File "/usr/lib/python2.7/shutil.py", line 299, in move
        copy2(src, real_dst)
      File "/usr/lib/python2.7/shutil.py", line 128, in copy2
        copyfile(src, dst)
      File "/usr/lib/python2.7/shutil.py", line 83, in copyfile
        with open(dst, 'wb') as fdst:
    IOError: [Errno 36] File name too long: u'/media/data/gutenberg/static/Plunkitt\ of\ Tammany\ Hall:\ a\ series\ of\ very\ plain\ talks\ on\ very\ practical\ politics,\ delivered\ by\ ex-Senator\ George\ Washington\ Plunkitt,\ the\ Tammany\ philosopher,\ from\ his\ rostrum\u2014the\ New\ York\ County\ court\ house\ bootblack\ stand;\ Reco.2810.epub'

kelson42 commented 10 years ago

It seems that truncating fname to 210 characters instead of 230 fixes the bug. But I don't understand why your last commit causes this regression, so I'll let you have a look.
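The length arithmetic behind this can be sketched in a few lines. This is only an illustration, not the exporter's code: it assumes a 255-byte limit per file name (the usual `NAME_MAX` on Linux filesystems such as ext4), the title is made-up filler text, and `backslash_escape` and `truncate_name` are hypothetical helpers. It shows how a backslash in front of every space can push a name that was already truncated to 230 characters over the limit.

```python
NAME_MAX = 255  # assumed per-name byte limit (typical on Linux filesystems)


def backslash_escape(name):
    """Hypothetical escaping that puts a backslash before every space."""
    return name.replace(" ", "\\ ")


def truncate_name(stem, suffix, limit=NAME_MAX):
    """Trim `stem` so that stem + suffix fits in `limit` UTF-8 bytes."""
    budget = limit - len(suffix.encode("utf-8"))
    raw = stem.encode("utf-8")[:budget]
    # errors="ignore" drops a multi-byte character cut in half by the slice.
    return raw.decode("utf-8", errors="ignore") + suffix


# Made-up long title, truncated to 230 characters as in the thread.
title = "very plain talks on very practical politics " * 6
stem = title[:230]

plain = stem + ".2810.epub"
escaped = backslash_escape(stem) + ".2810.epub"

print(len(plain.encode("utf-8")) <= NAME_MAX)    # unescaped name fits
print(len(escaped.encode("utf-8")) <= NAME_MAX)  # escaped name does not
```

Measuring the limit in encoded bytes rather than characters also matters for titles with non-ASCII characters such as the dash in this one, and truncating before any escaping keeps shell escapes out of the on-disk name.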

kelson42 commented 10 years ago

The fix seems to be:

Command-line arguments seem to be quoted with single quotes, so only that character should be escaped. Otherwise you add many characters to the filename (and cause other trouble). Please confirm.
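The quoting failure and the single-quote escape can be reproduced with the standard library alone. This is a sketch in Python 3, not the exporter's code: `some-optimizer` is a placeholder command name, the command strings are stand-ins for whatever `optimize_jpeg` actually builds, and `single_quote` is a hypothetical helper.

```python
import shlex

path = "static/13103_092 Wesley preaching on his father's tomb.jpg"


def single_quote(arg):
    """Hypothetical helper: wrap in single quotes, escaping only the ' character."""
    return "'" + arg.replace("'", "'\\''") + "'"


# Naive quoting: the apostrophe in "father's" closes the quoted string early,
# leaving an unbalanced quote at the end of the command.
naive = "some-optimizer '{}'".format(path)
try:
    shlex.split(naive)
except ValueError as exc:
    print("naive quoting fails:", exc)

# Escaping only the single quote leaves every other character untouched.
safe = "some-optimizer {}".format(single_quote(path))
print(shlex.split(safe))

# shlex.quote from the standard library produces an equivalent quoting.
print(shlex.split("some-optimizer " + shlex.quote(path)))
```

Both quoted forms split back into exactly two tokens, the command and the unmodified path, so nothing is added to the file name itself.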

kelson42 commented 10 years ago

I have committed the fix and am closing this; I am pretty sure this is the right solution.

kelson42 commented 10 years ago

Still crashing with a quoting issue:

            Copying /media/data/gutenberg/static/13103_062 William Whewell, DD.jpg
            Copying companion file to 13103_037 Lord Palmerston.jpg
            Copying /media/data/gutenberg/static/13103_037 Lord Palmerston.jpg
            Copying companion file to 13103_004 Kensington Palace.jpg
            Copying /media/data/gutenberg/static/13103_004 Kensington Palace.jpg
            Copying companion file to 13103_001 Queen Victoria.jpg
            Copying /media/data/gutenberg/static/13103_001 Queen Victoria.jpg
            Copying companion file to 13103_061 Thomas Carlyle.jpg
            Copying /media/data/gutenberg/static/13103_061 Thomas Carlyle.jpg
            Copying companion file to 13103_032 Sir John Lawrence.jpg
            Copying /media/data/gutenberg/static/13103_032 Sir John Lawrence.jpg
            Copying companion file to 13103_053 Robert Southey.jpg
            Copying /media/data/gutenberg/static/13103_053 Robert Southey.jpg
            Copying companion file to 13103_040 The Mausoleum.jpg
            Copying /media/data/gutenberg/static/13103_040 The Mausoleum.jpg
            Copying companion file to 13103_092 Wesley preaching on his father's tomb.jpg
            Copying /media/data/gutenberg/static/13103_092 Wesley preaching on his father's tomb.jpg

    Traceback (most recent call last):
      File "./dump-gutenberg.py", line 154, in <module>
        main(docopt(help, version=0.1))
      File "./dump-gutenberg.py", line 141, in main
        only_books=BOOKS)
      File "/media/data/gutenberg/gutenberg/export.py", line 155, in export_all_books
        books=books)
      File "/media/data/gutenberg/gutenberg/export.py", line 550, in export_book_to
        handle_companion_file(fname)
      File "/media/data/gutenberg/gutenberg/export.py", line 521, in handle_companion_file
        optimize_image(dst)
      File "/media/data/gutenberg/gutenberg/export.py", line 418, in optimize_image
        return optimize_jpeg(fpath)
      File "/media/data/gutenberg/gutenberg/export.py", line 434, in optimize_jpeg
        .format(path=fpath))
      File "/media/data/gutenberg/gutenberg/utils.py", line 47, in exec_cmd
        return envoy.run(str(cmd.encode('utf-8')))
      File "/home/kelson/.virtualenvs/gut/local/lib/python2.7/site-packages/envoy/core.py", line 157, in run
        command = expand_args(command)
      File "/home/kelson/.virtualenvs/gut/local/lib/python2.7/site-packages/envoy/core.py", line 146, in expand_args
        command = map(shlex.split, command)
      File "/usr/lib/python2.7/shlex.py", line 279, in split
        return list(lex)
      File "/usr/lib/python2.7/shlex.py", line 269, in next
        token = self.get_token()
      File "/usr/lib/python2.7/shlex.py", line 96, in get_token
        raw = self.read_token()
      File "/usr/lib/python2.7/shlex.py", line 172, in read_token
        raise ValueError, "No closing quotation"
    ValueError: No closing quotation

rgaudin commented 10 years ago

Ah, you introduced that one. It doesn't fail with my previous commit.

On Wed, Oct 8, 2014 at 1:43 PM, Kelson notifications@github.com wrote:

Reopened #23 https://github.com/kiwix/gutenberg/issues/23.

kelson42 commented 10 years ago

The problem with your patch is that it added backslashes before spaces (and, I guess, also before double quotes) in the names of all EPUB files, for example.

I don't know what the solution is, but could it be that the way the command-line arguments are quoted depends on the content (single quotes most of the time, but occasionally double quotes)?
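One way to sidestep the question of which quote character to escape is to skip command-string parsing entirely and pass the arguments as a list. The sketch below (Python 3) is an assumption about a possible approach, not a description of what the exporter or any later commit does; `run_tool` is a hypothetical wrapper, and the child process is just the Python interpreter echoing its argument back.

```python
import subprocess
import sys


def run_tool(argv):
    """Hypothetical wrapper: run a command given as a list, with no shell and no quoting."""
    result = subprocess.run(argv, stdout=subprocess.PIPE, check=True)
    return result.stdout.decode("utf-8")


# A name containing spaces, an apostrophe and double quotes.
name = "13103_092 Wesley preaching on his father's \"tomb\".jpg"

# The child prints its first argument back to the parent.
echoed = run_tool([sys.executable, "-c", "import sys; print(sys.argv[1])", name])
print(echoed.strip() == name)
```

Because each list element is handed to the child process as one argument, neither kind of quote needs escaping and the file name on disk stays untouched.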

kelson42 commented 9 years ago

I think I have fixed that bug: https://github.com/kiwix/gutenberg/commit/3343ddef7d0e3091a69df63e5e85fa5826099b20