openzim / gutenberg

Scraper for downloading the entire ebook repository of Project Gutenberg
https://download.kiwix.org/zim/gutenberg
GNU General Public License v3.0

PDF not always downloaded correctly #32

Closed kelson42 closed 7 years ago

kelson42 commented 9 years ago

For the following book, a PDF is available: http://www.gutenberg.org/ebooks/11

Have a look at the mirror: http://gutenberg.readingroo.ms/1/11/

But the script seems to be unable to download it:

    $ rm -rf static/ ; ./dump-gutenberg.py --keep-db --download --books=11
    DOWNLOADING ebooks from mirror using filters [11]
    Downloading content files for Book #11
    epub already exists at dl-cache/11.epub
    [pdf] Requesting URLs for #11# Alice's Adventures in Wonderland
    Starting new HTTP connection (1): gutenberg.readingroo.ms
    "GET /cache/generated/11/11-pdf.pdf HTTP/1.1" 404 227
    Starting new HTTP connection (1): gutenberg.readingroo.ms
    "GET /cache/generated/11/11.pdf HTTP/1.1" 404 223
    Starting new HTTP connection (1): gutenberg.readingroo.ms
    "GET /cache/generated/11/pg11.pdf HTTP/1.1" 404 225
    NO FILE FOR #11/pdf [u'http://gutenberg.readingroo.ms/cache/generated/11/pg11.pdf', u'http://gutenberg.readingroo.ms/cache/generated/11/11.pdf', u'http://gutenberg.readingroo.ms/cache/generated/11/11-pdf.pdf']
    html already exists at dl-cache/11.html
    (gut)kelson@zimfarm:/media/data/gutenberg$ ls -la http://gutenberg.readingroo.ms/cache/generated/11/11.pdf
    ls: cannot access http://gutenberg.readingroo.ms/cache/generated/11/11.pdf: No such file or directory
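For anyone trying to reproduce this, here is a rough sketch of the lookup the log suggests: three candidate file names are tried under /cache/generated/<id>/, while the PDF mentioned above is listed under /1/11/. This is not the scraper's actual code. The function names (`candidate_pdf_urls`, `mirror_dir_url`, `first_available`) are hypothetical, and the directory layout in `mirror_dir_url` is an assumption inferred from the /1/11/ link.

```python
from typing import Callable, Iterable, List, Optional

# Mirror host as it appears in the log; adjust for another mirror.
MIRROR = "http://gutenberg.readingroo.ms"


def candidate_pdf_urls(book_id: int, mirror: str = MIRROR) -> List[str]:
    """Return the three PDF URLs the log shows being requested, in order."""
    base = "{}/cache/generated/{}".format(mirror, book_id)
    names = ["{}-pdf.pdf", "{}.pdf", "pg{}.pdf"]
    return ["{}/{}".format(base, name.format(book_id)) for name in names]


def mirror_dir_url(book_id: int, mirror: str = MIRROR) -> str:
    """Return the per-book directory on the mirror.

    Assumption: every digit of the ID except the last becomes one
    directory level, followed by the full ID (11 -> 1/11/).
    """
    digits = str(book_id)
    parts = list(digits[:-1]) or ["0"]  # single-digit IDs assumed under 0/
    return "{}/{}/{}/".format(mirror, "/".join(parts), book_id)


def first_available(urls: Iterable[str],
                    exists: Callable[[str], bool]) -> Optional[str]:
    """Return the first URL for which exists(url) is true, else None."""
    for url in urls:
        if exists(url):
            return url
    return None


if __name__ == "__main__":
    for url in candidate_pdf_urls(11):
        print(url)
    print(mirror_dir_url(11))
```

The `exists` argument is left to the caller so the lookup can be tried without network access. For a live check it could wrap an HTTP HEAD request and compare the status code to 200.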

kelson42 commented 9 years ago

We seem to have a similar problem with #1342

tim-moody commented 9 years ago

In gutenberg_en_all_10_2014 I sampled half a dozen books that show the PDF icon, and none of them has the PDF file.