openzim / gutenberg

Scraper for downloading the entire ebooks repository of project Gutenberg
https://download.kiwix.org/zim/gutenberg
GNU General Public License v3.0

Too long filename #22

Closed (kelson42 closed 9 years ago)

kelson42 commented 9 years ago
    Exporting Book #2810.
            Exporting to static/Plunkitt of Tammany Hall: a series of very plain talks on very practical politics, delivered by ex-Senator George Washington Plunkitt, the Tammany philosopher, from his rostrum—the New York County court house bootblack stand; Recorded by William L. Riordon.2810.html

    Traceback (most recent call last):
      File "./dump-gutenberg.py", line 150, in <module>
        main(docopt(help, version=0.1))
      File "./dump-gutenberg.py", line 137, in main
        only_books=BOOKS)
      File "/media/data/gutenberg/gutenberg/export.py", line 149, in export_all_books
        books=books)
      File "/media/data/gutenberg/gutenberg/export.py", line 365, in export_book_to
        with open(article_fpath, 'w') as f:
    IOError: [Errno 36] File name too long: u'static/Plunkitt of Tammany Hall: a series of very plain talks on very practical politics, delivered by ex-Senator George Washington Plunkitt, the Tammany philosopher, from his rostrum\u2014the New York County court house bootblack stand; Recorded by William L. Riordon.2810.html'

kelson42 commented 9 years ago

I have tried to fix this with:

def book_name_for_fs(book):
    return book.title.strip().replace('/', '-')[:230]

But then:

    Traceback (most recent call last):
      File "./dump-gutenberg.py", line 150, in <module>
        main(docopt(help, version=0.1))
      File "./dump-gutenberg.py", line 137, in main
        only_books=BOOKS)
      File "/media/data/projs/gutenberg/gutenberg/export.py", line 148, in export_all_books
        books=books)
      File "/media/data/projs/gutenberg/gutenberg/export.py", line 361, in export_book_to
        article_fpath = os.path.join(static_folder, article_name_for(book))
      File "/media/data/projs/gutenberg/gutenberg/export.py", line 155, in article_name_for
        title = book_name_for_fs(book)
      File "/media/data/projs/gutenberg/gutenberg/export.py", line 53, in book_name_for_fs
        return book.title.strip().replace('/', '-')[:230].encode("ascii")
    UnicodeEncodeError: 'ascii' codec can't encode character u'\u2014' in position 176: ordinal not in range(128)

rgaudin commented 9 years ago

Just remove the .encode("ascii") part and that should be it: the string is limited to 230 chars.
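
A side note on the 230-character slice: Errno 36 is raised when a single path component exceeds the filesystem limit, which on common Linux filesystems is 255 bytes rather than 255 characters. A character slice is therefore only safe while the title is mostly ASCII. Below is a minimal sketch of a byte-aware truncation, written for Python 3 (the thread's code is Python 2.7). `truncate_for_fs`, `suffix` and `max_bytes` are hypothetical names, not part of the scraper.

```python
def truncate_for_fs(title, suffix, max_bytes=255):
    """Return title + suffix, shortening title so the UTF-8 name fits in max_bytes."""
    budget = max_bytes - len(suffix.encode("utf-8"))
    if budget < 0:
        raise ValueError("suffix alone exceeds max_bytes")
    encoded = title.encode("utf-8")[:budget]
    # Slicing bytes may cut a multi-byte character in half;
    # errors="ignore" drops the incomplete trailing sequence.
    return encoded.decode("utf-8", errors="ignore") + suffix


# 200 em dashes are 600 bytes in UTF-8, well over the limit.
name = truncate_for_fs("\u2014" * 200, "_cover.2810.html")
print(len(name.encode("utf-8")))
```

Reserving room for the suffix first keeps the book id and extension intact, which matters because the `_cover.2810.html` variant in the logs is longer than the plain `.2810.html` one.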

kelson42 commented 9 years ago

Sorry, the "ascii" stuff was just an experiment. Here is the actual error:

    Exporting Book #2810.
            Exporting to static/Plunkitt of Tammany Hall: a series of very plain talks on very practical politics, delivered by ex-Senator George Washington Plunkitt, the Tammany philosopher, from his rostrum—the New York County court house bootblack stand; Reco.2810.html
            Copying format file to Plunkitt of Tammany Hall: a series of very plain talks on very practical politics, delivered by ex-Senator George Washington Plunkitt, the Tammany philosopher, from his rostrum—the New York County court house bootblack stand; Reco.2810.epub
            Creating ePUB at /tmp/tmp3LjMTz.epub
            Exporting to static/Plunkitt of Tammany Hall: a series of very plain talks on very practical politics, delivered by ex-Senator George Washington Plunkitt, the Tammany philosopher, from his rostrum—the New York County court house bootblack stand; Reco_cover.2810.html

    Traceback (most recent call last):
      File "./dump-gutenberg.py", line 150, in <module>
        main(docopt(help, version=0.1))
      File "./dump-gutenberg.py", line 137, in main
        only_books=BOOKS)
      File "/media/data/projs/gutenberg/gutenberg/export.py", line 148, in export_all_books
        books=books)
      File "/media/data/projs/gutenberg/gutenberg/export.py", line 542, in export_book_to
        books=books)
      File "/media/data/projs/gutenberg/gutenberg/export.py", line 348, in cover_html_content_for
        return template.render(**context)
      File "/home/kelson/.virtualenvs/gut/local/lib/python2.7/site-packages/jinja2/environment.py", line 969, in render
        return self.environment.handle_exception(exc_info, True)
      File "/home/kelson/.virtualenvs/gut/local/lib/python2.7/site-packages/jinja2/environment.py", line 742, in handle_exception
        reraise(exc_type, exc_value, tb)
      File "/media/data/projs/gutenberg/gutenberg/templates/cover_article.html", line 1, in top-level template code
        {% extends "base.html" %}
      File "/media/data/projs/gutenberg/gutenberg/templates/base.html", line 78, in top-level template code
        {% block content %}
      File "/media/data/projs/gutenberg/gutenberg/templates/cover_article.html", line 43, in block "content"
      File "/media/data/projs/gutenberg/gutenberg/export.py", line 56, in urlencode
        return urllib.quote(url)
      File "/usr/lib/python2.7/urllib.py", line 1288, in quote
        return ''.join(map(quoter, s))
    KeyError: u'\u2014'

rgaudin commented 9 years ago

OK, there are two ways to fix it:

# Option 1: replace the em dash explicitly.
def book_name_for_fs(book):
    return book.title.strip().replace('/', '-').replace('\u2014', '-')

# Option 2: decompose the title and drop everything that is not ASCII.
import unicodedata
def book_name_for_fs(book):
    title = book.title.strip().replace('/', '-')
    return unicodedata.normalize('NFKD', title.decode('utf-8', 'replace')).encode('ascii', 'ignore')

The problem with the latter solution is that we don't know how to match it exactly on the other end (JavaScript). If we don't need to get the same result in JS, then we can use it. On the other hand, the first version might not be enough for our data set.
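
To make the trade-off of the second option concrete, here is a small Python 3 sketch of the same NFKD-then-ASCII idea (the thread's code is Python 2.7, and `ascii_name` is a hypothetical name used only for illustration). Accented letters keep their base letter, while characters without a decomposition, such as the em dash in this title, are silently dropped.

```python
import unicodedata


def ascii_name(title):
    """Approximate the second option: decompose, then drop non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", title)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(ascii_name("caf\u00e9"))         # accent is stripped: cafe
print(ascii_name("rostrum\u2014the"))  # em dash is dropped: rostrumthe
```

Because the conversion is lossy, the JavaScript side would need to reproduce exactly the same dropping rules to compute matching file names, which is the concern raised above.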

kelson42 commented 9 years ago

OK, the first solution works (by doing the same in JS). It's really not pretty, but at least if another encoding problem appears, the script will crash (and so will let us know). Thanks for your help.

kelson42 commented 9 years ago

OK, now I have another unsupported character :( I don't think the first approach is going to work.

The second approach generates an error:

    Exporting Book #2810.
    Traceback (most recent call last):
      File "./dump-gutenberg.py", line 150, in <module>
        main(docopt(help, version=0.1))
      File "./dump-gutenberg.py", line 137, in main
        only_books=BOOKS)
      File "/media/data/projs/gutenberg/gutenberg/export.py", line 152, in export_all_books
        books=books)
      File "/media/data/projs/gutenberg/gutenberg/export.py", line 365, in export_book_to
        article_fpath = os.path.join(static_folder, article_name_for(book))
      File "/media/data/projs/gutenberg/gutenberg/export.py", line 159, in article_name_for
        title = book_name_for_fs(book)
      File "/media/data/projs/gutenberg/gutenberg/export.py", line 55, in book_name_for_fs
        return unicodedata.normalize('NFKD', title.decode('utf-8', 'replace')).encode('ascii', 'ignore')
      File "/home/kelson/.virtualenvs/gut/lib/python2.7/encodings/utf_8.py", line 16, in decode
        return codecs.utf_8_decode(input, errors, True)
    UnicodeEncodeError: 'ascii' codec can't encode character u'\u2014' in position 176: ordinal not in range(128)

Honestly, I don't understand why we have a problem here. Everything should work in UTF-8. What is the problem with this title? Is it not UTF-8?

rgaudin commented 9 years ago

It is. It's the urllib.quote function that doesn't work with unicode strings containing non-ASCII characters.

The following should work:

def book_name_for_fs(book):
    return book.title.strip().replace('/', '-').encode('utf-8')

kelson42 commented 9 years ago

:(((

def book_name_for_fs(book):

    rm -rf static/ ; ./dump-gutenberg.py --books=2810 --download --export --zim ; ./kiwix-serve --port=8081 gutenberg_mul_all_09_2014.zim
    DOWNLOADING ebooks from mirror using filters
    [2810]
    Downloading content files for Book #2810
    epub already exists at dl-cache/2810.epub
    [pdf] not avail. for #2810# Plunkitt of Tammany Hall: a series of very plain talks on very practical politics, delivered by ex-Senator George Washington Plunkitt, the Tammany philosopher, from his rostrum—the New York County court house bootblack stand; Recorded by William L. Riordon
    html already exists at dl-cache/2810.html
    EXPORTING ebooks to static folder (and JSON)
    [2810]
    Filtered book collection size: 1
    Filtered book collection, PDF: 0
    Filtered book collection, ePUB: 1
    Filtered book collection, HTML: 1
    Dumping full_by_popularity.js
    Dumping full_by_title.js
    Dumping lang_en_by_popularity.js
    Dumping lang_en_by_title.js
    Dumping authors_lang_en.js
    Dumping auth_1035_by_popularity.js
    Dumping auth_1035_by_title.js
    Dumping authors.js
    Dumping languages.js
    Dumping main_languages.js
    Exporting Book #2810.
    Traceback (most recent call last):
      File "./dump-gutenberg.py", line 150, in <module>
        main(docopt(help, version=0.1))
      File "./dump-gutenberg.py", line 137, in main
        only_books=BOOKS)
      File "/media/data/projs/gutenberg/gutenberg/export.py", line 148, in export_all_books
        books=books)
      File "/media/data/projs/gutenberg/gutenberg/export.py", line 361, in export_book_to
        article_fpath = os.path.join(static_folder, article_name_for(book))
      File "/media/data/projs/gutenberg/gutenberg/export.py", line 156, in article_name_for
        return "{title}{cover}.{id}.html".format(title=title, cover=cover, id=book.id)
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 176: ordinal not in range(128)
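
The UnicodeEncodeError and UnicodeDecodeError seen in this thread are both consistent with Python 2's implicit ASCII conversion when `unicode` and byte `str` values are mixed. Calling `.decode('utf-8')` on a `unicode` object first encodes it with the ASCII codec, and formatting a UTF-8 byte string into a `unicode` template first decodes it with ASCII. Python 3 has no such implicit step, so the sketch below performs the two conversions explicitly to show the same failures at position 176. It assumes (this is not confirmed in the thread) that `book.title` is a `unicode` object and that the format string is `unicode` as well.

```python
title = (
    "Plunkitt of Tammany Hall: a series of very plain talks on very "
    "practical politics, delivered by ex-Senator George Washington "
    "Plunkitt, the Tammany philosopher, from his rostrum\u2014the New York "
    "County court house bootblack stand; Recorded by William L. Riordon"
)

# What Python 2 does implicitly before title.decode('utf-8') on a unicode object.
try:
    title.encode("ascii")
except UnicodeEncodeError as exc:
    encode_error = exc
    print(exc.start, repr(exc.object[exc.start]))

# What Python 2 does implicitly when UTF-8 bytes meet a unicode template.
try:
    title.encode("utf-8").decode("ascii")
except UnicodeDecodeError as exc:
    decode_error = exc
    print(exc.start, hex(exc.object[exc.start]))
```

Both failures point at the em dash: it is character 176 of the title, and because everything before it is ASCII, its first UTF-8 byte (0xe2) is also byte 176.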

rgaudin commented 9 years ago

My bad. Use this:

def book_name_for_fs(book):
    return book.title.strip().replace('/', '-')

def urlencode(url):
    return urllib.quote(url.encode('utf-8'))

kelson42 commented 9 years ago

Seems to work fine now :))))
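
For readers on Python 3, where `urllib.quote` has moved to `urllib.parse.quote`, the accepted fix can be sketched as follows. This is an illustration of the same idea, not the scraper's actual code. Encoding to UTF-8 bytes first mirrors the Python 2 fix; in Python 3 `quote` would also accept the text directly and use UTF-8 by default.

```python
from urllib.parse import quote


def urlencode(url):
    """Percent-encode a text URL component from its UTF-8 bytes."""
    return quote(url.encode("utf-8"))


print(urlencode("rostrum\u2014the"))  # rostrum%E2%80%94the
```

The em dash becomes its three UTF-8 bytes, `%E2%80%94`. If the JavaScript side builds the same links with `encodeURIComponent`, the set of reserved characters may differ (for example, `quote` leaves `/` unescaped by default), so the two sides are worth comparing on a few real titles.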