Closed by kelson42 9 years ago
I have tried to fix this with:
def book_name_for_fs(book): return book.title.strip().replace('/', '-')[:230]
But then:
Traceback (most recent call last):
File "./dump-gutenberg.py", line 150, in
Just remove the .encode("ascii") part and that should be it: the string is limited to 230 chars.
Sorry, the "ascii" stuff was just a trial (here is the correct error):
Exporting Book #2810.
Exporting to static/Plunkitt of Tammany Hall: a series of very plain talks on very practical politics, delivered by ex-Senator George Washington Plunkitt, the Tammany philosopher, from his rostrum—the New York County court house bootblack stand; Reco.2810.html
Copying format file to Plunkitt of Tammany Hall: a series of very plain talks on very practical politics, delivered by ex-Senator George Washington Plunkitt, the Tammany philosopher, from his rostrum—the New York County court house bootblack stand; Reco.2810.epub
Creating ePUB at /tmp/tmp3LjMTz.epub
Exporting to static/Plunkitt of Tammany Hall: a series of very plain talks on very practical politics, delivered by ex-Senator George Washington Plunkitt, the Tammany philosopher, from his rostrum—the New York County court house bootblack stand; Reco_cover.2810.html
Traceback (most recent call last):
File "./dump-gutenberg.py", line 150, in
OK, there are two ways to fix it:
# Option 1: replace the offending character explicitly
def book_name_for_fs(book):
    return book.title.strip().replace('/', '-').replace('\u2014', '-')

# Option 2: normalize the title and drop everything that is not ASCII
import unicodedata
def book_name_for_fs(book):
    title = book.title.strip().replace('/', '-')
    return unicodedata.normalize('NFKD', title.decode('utf-8', 'replace')).encode('ascii', 'ignore')
The problem with the latter solution is that we don't know how to reproduce it exactly on the other end (JavaScript). If getting the same result in JS is not required, then we can use it. On the other hand, the first version might not be enough for our data set.
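To make the trade-off concrete, here is a small sketch of what the NFKD approach does to a title. It is written for Python 3 (the thread's code is Python 2), and the helper name ascii_fold is made up for illustration. Accented letters survive as their base letter, but a character with no ASCII decomposition, like the em dash in book #2810, simply disappears, which is the part that would be hard to mirror exactly in JS.

```python
import unicodedata

def ascii_fold(title):
    """Hypothetical helper: fold a Unicode title down to plain ASCII."""
    # NFKD splits accented letters into a base letter plus a combining mark.
    decomposed = unicodedata.normalize('NFKD', title)
    # Dropping non-ASCII code points removes the combining marks, but it also
    # removes characters that have no ASCII decomposition, such as U+2014.
    return decomposed.encode('ascii', 'ignore').decode('ascii')

print(ascii_fold('caf\u00e9'))         # cafe
print(ascii_fold('rostrum\u2014the'))  # rostrumthe
```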
OK, the first solution works (by doing the same in JS). It's really not pretty, but at least if another encoding problem appears, the script will crash (and so let us know). Thanks for your help.
OK, now I have another unsupported character :( I don't think the first approach is going to work.
Second approach generates an error:
Exporting Book #2810.
Traceback (most recent call last):
File "./dump-gutenberg.py", line 150, in
Honestly, I don't understand why we have a problem here. Everything should work in UTF-8. What is the problem with this title? Is it not UTF-8?
It is. It's the urllib.quote function that doesn't work with UTF-8.
The following should work:
def book_name_for_fs(book):
    return book.title.strip().replace('/', '-').encode('utf-8')
:(((
def book_name_for_fs(book):
rm -rf static/ ; ./dump-gutenberg.py --books=2810 --download --export --zim ; ./kiwix-serve --port=8081 gutenberg_mul_all_09_2014.zim
DOWNLOADING ebooks from mirror using filters
[2810]
Downloading content files for Book #2810
epub already exists at dl-cache/2810.epub
[pdf] not avail. for #2810# Plunkitt of Tammany Hall: a series of very plain talks on very practical politics, delivered by ex-Senator George Washington Plunkitt, the Tammany philosopher, from his rostrum—the New York County court house bootblack stand; Recorded by William L. Riordon
html already exists at dl-cache/2810.html
EXPORTING ebooks to static folder (and JSON)
[2810]
Filtered book collection size: 1
Filtered book collection, PDF: 0
Filtered book collection, ePUB: 1
Filtered book collection, HTML: 1
Dumping full_by_popularity.js
Dumping full_by_title.js
Dumping lang_en_by_popularity.js
Dumping lang_en_by_title.js
Dumping authors_lang_en.js
Dumping auth_1035_by_popularity.js
Dumping auth_1035_by_title.js
Dumping authors.js
Dumping languages.js
Dumping main_languages.js
Exporting Book #2810.
Traceback (most recent call last):
File "./dump-gutenberg.py", line 150, in
My bad. Use this:
def book_name_for_fs(book):
    return book.title.strip().replace('/', '-')

def urlencode(url):
    return urllib.quote(url.encode('utf-8'))
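For reference, a minimal sketch of the same idea in Python 3, where urllib.quote lives at urllib.parse.quote. The sample string is just a fragment of the title of book #2810. Encoding to UTF-8 bytes first means each byte of the em dash gets its own percent escape.

```python
from urllib.parse import quote  # Python 3 location of Python 2's urllib.quote

def urlencode(url):
    # Encode to UTF-8 bytes first, then percent-escape each non-safe byte.
    return quote(url.encode('utf-8'))

print(urlencode('rostrum\u2014the'))  # rostrum%E2%80%94the
```

JavaScript's encodeURIComponent also percent-encodes the UTF-8 bytes, so non-ASCII characters should come out the same on both ends. The two may still differ on a few ASCII punctuation characters (for example '/' and '!'), so that is worth checking against the real data set.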
Seems to work fine now :))))
Traceback (most recent call last):
File "./dump-gutenberg.py", line 150, in
main(docopt(help, version=0.1))
File "./dump-gutenberg.py", line 137, in main
only_books=BOOKS)
File "/media/data/gutenberg/gutenberg/export.py", line 149, in export_all_books
books=books)
File "/media/data/gutenberg/gutenberg/export.py", line 365, in export_book_to
with open(article_fpath, 'w') as f:
IOError: [Errno 36] File name too long: u'static/Plunkitt of Tammany Hall: a series of very plain talks on very practical politics, delivered by ex-Senator George Washington Plunkitt, the Tammany philosopher, from his rostrum\u2014the New York County court house bootblack stand; Recorded by William L. Riordon.2810.html'
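A note on the length limit itself: the [:230] slice counts characters, while filesystems usually limit the encoded byte length of each path component, and the em dash takes three bytes in UTF-8. Below is a rough Python 3 sketch of truncating by bytes instead. The helper name truncate_utf8 and the 255-byte limit are assumptions (255 is common on Linux filesystems, but not checked for this setup), and the title is a dummy string.

```python
def truncate_utf8(name, max_bytes):
    """Hypothetical helper: cut name so its UTF-8 form fits in max_bytes."""
    encoded = name.encode('utf-8')
    if len(encoded) <= max_bytes:
        return name
    # Cutting bytes can split a multi-byte character; errors='ignore'
    # drops the incomplete trailing sequence instead of raising.
    return encoded[:max_bytes].decode('utf-8', 'ignore')

NAME_MAX = 255               # assumed per-component limit, in bytes
suffix = '_cover.2810.html'  # longest suffix seen in the export log
title = 'x' * 300 + '\u2014' # dummy title that is too long
fname = truncate_utf8(title, NAME_MAX - len(suffix.encode('utf-8'))) + suffix
print(len(fname.encode('utf-8')))  # 255
```

Reserving room for the suffix before truncating keeps the book id and extension intact whatever the title length.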