openzim / gutenberg

Scraper for downloading the entire ebooks repository of project Gutenberg
https://download.kiwix.org/zim/gutenberg
GNU General Public License v3.0

Trying to make ZIM files in all languages fails #72

Closed kelson42 closed 4 years ago

kelson42 commented 5 years ago

I started

    for I in `cat languages_06_2018 `; do ./gutenberg2zim --languages=$I ; done

and at some point got

    Skipping already parsed file rdf-files/58098/pg58098.rdf
    Skipping already parsed file rdf-files/57197/pg57197.rdf
    Skipping already parsed file rdf-files/58100/pg58100.rdf
    Skipping already parsed file rdf-files/57198/pg57198.rdf
    Skipping already parsed file rdf-files/58101/pg58101.rdf
    Skipping already parsed file rdf-files/57199/pg57199.rdf
    Skipping already parsed file rdf-files/58102/pg58102.rdf
    Skipping already parsed file rdf-files/57200/pg57200.rdf
    Skipping already parsed file rdf-files/57201/pg57201.rdf
    Skipping already parsed file rdf-files/57202/pg57202.rdf
    Skipping already parsed file rdf-files/57203/pg57203.rdf
    Skipping already parsed file rdf-files/57204/pg57204.rdf
    Add possible url to db
    bash -c rsync -a --list-only rsync://aleph.gutenberg.org/gutenberg/ > tmp/file_on_aleph_gutenberg_org
    sed -i s#.* \(.*\)$#\1# tmp/file_on_aleph_gutenberg_org
    DOWNLOADING ebooks from mirror using filters
    EXPORTING ebooks to static folder (and JSON)
    ERROR: Unable to proceed. Combination of lamguages, books and formats has no result.

and then the process freezes...

dattaz commented 5 years ago

Strange behaviour, because we exit right after printing "ERROR: Unable to proceed. Combination of lamguages, books and formats has no result.", so the bash loop should move on to the next language. Do you know which language fails? Also, "languages_06_2018" has some comment lines at the beginning that should be ignored; maybe the issue comes from that (use tail -n X languages_06_2018 instead of cat).
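Both suggestions (skip the comment lines, find out which language fails) can be sketched in a few lines of shell. This is only a rough sketch under assumptions not confirmed in this thread: that comment lines in the list start with `#`, and that gutenberg2zim exits with a non-zero status after the "Unable to proceed" error. The function names `list_languages` and `build_all` are made up for illustration.

```shell
# Hypothetical helper: print only the usable language codes from a list file.
# Assumption: comment lines start with '#'; blank lines are ignored as well.
list_languages() {
    grep -v -e '^[[:space:]]*#' -e '^[[:space:]]*$' "$1"
}

# Hypothetical helper: run one build per language and report the ones that fail.
# Assumption: gutenberg2zim returns a non-zero exit status on error.
build_all() {
    list_languages "$1" | while IFS= read -r lang; do
        echo "=== $lang ==="
        # </dev/null stops the command from reading the language list on stdin
        if ! ./gutenberg2zim --languages="$lang" </dev/null; then
            echo "FAILED: $lang" >&2
        fi
    done
}
```

Calling `build_all languages_06_2018` would then print a `=== xx ===` marker before each run, so the last marker in the output shows which language was being processed if the loop hangs again.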

rgaudin commented 4 years ago

If not specified, formats used to default to 'epub', 'pdf'. If for some reason a language only has html books (several languages only have one book!), then you could end up in this situation.

To avoid this, and also to save a lot of time otherwise spent going through all the RDF files for every language (even though they're not parsed again), use the --one-language-one-zim=<folder> option.

You can also retry with the current version, which includes pdf as well.

Oh, and @dattaz's point about the comment lines might be valid.