openzim / gutenberg

Scraper for downloading the entire ebooks repository of project Gutenberg
https://download.kiwix.org/zim/gutenberg
GNU General Public License v3.0

EPUB files and covers are no longer in the download as of 2022-02 #147

Closed jules2689 closed 2 years ago

jules2689 commented 2 years ago

EPUB

I've been looking at the gutenberg zim files and I've noticed that the EPUB files are not consistently available.

For example, Pride and Prejudice is not available in 2022-02:

However, in 2022-01 the EPUB works (I downloaded the old version and tried it out).

I continued testing other files and could not find a single EPUB that works in 2022-02. Most of the links worked in 2022-01, although some, like A Tale of Two Cities, were already not found in 2022-01.

If you look at the file size, 2022-02 is ~10GB less than 2022-01:

Covers

2022-02 is also missing covers.

2022-02: [screenshot: 2022-02 is missing covers]

2022-01: [screenshot: 2022-01 has covers]

kelson42 commented 2 years ago

@eshellman I have deleted your comment, as IP information is private.

kelson42 commented 2 years ago

The log confirms that the EPUB cannot be found:

$ grep -P '#1342[#./, ]{1}' 878b4a6d9463cffec1e56026_gutenberg.log 
[gutenbergtozim::2022-02-11 17:50:01,921] DEBUG:[epub] Requesting URLs for #1342# Pride and Prejudice
[gutenbergtozim::2022-02-11 17:50:02,248] ERROR:NO FILE FOR #1342/epub
[gutenbergtozim::2022-02-11 17:50:02,282] DEBUG:[pdf] Requesting URLs for #1342# Pride and Prejudice
[gutenbergtozim::2022-02-11 17:50:02,496] ERROR:NO FILE FOR #1342/pdf
[gutenbergtozim::2022-02-11 17:50:02,601] DEBUG:[html] Requesting URLs for #1342# Pride and Prejudice
[gutenbergtozim::2022-02-11 23:48:33,233] INFO: Exporting Book #1342.
[gutenbergtozim::2022-02-11 23:48:33,234] WARNING:Missing HTML content for #1342 at dl-cache/1342/unoptimized/1342.html
[gutenbergtozim::2022-02-12 09:04:36,051] INFO: Exporting Book #1342.
[gutenbergtozim::2022-02-12 09:04:36,051] WARNING:Missing HTML content for #1342 at dl-cache/1342/unoptimized/1342.html
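
To see how widespread the problem is beyond a single book, the same log could be summarized per book rather than grepped one ID at a time. The following is only a minimal sketch, not part of gutenbergtozim: the function name `missing_formats` is hypothetical, and the regular expression assumes every such error has the `ERROR:NO FILE FOR #<id>/<format>` shape seen in the excerpt above.

```python
import re
from collections import defaultdict

# Assumed line shape, taken from the log excerpt:
# [gutenbergtozim::2022-02-11 17:50:02,248] ERROR:NO FILE FOR #1342/epub
NO_FILE_RE = re.compile(r"ERROR:NO FILE FOR #(\d+)/(\w+)")


def missing_formats(lines):
    """Map each book ID to the set of formats the log reports as missing."""
    missing = defaultdict(set)
    for line in lines:
        match = NO_FILE_RE.search(line)
        if match:
            book_id, fmt = match.groups()
            missing[int(book_id)].add(fmt)
    return dict(missing)


# A few lines copied from the excerpt above, used as sample input.
sample = [
    "[gutenbergtozim::2022-02-11 17:50:01,921] DEBUG:[epub] Requesting URLs for #1342# Pride and Prejudice",
    "[gutenbergtozim::2022-02-11 17:50:02,248] ERROR:NO FILE FOR #1342/epub",
    "[gutenbergtozim::2022-02-11 17:50:02,496] ERROR:NO FILE FOR #1342/pdf",
    "[gutenbergtozim::2022-02-11 23:48:33,233] INFO: Exporting Book #1342.",
]

report = missing_formats(sample)
```

For a full log, the open file object can be passed directly as `lines`, since the function only iterates over it line by line.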

kelson42 commented 2 years ago

@rgaudin Any idea what is going on here?

eshellman commented 2 years ago

looks to me like an issue at the gutenberg mirror. I'm looking into it.

kelson42 commented 2 years ago

@eshellman Thank you very much!

eshellman commented 2 years ago

the cache/epub/* tree no longer fits on aleph; use dante instead. https://www.gutenberg.org/dirs/MIRRORS.ALL

kelson42 commented 2 years ago

@eshellman dante is not in the list.

@rgaudin http://aleph.gutenberg.org/cache/epub/, which is still used as the base for constructing URLs in a few places, no longer exists at all, apparently since Fall 2021.

rgaudin commented 2 years ago

@eshellman, as @kelson42 pointed out, dante is not in the list, and neither http://dante.gutenberg.org/ nor ftp://dante.gutenberg.org/ works. I'll update the mirror once you give us its address.

eshellman commented 2 years ago

apologies. https://dante.pglaf.org/
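
Since the old aleph base was used in a few places to construct URLs, one way to make a mirror change like this a one-line edit is to keep the base in a single constant. This is a hedged sketch rather than the scraper's actual code: `MIRROR_BASE` and `epub_cache_dir` are hypothetical names, and the per-book `cache/epub/<id>/` layout on the new mirror is an assumption carried over from the old one.

```python
from urllib.parse import urljoin

# Mirror address given in this thread; kept in one place so it is easy to change.
MIRROR_BASE = "https://dante.pglaf.org/"


def epub_cache_dir(book_id, base=MIRROR_BASE):
    """Build the URL of a book's directory under the mirror's cache/epub/ tree.

    Assumption: the mirror exposes cache/epub/<book_id>/ at its root.
    """
    # urljoin replaces the last path segment unless the base ends with a slash.
    if not base.endswith("/"):
        base += "/"
    return urljoin(base, f"cache/epub/{book_id}/")


new_url = epub_cache_dir(1342)
old_url = epub_cache_dir(1342, base="http://aleph.gutenberg.org")
```

The function only builds the string; whether the directory actually exists on a given mirror still has to be checked with a request.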