openzim / gutenberg

Scraper for downloading the entire ebooks repository of project Gutenberg
https://download.kiwix.org/zim/gutenberg
GNU General Public License v3.0

regression: Missing HTML content #219

Open rgaudin opened 3 months ago

rgaudin commented 3 months ago

In gutenberg_en_all_2024-02, out of the 10 books listed (all of which claim to offer an HTML version), only three actually have an HTML version.

[Screenshot 2024-03-05 at 10 06 08]

Either the HTML version is not present in the ZIM or the link is incorrect (it's the same link in the listing and in the preview page). This is not limited to those 10 entries, but it makes this 75GB ZIM look like garbage.

Initially reported by an Offspot user.

benoit74 commented 3 months ago

Who ...

If you search for Mary Wollstonecraft Shelley in the author search box, you will realize there are 3 versions of the Frankenstein book (book ids 84, 41445, 42324)

Same for Moby Dick, 2 versions. Probably the same for others as well.

I looked at the ePubs (which are OK) and their content is slightly different, so there is clearly a difference between these books.

I looked at https://aleph.pglaf.org/cache/epub/84/ and we see there is an HTML version of book 84, with illustrations.

I will have to reproduce locally, but as usual it will take some time to rebuild the local database from rsync result.

Popolechien commented 3 months ago

Ok I’ve tried the first two pages and about 2/3 of the books are missing. It gets better as one goes deeper, but it is the first impression that matters.

I’ve tested a half dozen other languages, no problem there but there weren't many (or any) that had several versions of the same book.

I have put the recipe on hold on Zimfarm

@benoit74 can you please delete https://download.kiwix.org/zim/gutenberg/gutenberg_mul_all_2024-02.zim https://download.kiwix.org/zim/gutenberg/gutenberg_mul_all_2024-01.zim https://download.kiwix.org/zim/gutenberg/gutenberg_en_all_2024-01.zim and https://download.kiwix.org/zim/gutenberg/gutenberg_en_all_2024-02.zim

rgaudin commented 3 months ago

@Popolechien please make a removal request on zim-requests with the appropriate flag, otherwise we'll lose it and you'll open a ticket in 6 months asking where the gutenberg files are 😉

kelson42 commented 3 months ago

I'm concerned about a general problem here which might lead to pausing all recipes... or do we have a chance to know exactly which ZIM are impacted?

benoit74 commented 3 months ago

> I'm concerned about a general problem here which might lead to pausing all recipes... or do we have a chance to know exactly which ZIM are impacted?

You mean, all gutenberg recipes, right? (there is only one btw)

@Popolechien's tests suggest that only en (and hence mul) are impacted, as far as we can tell (this is probably wrong, but occurrences are at least far less visible in other languages).

@Popolechien: you wanna remove both ZIMs we have because you tested both and they both have the issue?

Are we sure we want to do this (not provide the en + mul ZIM anymore) given the fact that ePub / PDF is still available?

Should we run a new (temporary) recipe for en + mul with only PDF + ePub as requested formats, so that at least the ZIM does not contain invalid links? We could name it the nohtml flavor. It is just a matter of configuration normally.

kelson42 commented 3 months ago

This seems a pretty serious issue IMHO. If I get everything right, the wise thing would be to deactivate the gutenberg recipe until this issue is closed and a new release is done.

Popolechien commented 3 months ago

@rgaudin yeah my bad for some reason I thought the ticket was there already. @benoit74 Yes both en and mul. If 2/3 of the content seems missing on the first page (and I stress that it seems missing) then that's very sub-par UX. Considering the size of the zim and the time/data costs invested to download it, I'd rather not impose this on users.

I was not aware of the possibility of running the recipe with PDF and ePubs only, but that seems acceptable, yes.

benoit74 commented 3 months ago

You tested both 2024_01 and 2024_02 ZIMs, both have the issue?

benoit74 commented 3 months ago

Recipe for only epub+pdf is here: https://farm.openzim.org/recipes/gutenberg_mul_epub-pdf

Can you confirm we want this and that you did not spot anything stupid? I've activated the "multiple ZIM" mode: should we discover we have the issue in other languages as well, we will be happy to have ZIMs in all languages. It should take about 1 day to produce if I trust the last run duration.

benoit74 commented 3 months ago

> This seems a pretty serious issue IMHO. If I get everything right, the wise thing would be to deactivate the gutenberg recipe until this issue is closed and a new release is done.

This was done more than an hour ago.

benoit74 commented 3 months ago

OK, so regarding the "real issue", I have debugged the scraper logic for book 84.

Foreword: this scraper logic is a nightmare, I won't dive into details

As you've probably already guessed, there are basically two issues:

1. The scraper does not care that the HTML version has not been found when it renders the UI.

I suspected that this first part could be a regression induced by https://github.com/openzim/gutenberg/pull/163 but I don't think so. The situation was improved by this PR but not fully fixed: before this PR, buttons were always displayed when the book was supposed to have a given format available according to the RDF; with the PR (now), the buttons are hidden if a given format is not requested; we should go further and also hide the button if we fail to download the requested format.

2. The scraper fails to find the HTML version of the book.

For book 84, the HTML versions available at https://www.gutenberg.org/files/84/84-h/84-h.htm or at https://www.gutenberg.org/cache/epub/84/pg84-images.html (which is also where the "magic logic" URL from @eshellman for this book's HTML, https://www.gutenberg.org/ebooks/84.html.images, redirects) are not among the tens of potential URLs considered by the scraper (see code block below).

"http://aleph.pglaf.org/8/84/84-h.htm"
"http://aleph.pglaf.org/8/84/84-h.html"
"http://aleph.pglaf.org/8/84/84-h.zip"
"http://aleph.pglaf.org/cache/epub/84/pg84.html.utf8"
"http://aleph.pglaf.org/etext00/84-h.htm"
"http://aleph.pglaf.org/etext01/84-h.htm"
"http://aleph.pglaf.org/etext02/84-h.htm"
"http://aleph.pglaf.org/etext03/84-h.htm"
"http://aleph.pglaf.org/etext04/84-h.htm"
"http://aleph.pglaf.org/etext05/84-h.htm"
"http://aleph.pglaf.org/etext90/84-h.htm"
"http://aleph.pglaf.org/etext91/84-h.htm"
"http://aleph.pglaf.org/etext92/84-h.htm"
"http://aleph.pglaf.org/etext93/84-h.htm"
"http://aleph.pglaf.org/etext94/84-h.htm"
"http://aleph.pglaf.org/etext95/84-h.htm"
"http://aleph.pglaf.org/etext96/84-h.htm"
"http://aleph.pglaf.org/etext97/84-h.htm"
"http://aleph.pglaf.org/etext98/84-h.htm"
"http://aleph.pglaf.org/etext99/84-h.htm"

For book 41445 (which works), the HTML version is found at http://aleph.pglaf.org/4/1/4/4/41445/41445-h.zip

Full list of potential URLs for 41445 below:

"http://aleph.pglaf.org/4/1/4/4/41445/41445-h.htm"
"http://aleph.pglaf.org/4/1/4/4/41445/41445-h.html"
"http://aleph.pglaf.org/4/1/4/4/41445/41445-h.zip" <= found in RSYNC results, present on server
"http://aleph.pglaf.org/cache/epub/41445/pg41445.html.utf8"
"http://aleph.pglaf.org/etext00/41445-h.htm"
"http://aleph.pglaf.org/etext01/41445-h.htm"
"http://aleph.pglaf.org/etext02/41445-h.htm"
"http://aleph.pglaf.org/etext03/41445-h.htm"
"http://aleph.pglaf.org/etext04/41445-h.htm"
"http://aleph.pglaf.org/etext05/41445-h.htm"
"http://aleph.pglaf.org/etext90/41445-h.htm"
"http://aleph.pglaf.org/etext91/41445-h.htm"
"http://aleph.pglaf.org/etext92/41445-h.htm"
"http://aleph.pglaf.org/etext93/41445-h.htm"
"http://aleph.pglaf.org/etext94/41445-h.htm"
"http://aleph.pglaf.org/etext95/41445-h.htm"
"http://aleph.pglaf.org/etext96/41445-h.htm"
"http://aleph.pglaf.org/etext97/41445-h.htm"
"http://aleph.pglaf.org/etext98/41445-h.htm"
"http://aleph.pglaf.org/etext99/41445-h.htm"

What next

We cannot add the https://aleph.pglaf.org/8/84/84-h/84-h.htm pattern to the list generated above because the 8/84/84-h folder is normally reserved for the "extracted ZIP" version, i.e. this folder contains not only the HTML but also all the images. In such a situation we do not want to grab only the HTML, since we need all the images as well for proper rendering.
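For reference, the two candidate lists above are consistent with a generator along the following lines. This is only a sketch reconstructed from the listed output, not the scraper's actual code: the function name `candidate_html_urls` and its parameters are hypothetical, and single-digit book ids are not covered by the examples in this thread, so they are not handled here.

```python
def candidate_html_urls(book_id: int, base: str = "http://aleph.pglaf.org") -> list[str]:
    """Rebuild the list of candidate HTML URLs shown above for a given book id.

    Hypothetical helper: the patterns are inferred from the two lists in this
    comment (books 84 and 41445), not taken from the scraper source.
    """
    bid = str(book_id)
    # "Digits" tree: one directory per digit except the last one, then the id,
    # e.g. 84 -> "8/84" and 41445 -> "4/1/4/4/41445".
    tree = "/".join(list(bid[:-1]) + [bid])
    urls = [
        f"{base}/{tree}/{bid}-h.htm",
        f"{base}/{tree}/{bid}-h.html",
        f"{base}/{tree}/{bid}-h.zip",
        f"{base}/cache/epub/{bid}/pg{bid}.html.utf8",
    ]
    # Legacy "etextNN" folders: 00 to 05, then 90 to 99, as in the lists above.
    legacy = [f"{n:02d}" for n in (*range(0, 6), *range(90, 100))]
    urls += [f"{base}/etext{n}/{bid}-h.htm" for n in legacy]
    return urls


if __name__ == "__main__":
    candidates = candidate_html_urls(84)
    # Neither HTML location known to exist for book 84 matches a candidate.
    print(any(u.endswith("/84-h/84-h.htm") for u in candidates))
    print(any(u.endswith("pg84-images.html") for u in candidates))
```

Both checks print `False` for book 84, which matches the observation that none of the candidates point at an existing HTML file for that book.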

I'm not very inclined to fix only the fact that the scraper does not care that the HTML version has not been found when it renders the UI because, as far as I've understood, the HTML version is very important for our users (see comments on https://github.com/openzim/gutenberg/issues/161). Fixing only this could help as an interim solution to "at least build a relevant ZIM without buttons leading to nowhere", but I do not recommend this approach, which is only putting lipstick on a pig.
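To make the interim option concrete, the button logic would boil down to something like the sketch below. The function name and the three sets are hypothetical and only illustrate the rule (declared in the RDF, requested in the recipe, and actually downloaded); this is not how the scraper is currently structured.

```python
def formats_to_display(declared, requested, downloaded):
    """Return the formats whose button should be shown on a book page.

    A button is shown only when the format is declared for the book,
    requested for this run, and was actually downloaded.
    """
    return sorted(set(declared) & set(requested) & set(downloaded))


if __name__ == "__main__":
    # Book 84 as described in this issue: HTML is declared and requested,
    # but the scraper never found it, so only ePub and PDF were downloaded.
    print(formats_to_display(
        declared={"html", "epub", "pdf"},
        requested={"html", "epub", "pdf"},
        downloaded={"epub", "pdf"},
    ))
```

With these inputs the HTML button would simply disappear for book 84, which hides the symptom without bringing the HTML content back.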

I think that at this point we need to invest time in seriously simplifying the scraper code to get rid of all the "fallback" mechanisms we have, which are only biting us now.

In other words, finally implement what has been imagined and more or less prepared in https://github.com/openzim/gutenberg/issues/97 (I just renamed it; we won't move to the OPDS catalog according to the latest discussions in the issue).

WDYT?

Popolechien commented 3 months ago

LGTM, thanks a lot.

Regarding the interim recipe, I've disabled the multiple languages output (we would have duplicate files with almost the same name and for very limited added value; I find this confusing rather than helpful). Let's see this as an English problem and an English fix. I have changed the language settings (and recipe name) accordingly, please double check before launching the recipe.

I have also disabled the bookshelves feature, apparently according to #184 the feature is not maintained by Gutenberg folks.

benoit74 commented 3 months ago

Interim recipe started.

Be aware that doing it only for English also means we will not provide the mul big ZIM anymore in the interim.

I don't get what the problem is with the mostly similar name; we already have this situation for Wikipedia with its flavors. Mostly the same name, same title, same description, only the size differs. It is only a UI issue.

Popolechien commented 3 months ago

> Be aware that doing it only for English also means we will not provide the mul big ZIM anymore in the interim.

I'm fine with that, its use case always seemed dubious to me in the first place.

Regarding the Wikipedia example, that's exactly the problem I had in mind (the question comes regularly as to why these three and what the difference is, despite all the FAQ, message, etc.)

eshellman commented 3 months ago

Over the past 2-3 years, a lot of effort has been put into upgrading all 70,000 PG books to validated html5 and epub3. There are two trees in the file system, the "1/2/3/4/5" tree and the "cache/epub" tree. The generated epub3 and html5 files are in the "cache/epub" tree. Both of these are in the aleph mirror. I don't remember how we were handling epub, but the generated HTML5 did not exist yet when this was last implemented.

As you might expect, the generated html5 is much more uniform in quality compared to the source files, which come in all sorts of htm and txt flavors!
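As an illustration only, a single-source lookup in the "cache/epub" tree could be as small as the sketch below. The function name is hypothetical, and the file name pattern is taken from the book 84 URL quoted earlier in this issue; whether every book and every mirror follows exactly this naming is an assumption to verify.

```python
def generated_html_url(book_id: int, base: str = "https://www.gutenberg.org") -> str:
    """Build the URL of the generated HTML5 file in the "cache/epub" tree.

    The pattern is inferred from a single example (book 84); treat it as an
    assumption, not a documented contract.
    """
    return f"{base}/cache/epub/{book_id}/pg{book_id}-images.html"


if __name__ == "__main__":
    print(generated_html_url(84))
```

Compared with probing many legacy locations, there is one URL per book and no fallback order to maintain.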

benoit74 commented 3 months ago

https://farm.openzim.org/recipes/gutenberg_en_epub-pdf did not produce the expected outcome. I forgot again that the HTML format is mandatory (see https://github.com/openzim/gutenberg/issues/161); we can only request not to put epub or pdf in the ZIM ...

I've disabled the recipe (we can probably delete it, it is only misleading) and the ZIM (still suffering the same HTML issue).

@kelson42 do you consider this is a fast-track issue which needs to be fixed asap (i.e. with more priority than other projects I have)?

benoit74 commented 3 months ago

> Over the past 2-3 years, a lot of effort has been put into upgrading all 70,000 PG books to validated html5 and epub3. There are two trees in the file system, the "1/2/3/4/5" tree and the "cache/epub" tree. The generated epub3 and html5 files are in the "cache/epub" tree. Both of these are in the aleph mirror. I don't remember how we were handling epub, but the generated HTML5 did not exist yet when this was last implemented.
>
> As you might expect, the generated html5 is much more uniform in quality compared to the source files, which come in all sorts of htm and txt flavors!

I now really consider it mandatory to make the changes needed to fix https://github.com/openzim/gutenberg/issues/97 and get a scraper which is faster, easier to maintain, and produces a ZIM with more uniform quality.

kelson42 commented 3 months ago

@benoit74 How much work do you estimate it would take to bring things back to normal in good and sustainable conditions?

benoit74 commented 3 months ago

@kelson42 In man days, 5 to 10 days probably (including PoC, reviews, ...). In elapse ...

eshellman commented 3 months ago

Anything I can help with, let me know.

On Mar 7, 2024, at 7:46 AM, benoit74 wrote:

> @kelson42 In man days, 5 to 10 days probably (including PoC, reviews, ...). In elapse ...


eshellman commented 3 months ago

I've added an update to #97 that I hope will help

benoit74 commented 3 months ago

Thank you!