openzim / gutenberg

Scraper for downloading the entire ebooks repository of project Gutenberg
https://download.kiwix.org/zim/gutenberg
GNU General Public License v3.0
126 stars 37 forks source link

UI does not limit the buttons displayed to only requested formats #159

Closed benoit74 closed 1 year ago

benoit74 commented 1 year ago

When we request at the CLI a list of formats (e.g epub + html) which is less than the list of formats of a given book (e.g. epub + html + pdf), only appropriate formats are exported in the ZIM (epub + html) but the UI shows a link to all existing formats (e.g. epub + html + pdf) the book supports (in the DB). There is hence broken links since we do not requested this additional format. Code seems to ignore this possibility.

Here's a sample command to use:

python gutenberg2zim -z gut_fr_test.zim -b 18812 -l fr -f epub,html

Looks like the issue is present in many places: cover article (gutenbergtozim/templates/cover_article.html), HTML version of a book (gutenbergtozim/templates/book_infobox.html) and bookshelves (2 times in gutenbergtozim/templates/js/tools.js).

benoit74 commented 1 year ago

@rgaudin @kelson42 It looks like this is indeed a regression induced by my last PR. I will look into it today.

benoit74 commented 1 year ago

I confirmed (again) that the issue was already present before my PR in v1.1.9 As far as I can tell for now, the bug is not present anymore if I comment few strange lines forcing the PDF format for all books.

rgaudin commented 1 year ago

Can you add a condition there to only do it if pdf was requested?

benoit74 commented 1 year ago

I just confirmed this bug has been made more visible by the lines / commit mentioned in my former comment. This change is forcing the presence of a PDF format for all books. It is mentioned in the commit / code this was done because some RDFs where not mentioning the PDF as an available format even if the file was available.

Adding a condition to only do it if pdf was requested is not really an option because it means that when pdf is requested, all books which do not have a pdf will be have broken links in the ZIM.

Rather than adding such condition, I would prefer to remove completely these lines. I can confirm this is feasible while working on https://github.com/openzim/gutenberg/issues/97.

As suggested in this issue 97, I prefer to ask @eshellman / PG to fix the issue upstream (if any is still there) rather than maintaining oddities like this which in the end degrade the overall scraper quality / makes it more complex to understand / maintain. I can definitely understand this is easier to say and to do today than years ago.

There is nevertheless still a "UI" bug because all formats present in DB are displayed on the various pages, no matter the restriction of format specified on the CLI. If you request only HTML for instance, you still get a EPUB button on all books with a potential EPUB format.

Finally, one last question: is there really a reason to force the presence of HTML format (see here)? Especially once https://github.com/openzim/gutenberg/issues/95 has been solved

benoit74 commented 1 year ago

I created two separate issues for clarity, let's focus only the UI bug on this one (my bad for having caused some confusion maybe):