openzim / gutenberg

Scraper for downloading the entire ebooks repository of project Gutenberg
https://download.kiwix.org/zim/gutenberg
GNU General Public License v3.0

Consider a better support of ZIM files without books in HTML #95

Open kelson42 opened 4 years ago

kelson42 commented 4 years ago

I think we should consider better support for ZIM files without HTML. The reasons are:

Currently I see two big reasons to keep the HTML versions:
1 - The full-text engine applies to HTML only
2 - The ability to see the content directly

These two things might be fixed by:
1 - Supporting full-text indexing of EPUBs (relatively easy), see https://github.com/openzim/libzim/issues/289
2 - Providing readers for multiple platforms within the ZIM... maybe even a pure web EPUB reader?

kelson42 commented 4 years ago

@eshellman This ticket might be of interest for you

Popolechien commented 4 years ago

Sure, sounds good but what would the final output look like compared to what we have now?

kelson42 commented 4 years ago

> Sure, sounds good but what would the final output look like compared to what we have now?

@Popolechien The same, but without the book in HTML directly usable from the browser. In its place we would have an info page explaining how to read the EPUB file in browsers, on mobile, on a computer, etc.

Popolechien commented 4 years ago

@kelson42 Well, if the idea is to save some space, how about offering either Gutenberg epubs or Gutenberg HTML and hope that people would know the difference? Much like we have Wikipedia with or without images, in a way.

kelson42 commented 4 years ago

> @kelson42 Well, if the idea is to save some space, how about offering either Gutenberg epubs or Gutenberg HTML and hope that people would know the difference? Much like we have Wikipedia with or without images, in a way.

@Popolechien This might be done, and can already be done, but it is not the point of this ticket, which is about providing a better UX without HTML. But maybe you just want to say "I don't think we need that: we should provide one with HTML and one with EPUB and people can only have one or the other and live with that."

Popolechien commented 4 years ago

> we should provide one with HTML and one with EPUB and people can only have one or the other and live with that.

Yes.

kelson42 commented 4 years ago

@Popolechien To me this would be a fallback solution. But I believe we might be able to solve the problem properly.

We could solve (2) in an even better manner by using a pure JavaScript EPUB reader, so that for the end user it would be a similar experience to having the HTML in the ZIM file. We could for example use https://github.com/futurepress/epub.js/

eshellman commented 4 years ago

The only tricky thing in deploying epub.js is overcoming same-origin JavaScript issues, but you are probably experienced with that.

On Nov 14, 2019, at 11:00 AM, Kelson wrote:

> @Popolechien To me this would be a fallback solution. But I believe we might be able to solve the problem properly.
>
> We could be able to solve (2) in an even better manner by using a pure javascript EPUB reader (so for the end user) it would be a similar experience as having the HTML in the ZIM file. We could for example use https://github.com/futurepress/epub.js/

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

kelson42 commented 1 year ago

Once #136 is implemented, we should be able to implement this ticket. The scraper would download the EPUB and parse it to extract the keywords for the search engine. Epub.js should be able to make the EPUB directly readable in the ZIM (to be tested).
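To make the "parse it to extract the keywords" step concrete, here is a minimal sketch that uses only the Python standard library. It assumes an EPUB is a ZIP archive whose content documents end in .xhtml, .html or .htm, and it reads them in sorted filename order rather than following the OPF spine, which a real implementation would probably want to do. The name extract_epub_text is hypothetical and not part of the scraper.

```python
import io
import zipfile
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def extract_epub_text(epub_bytes):
    """Return the text of every (X)HTML document in an EPUB, one line per document.

    Hypothetical helper: documents are read in sorted filename order,
    which is a simplification of the real reading order (OPF spine).
    """
    texts = []
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as archive:
        for name in sorted(archive.namelist()):
            if name.lower().endswith((".xhtml", ".html", ".htm")):
                collector = _TextCollector()
                markup = archive.read(name).decode("utf-8", errors="replace")
                collector.feed(markup)
                texts.append(" ".join(collector.chunks))
    return "\n".join(texts)
```

The returned string could then be handed to whatever builds the full-text index. How that hand-off happens is left open here.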

rgaudin commented 1 year ago

The most difficult part here is the one that's not been mentioned: the UI. With our generic UI that

What do entries look like? An HTML shell that displays epub.js at 100% size? Should it include a link/button to download/open the EPUB in case you have an external EPUB reader?
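As one possible answer, here is a rough sketch of how the scraper could generate such a shell. The file names jszip.min.js, epub.min.js and reader.js, the data-epub attribute and the function name are all assumptions for illustration. The epub.js calls used (ePub(), renderTo(), display()) come from its documented basic usage, and epub.js needs JSZip to open zipped .epub files. The reader code sits in a separate file so the page itself carries no inline JS. Page sizing and styling are left out.

```python
from html import escape

# Hypothetical reader.js, shipped once per ZIM rather than once per book.
READER_JS = """\
var epubPath = document.body.getAttribute("data-epub");
var book = ePub(epubPath);
var rendition = book.renderTo("viewer", { width: "100%", height: "100%" });
rendition.display();
"""


def render_reader_shell(title, epub_path):
    """Return the HTML of a hypothetical per-book reader entry."""
    safe_title = escape(title)
    safe_path = escape(epub_path, quote=True)
    return f"""<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>{safe_title}</title>
<script src="jszip.min.js"></script>
<script src="epub.min.js"></script>
</head>
<body data-epub="{safe_path}">
<a href="{safe_path}" download>Download EPUB</a>
<div id="viewer"></div>
<script src="reader.js"></script>
</body>
</html>
"""
```

The download link covers the external-reader case, and the same shell template could be reused for every book.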

I believe the search topic deserves its own ticket. https://github.com/openzim/libzim/issues/289 seems like the wrong solution to the problem. We don't want libzim to index EPUB. If libzim does it, then search results would point to the .epub entry and not to our epub.js shell… If we want to index the shell, then we need libzim NOT to index the .epub entries, otherwise we'll double the index size.

We'd need a scraper-level EPUB parser (and HTML, and PDF). Actually we could already (when also including HTML) build indexdata on the cover article and disable the libzim one on the HTML book, so that search points to the cover and not to the HTML itself.
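A toy model of that idea, with no libzim involved: the book's text is attached to the cover entry only, and the book file itself carries no index data, so a search can only ever return the cover. The Entry class, the paths and the search function are hypothetical and just illustrate the routing.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    path: str
    title: str
    indexed_text: str = ""  # empty means: not part of the full-text index


def build_entries(book_id, title, book_text):
    """Index the book's text on the cover entry, not on the book file."""
    cover = Entry(path=f"{book_id}_cover.html", title=title, indexed_text=book_text)
    book = Entry(path=f"{book_id}.epub", title=title)
    return [cover, book]


def search(entries, expression):
    """Naive stand-in for a full-text query: return the paths that match."""
    needle = expression.lower()
    return [e.path for e in entries if needle in e.indexed_text.lower()]
```

With this layout the text is indexed once per book, which also avoids the double index size mentioned above.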

Now one issue would be that books are very long and EPUB (and PDF) are paginated. If you're searching for an expression, is it acceptable to just link to the book cover? A WP article is a single page, so despite being cumbersome, you can easily ^F and find that text again.

In epub.js there is no search-in-book feature (yet??), so if you were not looking for a book but for an extract, it's gonna be useless… and I believe finding books is not what the full-text index is about (home page search probably does it better).

Jaifroid commented 1 year ago

I risk sounding like a broken record, but please remember users with older browsers and OS's, as well as those with restrictive CSPs. HTML is a universal way to access content that is supported everywhere (at least, static HTML). While it's fine if we can include a system in the ZIM to convert EPUB or PDF content to accessible (and searchable) HTML, we would need to be sure that such readers run under old browsers and restrictive CSPs. Otherwise you risk making ZIMs even more inaccessible than they already are. Even a modern Chrome extension can't access the current dynamic UI due to its use of inline JS (#145), and that is only going to get worse with the stricter CSPs in manifest v3 extensions also: https://github.com/kiwix/kiwix-js/issues/755.

So, I agree with the caution expressed by @rgaudin, but for slightly different reasons.
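One way to keep this concern testable on the scraper side is to check generated pages for inline JS before they go into the ZIM. The sketch below uses only the standard library and flags script tags without a src attribute and inline on* event handlers. It is an assumption about what such a check could look like, not an existing tool. It does not cover javascript: URLs, and it would also flag non-JS script blocks such as JSON data.

```python
from html.parser import HTMLParser


class InlineScriptFinder(HTMLParser):
    """Records constructs that a strict script-src CSP would block."""

    def __init__(self):
        super().__init__()
        self.problems = []

    def handle_starttag(self, tag, attrs):
        names = [name for name, _value in attrs]
        # A <script> with no src carries its code inline.
        if tag == "script" and "src" not in names:
            self.problems.append("inline <script>")
        # onclick, onload, ... are inline event handlers.
        for name in names:
            if name.startswith("on"):
                self.problems.append(f"inline handler {name} on <{tag}>")


def find_inline_js(html_text):
    """Return a list of inline-JS findings for one HTML page."""
    finder = InlineScriptFinder()
    finder.feed(html_text)
    return finder.problems
```

An empty list for every page would be a necessary, though not sufficient, condition for the ZIM to work under a restrictive CSP.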

Jaifroid commented 1 year ago

I've just checked, and epub.js doesn't work in IE11. Yes, IE11 is now history, but it's still a good proxy for old browser support...

image

benoit74 commented 10 months ago

For those who do not yet know about it, integrating an EPUB and a PDF reader has already been done for the kolibri scraper.

There is even a download button for those who prefer to use another reader.

Other questions regarding the resulting UI and the creation of multiple ZIMs (all, epub_only, html_only, pdf_only) are still relevant.
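For the multiple-ZIM question, a small sketch of how the four variants named above could map to the formats kept for each book. The flavour names come from this comment; the mapping, the function name and the error handling are assumptions.

```python
# Formats kept per ZIM flavour (flavour names taken from the comment above).
FLAVOURS = {
    "all": {"epub", "html", "pdf"},
    "epub_only": {"epub"},
    "html_only": {"html"},
    "pdf_only": {"pdf"},
}


def formats_to_include(flavour, available):
    """Return the formats of one book that go into a ZIM of this flavour."""
    try:
        wanted = FLAVOURS[flavour]
    except KeyError:
        raise ValueError(f"unknown flavour: {flavour!r}") from None
    return sorted(wanted & set(available))
```

A book that has none of the wanted formats yields an empty list, which leaves open whether such a book is skipped or gets an info page instead.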