gutenbergtools / ebookmaker

The Project Gutenberg tool to generate EPUBs and other ebook formats.
GNU General Public License v3.0

v0.13 enhancement list #139

Open eshellman opened 1 year ago

eshellman commented 1 year ago

Leave comments on this issue for enhancements desired for the next major version of Ebookmaker.

Possibilities (subject to feasibility and community support)

gbnewby commented 1 year ago

Cover page handling: https://github.com/gutenbergtools/ebookmaker/issues/132

eshellman commented 1 year ago

We may be able to address the problem in #132 with a CSS tweak; the difficulty is in testing rather than in the Ebookmaker code.

asylumcs commented 1 year ago

As far as testing for #132, the suggested code is the default cover image handling from Sigil (https://sigil-ebook.com/sigil/). We can benefit from the implied testing by Sigil users.

gbnewby commented 1 year ago

Generate errors for use of blocklisted items. If someone uses \ for example, this should be an ERROR in ebm, so submitters can see in output.txt (via https://ebookmaker.pglaf.org) they cannot use that.

Over time the blocklist will shrink. The blocklist is in the DP Wiki, but ebm code is the canonical source of truth for what HTML or CSS elements/constructs/syntax/variations/etc. are not allowed.

If there are things that are truly harmless (maybe like \ instead of \), perhaps that should be a WARNING rather than an ERROR. But anything on the blocklist should be an error.
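
A minimal sketch of how such a check might report findings, assuming a scan with Python's standard-library `html.parser`. The tag names in `ERROR_TAGS` and `WARNING_TAGS` and the `check_html` helper are hypothetical placeholders, not the actual blocklist or Ebookmaker code:

```python
from html.parser import HTMLParser

# Hypothetical entries, for illustration only. The real blocklist is
# maintained in the DP Wiki and in the Ebookmaker code.
ERROR_TAGS = {"blink", "marquee"}
WARNING_TAGS = {"center"}


class BlocklistChecker(HTMLParser):
    """Collect (level, line, message) tuples for blocklisted start tags."""

    def __init__(self):
        super().__init__()
        self.findings = []

    def handle_starttag(self, tag, attrs):
        line, _col = self.getpos()  # line numbers are 1-based
        if tag in ERROR_TAGS:
            self.findings.append(("ERROR", line, f"<{tag}> is not allowed"))
        elif tag in WARNING_TAGS:
            self.findings.append(("WARNING", line, f"<{tag}> is discouraged"))


def check_html(source):
    """Return all blocklist findings for an HTML string, in document order."""
    checker = BlocklistChecker()
    checker.feed(source)
    checker.close()
    return checker.findings


sample = "<p>ok</p>\n<center>old</center>\n<blink>no</blink>"
for level, line, message in check_html(sample):
    print(f"{level}: line {line}: {message}")
```

Keeping the two levels in separate sets would let an item move from ERROR to WARNING (or off the list) without touching the reporting code.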

eshellman commented 1 year ago

It looks like the open source paged.js (https://pagedjs.org/) will be a nice path to generating high-quality PDFs from our HTML5 files. I had dinner with the developers on Friday. They use PG files for their demos!

gbnewby commented 1 year ago

That does look promising. I didn't see the PG examples on their Examples page.

It looks like they support use of some directives in HTML to influence page layout. That could be of interest to DPers. I like what I saw about image resizing.


eshellman commented 1 year ago

I also talked to the folks from Benetech (Bookshare) about accessibility - they were happy to hear about EPUB3. I was unable to say how well PG is doing with accessible alt attributes so I'm adding some logging to help us understand how much we're complying with guidelines.
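
As a rough illustration of the kind of logging described here, the sketch below counts `img` elements with a missing or empty `alt` attribute. It assumes the standard-library `html.parser`; the `AltAudit` class, the `audit_alt` helper, and the logger name are hypothetical and are not taken from Ebookmaker:

```python
import logging
from html.parser import HTMLParser

logger = logging.getLogger("alt_audit")  # hypothetical logger name


class AltAudit(HTMLParser):
    """Count img elements and how many lack useful alt text."""

    def __init__(self):
        super().__init__()
        self.total = 0
        self.missing = 0  # no alt attribute at all
        self.empty = 0    # alt present but blank

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        self.total += 1
        attr_map = dict(attrs)
        if "alt" not in attr_map:
            self.missing += 1
        elif not (attr_map["alt"] or "").strip():
            # A bare `alt` with no value is reported as None by the parser.
            self.empty += 1


def audit_alt(source):
    """Log and return (total, missing, empty) for an HTML string."""
    audit = AltAudit()
    audit.feed(source)
    audit.close()
    logger.info(
        "img: %d total, %d missing alt, %d empty alt",
        audit.total, audit.missing, audit.empty,
    )
    return audit.total, audit.missing, audit.empty
```

An empty `alt` is counted separately because it is legitimate for purely decorative images, so it may deserve a different log level than a missing attribute.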

charliehoward4dp commented 1 year ago

Now that DP has made audio files mandatory for all ebooks containing musical scores, please consider adding corresponding support for those files to EBM. All smartphones and many tablets support audio, and the music files typically play for only a few seconds, so they aren't particularly large.

The files will be in a "music" subfolder. In the examples I've seen so far, the links to them are simply <a href="music/xxx.mp3">Listen</a>.

Sometimes in the same folder there's also a corresponding .mxl (compressed MusicXML) file, which is the editable source used to compose the .mp3. When a similar link <a href="music/xxx.mxl">Download MusicXML</a> is clicked, a "Save as" dialog appears in the browser. I don't know whether that should or could be supported by EBM as well.

eshellman commented 1 year ago

@charliehoward4dp take a look at the HTML5 audio element: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/audio I can imagine DP developing guidelines for use of this element - that would be a prerequisite for support in Ebookmaker. We would want to forbid autoplay, come up with size limits, etc.
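
A sketch of what checking such guidelines could look like, assuming Python's standard-library `html.parser`. No guidelines exist yet, so `MAX_AUDIO_BYTES`, the `AudioCheck` class, and the `check_audio` helper are all hypothetical; file sizes are passed in as a dictionary to keep the example self-contained:

```python
from html.parser import HTMLParser

MAX_AUDIO_BYTES = 1_000_000  # hypothetical limit; none has been agreed


class AudioCheck(HTMLParser):
    """Record autoplay use and the audio files an HTML document references."""

    def __init__(self):
        super().__init__()
        self.in_audio = False
        self.problems = []
        self.sources = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "audio":
            self.in_audio = True
            if "autoplay" in attr_map:
                self.problems.append("autoplay is not allowed on <audio>")
        is_audio_src = tag == "audio" or (tag == "source" and self.in_audio)
        if is_audio_src and attr_map.get("src"):
            self.sources.append(attr_map["src"])

    def handle_endtag(self, tag):
        if tag == "audio":
            self.in_audio = False


def check_audio(source, sizes):
    """Return a list of problems; `sizes` maps each src path to bytes."""
    checker = AudioCheck()
    checker.feed(source)
    checker.close()
    problems = list(checker.problems)
    for src in checker.sources:
        if sizes.get(src, 0) > MAX_AUDIO_BYTES:
            problems.append(f"{src} exceeds {MAX_AUDIO_BYTES} bytes")
    return problems
```

In a real build the sizes would come from the files in the "music" subfolder, for example via `os.path.getsize`.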

LMCantoni commented 1 year ago

@eshellman Hi Eric, I'm the DP Music Coordinator. Thanks so much for considering Charlie's request regarding audio & MusicXML files. I've been following the Slack discussions, so I know that this is for the future, but it would be so great to have this capability. I certainly agree that autoplay should be forbidden. As for size limits, to give you an idea of the typical mp3 size, I'm post-processing a history of chamber music with many music snippets of a few bars each (mainly string quartets), and the average size is around 300-500K.

eshellman commented 1 year ago

@LMCantoni Super! Where is the best place to have the (somewhat technical, somewhat musical) discussions with people who can help?

charliehoward4dp commented 1 year ago

HERE at Dropbox https://www.dropbox.com/s/oldwqs5fcbsxtxy/audiotest.epub?dl=0 is a hand-modified epub3 containing a playable audio file via the <audio> tag. It works with ADE, Calibre, and Nook. It does not work with Kindle for iOS, iBooks, or Google Play Books. The underlying epub3 was generated by eBookMaker. I added the audio file and enabled the necessary HTML.

LMCantoni commented 1 year ago

> @LMCantoni Super! Where is the best place to have the (somewhat technical, somewhat musical) discussions with people who can help?

Would it be worthwhile to have a Slack channel dedicated to the music issue?

LMCantoni commented 1 year ago

> HERE at Dropbox https://www.dropbox.com/s/oldwqs5fcbsxtxy/audiotest.epub?dl=0 is a hand-modified epub3 containing a playable audio file via the <audio> tag. It works with ADE, Calibre, and Nook. It does not work with Kindle for iOS, iBooks, or Google Play Books. The underlying epub3 was generated by eBookMaker. I added the audio file and enabled the necessary HTML.

Thanks, Charlie. Not surprisingly, the audio didn't work with Kindle for Android or Kindle for PC.

gbnewby commented 1 year ago

I'd like to see better presentation of multiple creators (authors etc.) in the head section.

Currently, labels are repeated. For example:

The non-generated file in https://www.gutenberg.org/files/69679/69679-h/69679-h.htm has:

Authors: Charles Francis Adams Gilbert Nash Charles Francis Adams III

The generated file in https://www.gutenberg.org/cache/epub/69679/pg69679-images.html has:

Author: Charles Francis Adams
Author: Charles Francis Adams III
Author: Gilbert Nash

The first way is much more visually appealing. Even better would be: Authors: Charles Francis Adams, Gilbert Nash and Charles Francis Adams III

gbnewby commented 1 year ago

Related to my previous comment: I think it's reasonable to only list the title on the first line of the HTML. I think we will do this for the workflow-provided items as well. This is because handling multiple authors and variants is often challenging.

So, instead of: The Project Gutenberg eBook of Wessagusset and Weymouth, by Charles Francis Adams et al.

I'd be pleased with: The Project Gutenberg eBook of Wessagusset and Weymouth

gbnewby commented 1 year ago

I'd like the cover to be the very first thing people see, in all formats. (This is something I'm also working with the production team on, via the Workflow system). This is already done for the ereader formats. For example, see https://www.gutenberg.org/ebooks/69703 where the epub starts with the cover image, but the HTML (both native & generated) doesn't show the cover at all.

I'd prefer the cover to be the first thing people see. Even if it's a generated cover or boring cover. The existing header (metadata) & license blurb can appear afterwards.

eshellman commented 1 year ago

With regard to the repeated authors, what we actually have in the database is a list of creators, which includes illustrators, translators, editors, etc. Since the order of authors is usually presumed to be significant, we shouldn't want to change that in the generated list. But it seems the order may not be preserved in the db, so we will need to look at that.

Provided we can reconstitute the author order, it's probably easy to do "author1, author2, illustrator1 (ill), illustrator2 (ill), translator1 (trl)".

Except.... the Creator names in our database are stored in the form "Lastname, Firstname, Suffix, (Parenthetical)" So there will inevitably be some reconstituted authorlists which are either mangled or ugly. The advantage of putting each creator on a separate line is that it will accurately reflect the contents of the database, which can be updated as need be.

eshellman commented 1 year ago

Incorporating the cover into the HTML5 presentation is a good idea. One thing to consider is that the cover is often used as a representation of the book on other websites so clicking the cover to see the... cover again might not be the best UI, especially on small screens. Of course, seeing the license blurb first is not optimal, either.

The difficult thing will be figuring out whether the cover is already there first - we definitely don't want duplicate covers in the html - that already occurs too often in the epub files.

eshellman commented 1 year ago

sample authors:

"Quintus, Smyrnaeus, active 4th century"
"Du Bois, W. E. B. (William Edward Burghardt)"
"Library of Congress. Copyright Office"
"John of Damascus, Saint"
"John Murray (Firm)"
"Caine, Hall, Sir"
"Plato (spurious and doubtful works)"

gbnewby commented 1 year ago

My suggestion was not to fix the weirdness in how authors are presented. The idea is just to not repeat the Author: tag.

Separating by semicolon could work, for example:

Authors: Du Bois, W. E. B. (William Edward Burghardt); John Murray (Firm); Library of Congress. Copyright Office


eshellman commented 1 year ago

Luckily, there are no author names that include ';'.
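
To illustrate why that matters, here is a small sketch (the helper names `join_creators` and `split_creators` are hypothetical) showing that database-form names survive a round trip through a semicolon-separated string, whereas splitting on commas would break names of the form "Lastname, Firstname":

```python
def join_creators(names):
    """Join database-form names with '; ', rejecting names containing ';'."""
    for name in names:
        if ";" in name:
            raise ValueError(f"name contains ';': {name!r}")
    return "; ".join(names)


def split_creators(text):
    """Split a semicolon-separated creator string back into names."""
    return [part.strip() for part in text.split(";") if part.strip()]


names = [
    "Du Bois, W. E. B. (William Edward Burghardt)",
    "John Murray (Firm)",
    "Library of Congress. Copyright Office",
]
joined = join_creators(names)

# The semicolon round trip recovers the original list.
assert split_creators(joined) == names

# Splitting on commas does not: "Du Bois" is cut off from the rest.
assert [part.strip() for part in joined.split(",")] != names
```

The explicit check in `join_creators` guards against a future name that does contain a semicolon.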

gbnewby commented 1 year ago

On the creator list (the few comments above this one): I looked in the code, and it seems we are only using the dc.authors field. I.e., generated HTML only includes the Authors, not the Illustrators, Editors or Translators.

I would like those other types of creators to be presented when they exist (there are some other creator types that could be of interest but seem less important; it might be just as easy to list all creator types when they exist).

Simulated example building on the prior comments above:

Authors: John Murray (Firm); Du Bois, W. E. B. (William Edward Burghardt)
Illustrator: Caine, Hall, Sir
Editors: Plato; Socrates

... you get the idea? When there is more than one, use "Authors" instead of "Author." When there are multiple, separate with a semicolon. Only list Author/Illustrator/Editor/Translator when those fields exist.

Thanks for considering. I was going to make a PR myself but the data structure for the creators (like dc.authors), and the pstyle() function for formatting, made it more challenging than just updating the iteration loops.

eshellman commented 1 year ago

dc.authors is a list of author objects, each of which has a name, a creator role, a birthdate and a deathdate. So editors, translators etc are in fact listed. The list is ordered so separating out the creators by roles may re-order the list of creators. We use the MARC list for roles. The importance of a role may depend on the type of work.

gbnewby commented 1 year ago

Ok, then it sounds like this will require first aggregating each type of role then emitting them in groups like my example.
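
A minimal sketch of that aggregate-then-emit step, assuming the creators arrive as `(role, name)` pairs in display order. The `consolidate` function and the `PLURALS` map are hypothetical and do not reflect the actual `dc.authors` structure or MARC role list:

```python
# Hypothetical plural labels. An explicit map is used instead of
# appending "s", since that does not work for every role name.
PLURALS = {
    "Author": "Authors",
    "Editor": "Editors",
    "Illustrator": "Illustrators",
    "Translator": "Translators",
}


def consolidate(creators):
    """Group (role, name) pairs by role and emit one line per role."""
    grouped = {}
    for role, name in creators:
        # Roles keep the order in which they first appear.
        grouped.setdefault(role, []).append(name)

    lines = []
    for role, names in grouped.items():
        # Fall back to the singular label when no plural is known.
        label = PLURALS.get(role, role) if len(names) > 1 else role
        joined = "; ".join(names)
        lines.append(f"{label}: {joined}")
    return lines


creators = [
    ("Author", "Georgette Leblanc"),
    ("Author", "Maurice Maeterlinck"),
    ("Editor", "Frederick Orville Perkins"),
    ("Translator", "Alexander Teixeira de Mattos"),
]
for line in consolidate(creators):
    print(line)
```

Note that grouping changes the overall order whenever roles are interleaved in the input (for example author, editor, author), which is the re-ordering concern raised above.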

eshellman commented 1 year ago

It looks like this is already done - look at https://gutenberg.org/ebooks/830

gbnewby commented 1 year ago

Indeed, the multiple roles seem to be listed. But they are not consolidated as in my example above.

See for example: https://gutenberg.org/cache/epub/27991/pg27991-images.html

The creator listing in the HTML5:

Author: Georgette Leblanc
Author: Maurice Maeterlinck
Editor: Frederick Orville Perkins
Translator: Alexander Teixeira de Mattos

But should be:

Authors: Georgette Leblanc; Maurice Maeterlinck
Editor: Frederick Orville Perkins
Translator: Alexander Teixeira de Mattos

eshellman commented 1 year ago

How important is consolidation? Consolidation adds considerable complexity to the code and makes it harder for downstream users to parse, or to translate, when that is desired. Currently we only need the singular form of the role name in English. Adding 's' fails for 2 of the roles in our db. We do not consolidate roles in the bibrec page, which would not be able to use the same consolidation code. Issues related to line wrapping are exacerbated.

gbnewby commented 1 year ago

I think consistency is extremely important, and that already has been part of our discussion. The format I suggested is what's been used for a very long time. The main variation is the representation of names in the database isn't always great for presenting as-is, but this issue exists regardless of consolidation.

My suggested format should be straightforward algorithmically. We can discuss as desired to ensure challenges can be addressed.

The bibrec section is a table, and the creators are presented as a hyperlink in that table. I don't mind consolidating those records, but don't see it as needed for consistency. Since it's tabular data, a single key-value pair makes sense to me - i.e., status quo.

The book itself is targeted at human readers, and it seems obvious to me that consolidation is a friendlier way of presenting the creator roles.

Anyone trying to scrape the books for metadata is going down a pathway that we don't support. But even if they did, I don't see how the recommended format is less amenable to automation than the current non-consolidated format. ~ Greg


eshellman commented 1 year ago

In particular, our own author parsing code (used by e.g. online ebookmaker) doesn't handle the suggested consolidated format correctly and would need to be revised. The format that ebookmaker expects (not my code) is comma-delimited, not semicolon-delimited. Yes, we need to be able to scrape our own files.

It's ironic that you use 27991 as an example, because the book has only one author, Georgette Leblanc (a.k.a. Madame Maurice Maeterlinck) https://en.wikipedia.org/wiki/Georgette_Leblanc