gutenbergtools / ebookmaker

The Project Gutenberg tool to generate EPUBs and other ebook formats.
GNU General Public License v3.0

implement pdf output with pagedjs #215

Open eshellman opened 4 months ago

eshellman commented 4 months ago

https://pagedjs.org/

eshellman commented 4 months ago

First try, with zero customization, is pretty impressive: https://www.dropbox.com/scl/fi/csg704t7252c0l8jgfp07/10636.pdf?rlkey=wlhpem4xpehflhin0w7i96obo&dl=0 There is an issue with the backlinks on the citations, probably caused by absolute positioning. The file was generated from https://www.gutenberg.org/cache/epub/10636/pg10636-images.html

eshellman commented 4 months ago

removing the position: absolute rules fixes the only problem I see here.

asylumcs commented 4 months ago

does that fix the footnotes as well?

eshellman commented 4 months ago

Yes, removing the position:absolute from the text's css fixes the example above.

We probably want to add header and footer text as described here: https://pagedjs.org/documentation/7-generated-content-in-margin-boxes/ I will start with "Project Gutenberg, https://gutenberg.org/ebooks/#####" on the bottom and the book's make_pretty_title(size=80) on the top. I'm thinking that page numbers are going to be confusing, so better to leave them out? Also, the default config has a gutter; I think that should be omitted.

Are people going to want different paper sizes?
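
A minimal sketch of how the header and footer text could be generated, written in Python on the assumption that ebookmaker would emit the stylesheet itself. The `@page` rule, its `size` descriptor, and the `@top-center` / `@bottom-center` margin boxes are standard CSS paged media features covered by the pagedjs documentation linked above. The names `page_css`, `css_string`, and `truncate`, and the `paper_size` parameter, are hypothetical; `truncate` only stands in for whatever `make_pretty_title(size=80)` actually does.

```python
def css_string(text):
    """Quote text as a CSS string literal.

    Whitespace (including newlines) is collapsed, then backslashes and
    double quotes are escaped so the result is safe inside content: "...".
    """
    collapsed = " ".join(text.split())
    escaped = collapsed.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def truncate(text, size=80):
    """Hypothetical title shortener: cut to at most `size` characters."""
    if len(text) <= size:
        return text
    return text[: size - 1].rstrip() + "\u2026"


def page_css(title, ebook_no, paper_size="letter"):
    """Build an @page rule with a running head and a running foot."""
    footer = "Project Gutenberg, https://gutenberg.org/ebooks/%d" % ebook_no
    return "\n".join([
        "@page {",
        "    size: %s;" % paper_size,
        "    @top-center { content: %s; }" % css_string(truncate(title)),
        "    @bottom-center { content: %s; }" % css_string(footer),
        "}",
    ])


if __name__ == "__main__":
    print(page_css('The "Example" Book', 10636, paper_size="A4"))
```

Making the paper size a parameter keeps the question above open: the same function could emit `letter`, `A4`, or any other value the `size` descriptor accepts.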

eshellman commented 4 months ago

46419.pdf

This example, chosen because it has music, is definitely less polished. We see that all the links to sound and pdf are removed (nice trick!) but also there are some missing images, including half the score on page 148. See https://gutenberg.org/cache1/epub/46419/46419-h.html#Page_132

eshellman commented 4 months ago

55215.pdf

This one looks very good - I've added the header and footer text, and made sure we have page breaks after/before our boilerplate header/footer.

eshellman commented 4 months ago

There's something funny with measurements in pagedjs. I can't zero the left margin, and images get dropped even though they should fit. I think I've tricked it into working. Here's a sample output for the most recently released book: pg72955.pdf

gbnewby commented 4 months ago

72955 looks good. The only anomaly I noticed is at the very end: the Footnotes have some over-striking.

I think adding generated page numbers in the footer would make sense. If nothing else, it will help people keep track of what page they were on. Of course, those books with paginated indices will be totally wrong - but they usually also have named anchor hyperlinks to the right place in the books.

If possible, I think a nice header/footer would be something like this:

Header: Chapter title (centered, possibly truncated)

Footer: Project Gutenberg (left), book title (centered, possibly truncated), page number (right)

... though I'm not sure whether the footer will look too busy? Page number could go top right, instead.

... we could also consider different header/footer combinations for alternate pages, like printed books often have.

eshellman commented 4 months ago

It took a while to figure out how to fix the overlapping text in 10636. Good to learn more about pagedjs. There was also a problem with margins, which turned out to be the same issue.

pagedjs works by manipulating the page's DOM using flexbox. Each page works like a column in a very, very wide page. As a result, the body element, and styles attached to the body element, don't work properly. To fix this, we'll need to remove all CSS properties from the body element and re-attach them to the div element introduced by the pagedjs script's DOM manipulation. So in 10636, for example, we need to replace:

body {
    margin-left: 10%;
    margin-right: 10%
    }

with:

@media screen {
body {
    margin-left: 10%;
    margin-right: 10%
    }}
.pagedjs_page_content > div {
    margin-left: 10%;
    margin-right: 10%
    }

So ebookmaker will need to do a bit of work so that our files will render beautifully.
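
A rough sketch of that rewrite in Python, assuming the stylesheet is available as a string. It only handles a plain, standalone `body { ... }` rule with no nested braces; grouped selectors such as `body, p`, or a `body` rule that already sits inside a media query, would need a real CSS parser. The function name `move_body_rules` and the regular expression are illustrative, not part of ebookmaker.

```python
import re

# Matches a simple `body { ... }` rule. The lookbehind keeps selectors such
# as `tbody`, `.body` or `#body` from matching.
BODY_RULE = re.compile(r"(?<![\w.#-])body\s*\{([^{}]*)\}")

# The element pagedjs wraps around the page content, as described above.
PAGED_SELECTOR = ".pagedjs_page_content > div"


def move_body_rules(css):
    """Keep body declarations for screens and re-attach them for pagedjs."""

    def rewrite(match):
        declarations = match.group(1).strip()
        return (
            "@media screen {\nbody {\n    %s\n    }}\n"
            "%s {\n    %s\n    }" % (declarations, PAGED_SELECTOR, declarations)
        )

    return BODY_RULE.sub(rewrite, css)


if __name__ == "__main__":
    print(move_body_rules("body {\n    margin-left: 10%;\n    margin-right: 10%\n    }"))
```

Run on the 10636 rule, this prints the same two-rule replacement shown above.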

eshellman commented 4 months ago

In another text, I discovered that pagedjs has problems with `overflow: auto`. We'll need to change those to `overflow: visible` with media query rules. This is to be expected because PDFs don't have scroll bar boxes!
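
One way this could be sketched in Python: scan the book's CSS for rules that set `overflow: auto` and emit overriding rules in a media query block. Using `@media print` as the query is an assumption, since the thread does not say which query is intended, and the regular expressions only cope with simple, flat rules. The name `print_overflow_overrides` is hypothetical.

```python
import re

# A simple rule: a selector followed by a declaration block without nesting.
RULE = re.compile(r"([^{}]+)\{([^{}]*)\}")

# overflow, overflow-x or overflow-y set to auto.
OVERFLOW_AUTO = re.compile(r"overflow(-[xy])?\s*:\s*auto\b")


def print_overflow_overrides(css):
    """Return a media query block that resets every `overflow: auto` rule."""
    overrides = []
    for selector, declarations in RULE.findall(css):
        if OVERFLOW_AUTO.search(declarations):
            overrides.append("    %s { overflow: visible; }" % selector.strip())
    if not overrides:
        return ""
    return "@media print {\n" + "\n".join(overrides) + "\n}"


if __name__ == "__main__":
    sample = "pre.code {\n  overflow: auto;\n}\np { margin: 0 }"
    print(print_overflow_overrides(sample))
```

Appending the generated block after the book's own CSS leaves the on-screen scroll boxes alone and only changes the paged rendering.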

eshellman commented 4 months ago

Performance will be an issue as well; most likely we'll want to run pdf re-rendering separately from our other ebook production.

gbnewby commented 4 months ago

This is all sounding great to me. Thanks for perseverance on all the nuances.

I'm not worried about keeping our computers busy with rendering, and agree that we might want to separate PDF processing from the other jobs that make generated content.

eshellman commented 4 months ago

Running head chapter titles is doable, but probably not for most of the backfile. The suggested way is to use heading elements, for example, h2. Unfortunately the backfile is inconsistent with the use of headings, for example by using multiple h2 elements to make line breaks. So we would get errata reports from this.
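
To get a feel for how often the backfile would trip this up, a small audit along these lines might help. It is only a sketch using Python's standard `html.parser`: it counts empty h2 elements and h2 elements that directly follow another h2 with no text in between, the two patterns that would give a blank or misleading running head. The class and function names are made up for illustration.

```python
from html.parser import HTMLParser


class HeadingAudit(HTMLParser):
    """Collect h2 texts and count h2 elements that follow one another."""

    def __init__(self):
        super().__init__()
        self.headings = []            # text of each h2, in document order
        self.adjacent = 0             # h2 directly after another h2
        self._in_h2 = False
        self._buffer = []
        self._text_since_h2 = True    # any body text since the last h2?

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            if self.headings and not self._text_since_h2:
                self.adjacent += 1
            self._in_h2 = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "h2" and self._in_h2:
            text = " ".join("".join(self._buffer).split())
            self.headings.append(text)
            self._in_h2 = False
            self._text_since_h2 = False

    def handle_data(self, data):
        if self._in_h2:
            self._buffer.append(data)
        elif data.strip():
            self._text_since_h2 = True


def audit_h2(html):
    """Return (count of empty h2 elements, count of back-to-back h2 elements)."""
    parser = HeadingAudit()
    parser.feed(html)
    parser.close()
    empty = sum(1 for text in parser.headings if not text)
    return empty, parser.adjacent
```

A book where both counts are zero is a plausible candidate for chapter-title running heads; anything else could fall back to the book title.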

Prospectively, we can certainly do this, by asking submitters for specific markup for chapter titles. So in my tests, I've used the book title in the running head, using the version of the title that omits subtitle.

For the footer, I've been trying "Project Gutenberg, ", but the method for getting the URL for the book is currently not working.

Page numbers are tricky, and need discussion. Many books include original page numbers with reasonably uniform markup, and these could be printed in the side margin, for example. If we print pdf page numbers there will be producers who want the front and back matter numbered separately.

Crazy idea: maybe print percentages?

eshellman commented 4 months ago

Or even crazier, a percentage bar? (not hard in css)

gbnewby commented 4 months ago

For headers/footers, perhaps we could have an optimal approach based on best practices, and then a couple of lesser approaches when the HTML markup isn't regular enough.

The idea of doing a running chapter header when a heading element such as h2 is present feels great for an optimal approach.

Putting the embedded print page numbers in the right margin is definitely desirable. The Kobo e-reader does that already, actually (though the footer page numbers are not accurate due to some dynamic between EPUB and kEPUB formats or something else...).

I realize it adds complexity to have a couple of fallback methods for headers & footers. It seems the complexity might be worth it, though, since we'll end up with many books that have a fantastic look.

eshellman commented 4 months ago

OK here's a sample with page numbers and running heads and foots (first 300 pages) 10636.pdf

For this book, h3 would have been better for heads, but we can only pick one thing. An empty h2 sets the head empty for 200 pages or so.

There's some text overlap on pp. 54-57, but overall I think this is spectacular!

gbnewby commented 4 months ago

Looks great!

I had suggested earlier, for EPUB, that the first page should be the cover image.

Then, boilerplate can be the 2nd page .. i.e., a verso page

gbnewby commented 4 months ago

Also, I don't think you need PG in both the header and footer. Just footer is enough

eshellman commented 4 months ago

Also, I don't think you need PG in both the header and footer. Just footer is enough

I agree, but the PG comes from the first H2 and disappears after the front matter. The running text selectors are currently rather limited.

eshellman commented 4 months ago

Looks great! I had suggested earlier, for EPUB, that the first page should be the cover image. Then, boilerplate can be the 2nd page .. i.e., a verso page

If it were easy, it would have already been done.

eshellman commented 4 months ago

This is not nearly as good as I thought; there's a vertical margin problem that is dropping the bottom two lines across many page breaks.

eshellman commented 4 months ago

Turns out I found a bug with use of blockquote. I think the effort of chasing down the many problems exposed by PG's use will result in big benefits for book production in general.

gbnewby commented 4 months ago

I like the direction this is heading.

tangledhelix commented 4 months ago

Crazy idea: maybe print percentages?

That's not crazy at all - for example Kindles do exactly that.

I couldn't remember the exact behavior so I grabbed my Kindle Paperwhite just now, where I'm currently smooth-reading. In the footer it displays a page number in the lower left, and a percentage in the lower right.

The percentage is surely auto-calculated by the device. The page number doesn't change on every page-turn, and sometimes a page number is skipped over. I assume page numbers come from the epub3 file. Perhaps the page number that's in effect at the beginning of the current viewport.

Here's an example, FWIW. I would guess Kindle isn't unique in this behavior, but I don't have other devices to check.

IMG_0794