gutenbergtools / ebookmaker

The Project Gutenberg tool to generate EPUBs and other ebook formats.
GNU General Public License v3.0

reflow credits for .txt #220

Open gbnewby opened 8 months ago

gbnewby commented 8 months ago

An alert reader noticed some anomalies in how generated .txt files are laid out:

For that last point, I know that some of our 508 fields have embedded CR/LF. But for more recent books, 508 should be just one line. So, wrapping would seem safe in those circumstances.

Thanks.

eshellman commented 8 months ago

I'd appreciate an annotated example.

As the START and END markers are important for downstream automated use, I don't want to wrap them.

I have no way to apply different formatting to "recent books".

Will implement first two suggestions if they're easy.

eshellman commented 8 months ago

they were easy

gbnewby commented 8 months ago

Thanks for that.

I understand about not wrapping the sentinel lines. I'll probably forget in the future and ask again - apologies in advance.

For the credits, how about reflowing what's in the 508 field rather than blindly rewrapping? Reflowing to 72-80 characters wide will fix any embedded CR/LF in the 508 fields.

eshellman commented 8 months ago

In what situations will a user's text viewer not reflow the text unless desired?

eshellman commented 8 months ago

remember that there are urls in the credits which can break on reflow.

gbnewby commented 8 months ago

Reflow: In the link I sent earlier, it was not reflowed in the viewer I used (Firefox). I don't see how a viewer would know to reflow to 70-80 characters like the rest of the file.

The anomalous appearance reported, which I confirmed, is that the whole file comes with line lengths as expected (i.e., 70-80 characters more or less), EXCEPT for the sentinel lines and credits.

For URLs: I was thinking that reflowing would only insert line breaks on whitespace, not on punctuation, i.e., like the Unix "fold -s" command or a /\s+/ regular expression. A Unixy way to do this would be to put the input through a sequence like tr -s '[:space:]' ' ' | fold -s, but slightly more intelligently to handle different line endings.
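The whitespace-only reflow described above could be sketched in Python roughly as follows. This is a minimal illustration, not ebookmaker's actual code: the function name `reflow_credits` and the default width of 72 are assumptions made for the example. It collapses every whitespace run (including CR and LF) to a single space, then wraps with the standard library's `textwrap`, with the options that would otherwise split a long token such as a URL turned off.

```python
import re
import textwrap


def reflow_credits(text: str, width: int = 72) -> str:
    """Hypothetical helper: reflow a credits string on whitespace only."""
    # Collapse any run of whitespace, including CR/LF, to a single space.
    single_line = re.sub(r"\s+", " ", text).strip()
    return textwrap.fill(
        single_line,
        width=width,
        break_long_words=False,  # never cut a long token such as a URL
        break_on_hyphens=False,  # never break at a hyphen inside a token
    )
```

A token longer than the requested width (a long URL, for instance) ends up on a line of its own and overruns the margin instead of being split.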

On Fri, Mar 1, 2024 at 10:19 AM Eric Hellman wrote:

remember that there are urls in the credits which can break on reflow.


eshellman commented 8 months ago

post the link, please.

Why would someone use firefox for plain text?

My view is that messing with linebreaks in metadata inherently lowers its value for consumers of metadata. There needs to be other value added that outweighs the text pollution.

gbnewby commented 8 months ago

post the link, please.

Here's one. The behavior seems consistent with what I originally reported in this issue: https://www.gutenberg.org/cache/epub/70889/pg70889.txt

As with all recent books, the 508 field doesn't have any embedded CR/LF combinations. Rewrapping those should be trivial.

Rewrapping older 508 catalog entries that have embedded CR/LF combinations would require reflowing to a single line prior to rewrapping. But see below for a mitigation approach.

Why would someone use firefox for plain text?

That's not really the point. The point is that the entire book is wrapped at 70-80 characters, per PG's usual practice.

Except for the credits and sentinel line.

I understand the logic in not wrapping the sentinel line. I think wrapping the credit line is desirable so it's aligned with the margins in the rest of the book.

Reflowing within the viewer isn't really an issue. If you shrink your viewing window smaller than the margins, things are going to look pretty ragged.

My view is that messing with linebreaks in metadata inherently lowers its value for consumers of metadata. There needs to be other value added that outweighs the text pollution.

We're not changing the metadata, we're changing the generated output.

However, one thing we could do is reflow all the 508 fields in the back catalog to remove extraneous CR/LF combinations. Then, all EBM would do is rewrap (on whitespace boundaries), without needing to reflow.
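A one-time cleanup of that kind might look something like the sketch below. It assumes the 508 values are available as plain Python strings; how they are read from and written back to the catalog is left out, and the names `normalize_508` and `has_embedded_breaks` are hypothetical.

```python
import re

# A CR, LF, or CRLF together with any whitespace immediately around it.
_LINE_BREAKS = re.compile(r"\s*(?:\r\n|\r|\n)\s*")


def has_embedded_breaks(value: str) -> bool:
    """True if a 508 value still contains a CR or LF."""
    return "\r" in value or "\n" in value


def normalize_508(value: str) -> str:
    """Replace each embedded line break with one space, giving a single line."""
    return _LINE_BREAKS.sub(" ", value).strip()
```

Values that are already a single line pass through unchanged, so it would be safe to run over the whole catalog more than once.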

I see a lot of value in consistency. Editing metadata is something we do frequently, and if we were to update the 508 fields it would be a gift to downstream consumers of greater consistency.


eshellman commented 8 months ago

added the reflow for 508 to v13 todo list. Due to implementation details, it's not "easy".

gbnewby commented 8 months ago

Thanks.

On Fri, Mar 1, 2024 at 12:32 PM Eric Hellman wrote:

added the reflow for 508 to v13 todo list. Due to implementation details, it's not "easy".
