Closed gbnewby closed 1 year ago
OK, I'll work from this PR; will merge the other two PRs with this one if needed.
@gbnewby : On the wrapping of the START and END lines, it appears to me that you want
*** START OF THE PROJECT GUTENBERG EBOOK Long, wordy and really verbose ***
to be replaced with
*** START OF THE PROJECT GUTENBERG EBOOK Long, wordy and really verbose ***
The way this is implemented results in the entire file being wrapped, which mangles a lot of ASCII-formatted material in the backfile: tables, poetry, files where the line breaks reproduce the line breaks of the original text, and any txt file with lines longer than 72 characters. Note that even when the text has been wrapped at 72, soft breaks, spelling corrections, etc., may leave some lines longer than 72. I'm pretty sure you didn't mean to do this.
Line-breaking titles also makes it harder for downstream users to identify/extract the "sentinel" lines, because they would need a selective multiline grep.
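To illustrate the mangling, here is a minimal sketch using Python's standard `textwrap` module. It is not the code from the PR; the two verse lines are just sample input.

```python
import textwrap

poem = (
    "Tyger Tyger, burning bright,\n"
    "In the forests of the night;\n"
)

# textwrap.fill treats the hard line breaks as ordinary whitespace,
# so the two verse lines are joined into a single line under 72 columns.
rewrapped = textwrap.fill(poem, width=72)
print(rewrapped)
```

Both verse lines fit within 72 columns on their own, so a whole-file rewrap changes text that never needed wrapping.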
Similarly, wrapping metadata is problematic, especially with intentionally multiline titles.
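A rough sketch of the downstream concern, assuming the sentinel format shown in the examples in this thread. `START_RE` and `find_start_title` are hypothetical names, not part of ebookmaker.

```python
import re

# Assumed sentinel format, copied from the examples in this thread.
START_RE = re.compile(
    r"^\*\*\* START OF THE PROJECT GUTENBERG EBOOK (.+?) \*\*\*$"
)

def find_start_title(text):
    """Return the title from a single-line START sentinel, or None."""
    for line in text.splitlines():
        match = START_RE.match(line)
        if match:
            return match.group(1)
    return None

unwrapped = (
    "Language: English\n\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK THE STORY OF CHAMBER MUSIC ***\n"
)
wrapped = (
    "Language: English\n\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK THE STORY OF CHAMBER\n"
    "MUSIC ***\n"
)

print(find_start_title(unwrapped))
print(find_start_title(wrapped))
```

A line-by-line match finds the unwrapped sentinel but misses the wrapped one, which is why wrapping pushes downstream users toward multiline matching.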
For concatenating "most recently updated": when an update is present it will always need wrapping, in which case maybe it's better to have the "updated" on a new line?
I have implemented all the objectives except wrapping. I can easily add re-wrapping and indenting for the credits and all the metadata, if that is strongly desired. My guess is that wrapped data will be perceived as an error for several percent of the backfile and will result in a maintenance burden.
Here is an example output for the header:
The Project Gutenberg eBook of The story of chamber music
This ebook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this ebook or online
at www.gutenberg.org. If you are not located in the United States,
you will have to check the laws of the country where you are located
before using this eBook.
Title: The story of chamber music
Author: Nicholas Kilburn
Release date: March 4, 2023 [eBook #70203] Most recently updated: August 3, 2023
Language: English
Credits: Andrew Sly, MFR, Linda Cantoni, and the Online Distributed Proofreading Team at https://www.pgdp.net (This file was produced from images generously made available by The Internet Archive)
*** START OF THE PROJECT GUTENBERG EBOOK THE STORY OF CHAMBER MUSIC ***
Here's the input:
The Project Gutenberg eBook of The story of chamber music, by
Nicholas Kilburn
This eBook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this eBook or online at
www.gutenberg.org. If you are not located in the United States, you
will have to check the laws of the country where you are located before
using this eBook.
Title: The story of chamber music
Author: Nicholas Kilburn
Release Date: March 4, 2023 [eBook #70203]
Language: English
Produced by: Andrew Sly, MFR, Linda Cantoni, and the Online Distributed
Proofreading Team at https://www.pgdp.net (This file was
produced from images generously made available by The
Internet Archive)
*** START OF THE PROJECT GUTENBERG EBOOK THE STORY OF CHAMBER
MUSIC ***
The TOC has entries like
Russian chamber music—Glinka—Quartett by Ippolitoff-Ivanoff—Quartett
by Gretchaninoff—Mozart on melody—Russian
schools of musical thought—Belaieff—String Quartett on
name Belaieff—Arensky—Trio in D minor: Arensky—Sokoloff—Tanyeëff—
Kopyloff—Tschaïkovsky 133
which I don't think will look good re-wrapped.
Also, this fixes a problem with missing whitespace in multiline metadata.
Also, instead of re-wrapping metadata, we can preserve the original line breaks; currently these are stripped. The original whitespace (but not line breaks) MUST be stripped during metadata acquisition.
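One way this could look, as a sketch only: `normalize_metadata_value` and `to_single_line` are hypothetical helpers, not the functions in the PR. The first keeps the line breaks and strips the surrounding whitespace; the second joins lines with a space so that words on adjacent lines don't run together.

```python
def normalize_metadata_value(raw):
    """Collapse spaces/tabs within each line, but keep the line breaks."""
    cleaned = [" ".join(line.split()) for line in raw.splitlines()]
    # Drop lines that are empty once the whitespace is stripped.
    return "\n".join(line for line in cleaned if line)

def to_single_line(raw):
    """Join a multiline value with single spaces (no missing whitespace)."""
    return " ".join(raw.split())

raw_credits = (
    "Andrew Sly, MFR, Linda Cantoni, and the Online Distributed\n"
    "    Proofreading Team at https://www.pgdp.net   \n"
)
print(normalize_metadata_value(raw_credits))
print(to_single_line(raw_credits))
```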
This approach sounds fine. I'd like to see it in action, and Roger can supply some test cases with long lines - just ask.
Wrapping: Basically, you are saying that wrapping just the metadata isn't feasible. Definitely we do not want to rewrap the .txt - that would be perilous.
Metadata: Preserving line endings in metadata seems like a good approach.
Left-align metadata: We do want to get the indentation "right" in metadata. Basically the whole header should be left-justified, EXCEPT that a "Most recently updated: " line, if present, should be (a) indented, and (b) directly below the Posting date (no extra blank line).
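A small sketch of that rule, under the assumption of an 8-space indent; `release_lines` is a hypothetical helper, not ebookmaker's API.

```python
INDENT = " " * 8  # assumed indent width

def release_lines(release_date, ebook_number, updated=None):
    """Build the release-date line, plus an indented update line if present."""
    lines = [f"Release date: {release_date} [eBook #{ebook_number}]"]
    if updated:
        # Directly below the release date, with no blank line in between.
        lines.append(f"{INDENT}Most recently updated: {updated}")
    return lines

print("\n".join(release_lines("March 4, 2023", 70203, "August 8, 2023")))
```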
Sentinel: I think it's tolerable to have the sentinel line unwrapped. This should include only the main title, not the subtitle or author (same as the first line in your example). I think I prefer the sentinel lines starting with *** to be left-justified rather than indented or centered (and I don't know what centered even means, for lines that are longer than 72-80 characters).
Here's what it looks like with 'updated' on its own line, line breaks preserved on the credits, and added a subtitle to show how that looks.
The Project Gutenberg eBook of The story of chamber music
This ebook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this ebook or online
at www.gutenberg.org. If you are not located in the United States,
you will have to check the laws of the country where you are located
before using this eBook.
Title: The story of chamber music
Complete with subtitles!
Author: Nicholas Kilburn
Release date: March 4, 2023 [eBook #70203]
Most recently updated: August 8, 2023
Language: English
Credits: Andrew Sly, MFR, Linda Cantoni, and the Online Distributed
Proofreading Team at https://www.pgdp.net (This file was
produced from images generously made available by The
Internet Archive)
*** START OF THE PROJECT GUTENBERG EBOOK THE STORY OF CHAMBER MUSIC ***
If that looks good, I'll merge it and deploy soon, as I found some bugs while reviewing the recent conversion logs.
Thanks for this. What a great improvement!
I'd like to see one with more creators, to confirm how indentation and ordering is done. 69330 might be a good choice.
This is a great opportunity to strive for maximum consistency between generated text and generated HTML.
Is it viable to have the same hanging indentation for text as for HTML? For example, I'm looking at this pair: https://www.gutenberg.org/cache/epub/69330/pg69330-images.html https://www.gutenberg.org/files/69330/69330-0.txt
I'm personally not that keen on hanging indents that vary to align with the first line (like 9 spaces for Title and 11 spaces for Credits). I would be happier to either have everything left-justified, or use tabs or a regular 8-space indent. I understand that wrapping for text is perilous, and that for credits we want to not rewrap whatever comes out of the catalog database. It seems the pg-machine-header CSS does this; the latest text above in this issue does a fixed indent after the first line. In other words, what you have for indentation might be just fine if a fixed-length hanging indent + wrapping isn't feasible.
For the first visible line of the file, I think we have settled on just using the title. No subtitle or any of the authors/creators. This is for a cleaner look, and also because the subtitle and authors/creators often look weird (like in 69330 where we miss the first author entirely). This is what you've done for the text example, and it's desirable for the HTML (let me know if you'd like me to create a new issue for that).
Summing up my comments: I'd like to see an example with multiple creators. Everything else above is for possible consideration.
Right, currently there are 8 space indents for txt and 7.5ex indents for html. The simple title in the first line is what's in production (since July 20) https://github.com/gutenbergtools/ebookmaker/commit/3c6077b4ddac6a022cbf4990747f2556fa7ba141
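For reference, the fixed hanging indent for txt could be sketched like this. `hanging_indent` is a hypothetical helper and the real implementation may differ; it keeps the value's existing line breaks and only prefixes the continuation lines.

```python
def hanging_indent(label, value, indent=8):
    """Format 'Label: value', indenting continuation lines by a fixed width."""
    lines = value.splitlines() or [""]
    first = f"{label}: {lines[0]}"
    rest = [" " * indent + line for line in lines[1:]]
    return "\n".join([first] + rest)

credits = (
    "Andrew Sly, MFR, Linda Cantoni, and the Online Distributed\n"
    "Proofreading Team at https://www.pgdp.net"
)
print(hanging_indent("Credits", credits))
```

The indent is the same for every field, regardless of the label length.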
@gbnewby It still needs some work to get the multiple authors to look right. One question:
Title: The story of chamber music
Complete with subtitles!
Author: Nicholas Kilburn
Second Author
Illustrator: Pablo Picasso
Second Illustrator
Do we want a vertical space between different creator types? It might be a bit tricky.
Ok, the indentation looks good.
I think it's ok to not have an additional vertical space (blank line) between different creator types. The main things are to (a) list authors first, and (b) group them under each creator type, rather than repeating Author: Author: Illustrator: Illustrator when there are multiples.
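Roughly, the grouping could work like the sketch below. The `(role, name)` input shape and the `format_creators` name are assumptions for illustration, not how ebookmaker stores creators.

```python
def format_creators(creators, indent=8):
    """Group (role, name) pairs by role, authors first, one label per role."""
    grouped = {}
    for role, name in creators:
        grouped.setdefault(role, []).append(name)
    # Stable sort: "Author" moves to the front, other roles keep
    # the order in which they first appeared.
    roles = sorted(grouped, key=lambda role: role != "Author")
    lines = []
    for role in roles:
        names = grouped[role]
        lines.append(f"{role}: {names[0]}")
        lines.extend(" " * indent + name for name in names[1:])
    return lines

creators = [
    ("Illustrator", "Pablo Picasso"),
    ("Author", "Nicholas Kilburn"),
    ("Author", "Second Author"),
    ("Illustrator", "Second Illustrator"),
]
print("\n".join(format_creators(creators)))
```

No blank line is emitted between creator types, matching the preference above.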
I'm happy with this; tested various scenarios. Will test on v12, expect to deploy later today.
Detail on what this series of 3 PRs is trying to fix:
header line does not wrap.
excessive, irregular vertical spacing before the metadata.
none of the metadata wraps and has inconsistent vertical spacing.
“Most recently updated” is concatenated to the Release date line instead of being on a second line, indented.
START OF line does not wrap.
END OF line does not wrap.
in the footer, there are two vertical spaces between paragraphs, should be one.
the two lines after START: FULL LICENSE should each have one space before the line.
"Section 1.” line needs to be on two lines in the source to show up as two lines in the text.
Same for “Section 4.” line. Text comes from stripping tags, so two lines in text requires two lines in the source file for the page.
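As an illustration of the footer spacing item only, here is a minimal sketch that collapses runs of two or more blank lines into one. `collapse_blank_lines` is a hypothetical helper, and the PRs may fix this in the source template instead.

```python
import re

def collapse_blank_lines(text):
    """Reduce two or more consecutive blank lines to a single blank line."""
    # A blank line may still contain spaces or tabs, so allow those.
    return re.sub(r"\n(?:[ \t]*\n){2,}", "\n\n", text)

footer = "Paragraph one.\n\n\nParagraph two.\n \n\nParagraph three.\n"
print(collapse_blank_lines(footer))
```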