gutenbergtools / libgutenberg

Common files used by Project Gutenberg python projects.
GNU General Public License v3.0

Punctuation truncated inappropriately in metadata block #42

Closed: gbnewby closed this issue 5 months ago

gbnewby commented 6 months ago

It looks like this is happening in generated files, as well as landing pages, so I'm guessing the issue is in libgutenberg rather than in ebookmaker or autocat3.

The reported issue is that punctuation was incorrectly removed from a catalog title.

In the catalog (correct, including the period after "Inc."): 245 - Title Statement 4 The girl from Bodies, Inc.

The title line from https://www.gutenberg.org/cache/epub/73523/pg73523-images.html: The Project Gutenberg eBook of The girl from Bodies, Inc (loses the trailing period on the abbreviation)

The Title line: Title: The girl from Bodies, Inc

The START OF line: START OF THE PROJECT GUTENBERG EBOOK THE GIRL FROM BODIES, INC

Summary: The catalog is correct. The generated HTML is not, and neither is the generated UTF-8 text.

eshellman commented 6 months ago

Because title-ending periods are customary in library catalog records, trailing periods in the title are removed. I suggest that this be fixed in the catalog by replacing the trailing period with Unicode U+FF0E "fullwidth full stop" or Unicode U+2024 "one dot leader". Or possibly there is a library catalog convention for this case. We should let the cataloguer decide.
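To make the behavior and the proposed workaround concrete, here is a minimal sketch. The function `strip_trailing_period` is a hypothetical stand-in, not the actual libgutenberg code, and it assumes the cleanup drops a single trailing ASCII period. It shows why the abbreviation loses its period, and why a look-alike character such as U+FF0E or U+2024 would pass through untouched.

```python
def strip_trailing_period(title: str) -> str:
    """Hypothetical stand-in for the cleanup described above.

    Assumption: exactly one trailing ASCII period (U+002E) is dropped.
    """
    return title[:-1] if title.endswith(".") else title


catalog_title = "The girl from Bodies, Inc."

# The ASCII period on the abbreviation is removed along with the
# customary catalog-record period.
stripped = strip_trailing_period(catalog_title)
print(stripped)

# The look-alike characters suggested above are not ASCII periods,
# so this kind of cleanup leaves them in place.
FULLWIDTH_FULL_STOP = "\uFF0E"
ONE_DOT_LEADER = "\u2024"
workarounds = [
    catalog_title[:-1] + substitute
    for substitute in (FULLWIDTH_FULL_STOP, ONE_DOT_LEADER)
]
survivors = [strip_trailing_period(t) for t in workarounds]
```

The trade-off is that the stored title would then contain a character that looks like a period but is not one, which affects searching and copying.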

gbnewby commented 6 months ago

It seems this needs further discussion. I checked with the catalog team, and they confirmed that periods (and ellipses) in title and subtitle fields in the PG catalog database are significant and should be included in the landing pages & in-book metadata.

They did not think it was a good idea to use a character that is not a period, but looks like one, in lieu of an actual period.

Here are a couple more problematic titles:

https://www.gutenberg.org/cache/epub/33314/pg33314-images.html
https://www.gutenberg.org/cache/epub/60671/pg60671-images.html

792 instances were found using psql-> select pk, title from books where title LIKE '%.';

Eyeballing them indicates that nearly all instances of punctuation should be displayed.
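For anyone who wants to try the pattern without access to the PG database, the sketch below runs the same `LIKE '%.'` filter against an in-memory SQLite table. SQLite stands in for PostgreSQL here. The `books` table and its `pk` and `title` columns follow the query above, but the schema is an assumption, and every row except 73523 is made up for illustration.

```python
import sqlite3

# In-memory stand-in for the catalog database (assumption: the real
# table has at least the pk and title columns used in the query).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (pk INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO books (pk, title) VALUES (?, ?)",
    [
        (73523, "The girl from Bodies, Inc."),  # title from this issue
        (1, "A title without final punctuation"),  # illustrative row
        (2, "Wait for it..."),  # illustrative row ending in an ellipsis
    ],
)

# '%' matches any run of characters, so this selects every title whose
# last character is an ASCII period, including ASCII ellipses.
rows = conn.execute(
    "SELECT pk, title FROM books WHERE title LIKE '%.' ORDER BY pk"
).fetchall()

for pk, title in rows:
    print(pk, title)

conn.close()
```

The pattern matches only on the final character, so it catches abbreviations and ellipses typed as three periods alike. It would not match a title ending in the single-character ellipsis U+2026.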

Thanks.

eshellman commented 6 months ago

It's an easy change: https://github.com/gutenbergtools/libgutenberg/blob/052cfde969da9d9a3457b91a3cb85fb737e43392/libgutenberg/DublinCore.py#L468

gbnewby commented 6 months ago

Thanks for taking care of this. Let me know if there is more you need from me.
