kevinboone / txt2epub

A command-line utility for Linux, for making EPUB documents from plain text files
GNU General Public License v3.0

epub created by txt2epub shows warning for each chapter in okular (from libepub) and author field with superfluous tag #4

Open emk2203 opened 4 days ago

emk2203 commented 4 days ago

A generated file, opened in okular, shows the warning libepub (WW): - missing play order in nav point element once for each chapter.

The properties popup shows, in the author line, Author: aut: <author name> (<author name>), which looks odd. (<author name> is a placeholder for the actual name.)

I have not seen this with other ebooks in okular, so I assume this comes from the txt2epub formatting.

kevinboone commented 4 days ago

Hi. What specific action do I need to take to reproduce this? To be honest, I've never used Okular -- I thought it was just for PDFs. txt2epub hasn't been updated for seven years, but I'll fix it if it's not too difficult :)

emk2203 commented 3 days ago

I investigated a bit further and found a program to check epubs for compliance, epubcheck. There's a GUI for it with additional functionality as well.

This shows the errors which okular complains about and some more, even some critical ones.

If you need the source material, I can upload it to WeTransfer, though I suspect that the errors are shared by all output. The source just consists of around 150 chapter files with the simple structure of

Chapter ####: <chapter title>

<body text>

and a cover image with the correct 590x750 dimensions, nothing more.

The command to create the epub was in essence

txt2epub -o myepub.epub --first-lines \
  --author "A funny author pseudonym" --title "This is the title" \
  --cover-image cover.png chapter_????.txt

from within the directory.

To be clear, we are talking about non-essential stuff. After sending the output to my kindle via Amazon's conversion page, the books show up without any noticeable problem. The same goes for reading them on an Android device with an epub reader.

The one issue which prompted me to write this bug report (NCX pointers all over the place in okular) is clearly an okular bug since the NCX table works well on an Android epub reader and on the kindle.

But if you want to improve txt2epub to conform better to W3C's epubcheck, or even to output to the newer epub3 spec, I would be more than happy; your program really fills a niche.

It handled epub generation with ease, and embedded XHTML code, like footnotes added as short XHTML snippets, was converted gracefully. The 150 chapters per volume were also no problem. Thanks for writing this great piece of software!

kevinboone commented 2 days ago

The "missing play order" message from Okular is probably legitimate -- txt2epub does not specify a play order. In principle, EPUB chapters can be presented in a different order from the one they are listed in the NCX file. If there is no play order, then they are presented in the order they appear. Anyway, this is easy to fix. The other things will need a bit more thought.
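For reference, the play order is just an attribute on each navPoint in the NCX file; something like this (the ids and file names here are hypothetical):

```xml
<navMap>
  <navPoint id="navpoint-1" playOrder="1">
    <navLabel><text>Chapter 1</text></navLabel>
    <content src="file1.html"/>
  </navPoint>
  <navPoint id="navpoint-2" playOrder="2">
    <navLabel><text>Chapter 2</text></navLabel>
    <content src="file2.html"/>
  </navPoint>
</navMap>
```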

kevinboone commented 2 days ago

I suspect that the way Okular displays 'Aut: author_name' is an affectation of Okular. The EPUB 2 specification allows for multiple 'creator' entries, each with a role, one of which is 'aut', for 'author'. So txt2epub writes

<dc:creator opf:role="aut">Fred Bloggs</dc:creator>

which I believe is correct for EPUB 2. For EPUB 3, I think we should instead write:

<dc:creator id="author">Fred Bloggs</dc:creator>
<meta refines="#author" property="role">aut</meta>

which says the same thing, in a different way. But I fear that this content will fail on a reader that only supports EPUB 2 (but I've not tried).

In any case, I'm not very keen to change this, unless it's causing real heartache. I'm not sure how far-reaching the side-effects will be. If it's a really big deal, I could make it optional, with a command-line switch.

kevinboone commented 2 days ago

I've pushed a new version of txt2epub that fixes the 'missing play order' message, as well as some other fussy checks in the Pagina EPUB checker. Please feel free to try it, and let me know how it goes.

emk2203 commented 2 days ago

Thanks! The okular errors are completely gone now. I would agree that the aut: issue is an affectation of okular and can be disregarded. Didn't see anything like this in the Android reader. The Pagina EPUB checker still throws two FATAL level complaints:

$ grep -A 1 FATAL myebook_vol1_log.txt
FATAL (RSC-016) FATAL_preposition "myebook_vol1.epub/file32.html" (line 212, col 2):
   Fatal Error while parsing file: The content of elements must consist of well-formed character data or markup.
--
FATAL (RSC-016) FATAL_preposition "spare_me_great_lord_vol1.epub/file113.html" (line 49, col 67):
   Fatal Error while parsing file: The entity name must immediately follow the '&' in the entity reference.

There are also ERROR (RSC-005) complaints for each chapter, always at the same position:

ERROR (RSC-005) at "myebook_vol1.epub/file27.html" (line 9, col 5):
   Error while parsing file: element "h1" not allowed here; expected the element end-tag, text or element "a", "abbr", "acronym", "applet", "b", "bdo", "big", "br", "cite", "code", "del", "dfn", "em", "i", "iframe", "img", "ins", "kbd", "map", "noscript", "ns:svg", "object", "q", "samp", "script", "small", "span", "strong", "sub", "sup", "tt" or "var" (with xmlns:ns="http://www.w3.org/2000/svg")     

And also sometimes errors like this, also ERROR (RSC-005):

ERROR (RSC-005) at "myebook_vol1.epub/file27.html" (line 51, col 2):
   Error while parsing file: text not allowed here; expected the element end-tag or element "address", "blockquote", "del", "div", "dl", "h1", "h2", "h3", "h4", "h5", "h6", "hr", "ins", "noscript", "ns:svg", "ol", "p", "pre", "script", "table" or "ul" (with xmlns:ns="http://www.w3.org/2000/svg")

which is basically the same, just with text instead of h1 elements. The text complaints occur at different lines, but always in column 2. Only a few chapters are affected, but when they are, it is always with 5-6 or 12 occurrences.
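If I had to guess at the shape of the markup behind the h1 complaint, it looks like a block element opened where only inline content or an end-tag is allowed, roughly like this (hypothetical markup, not necessarily what txt2epub actually emits):

```xml
<!-- Invalid: a block element such as h1 inside a p -->
<p><h1>Chapter 1</h1></p>

<!-- Valid: close the paragraph first -->
<h1>Chapter 1</h1>
<p>Body text...</p>
```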

I am not sure how these matter, if at all, except for the FATAL ones.

kevinboone commented 2 days ago

Could you provide me with at least one source document? I'm guessing that there's stuff in your text that I'm not properly converting to xhtml. Do you use the "&" sign in your text? That will probably need special treatment, and I don't think I'm handling it well. Probably there are other characters that have a protected meaning in xhtml, and need to be escaped or encoded somehow. I can fix this, if I can reproduce it.
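A minimal sketch of the kind of entity escaping I have in mind (this is not txt2epub's actual code, just an illustration):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Escape the characters that have reserved meanings in XHTML.
   Returns a newly-allocated string; the caller must free it. */
char *xhtml_escape (const char *in)
  {
  /* Worst case, every input byte expands to "&amp;" (five bytes). */
  char *out = malloc (strlen (in) * 5 + 1);
  char *p = out;
  for (; *in; in++)
    {
    switch (*in)
      {
      case '&': strcpy (p, "&amp;"); p += 5; break;
      case '<': strcpy (p, "&lt;");  p += 4; break;
      case '>': strcpy (p, "&gt;");  p += 4; break;
      default: *p++ = *in;
      }
    }
  *p = 0;
  return out;
  }
```

So, for example, xhtml_escape("Guns&Steel") would yield "Guns&amp;Steel".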

kevinboone commented 1 day ago

In case the problem is with HTML entities, I have pushed a new version that I think handles these things better. However, this does mean some slight changes in the way txt2epub operates -- see 'XHTML support' in the README file. Do please let me know if this helps. If it doesn't, I really need a sample of your input data.

emk2203 commented 1 day ago

I'll try with your new version, but this part about the need to have full XHTML leaves me stumped. There are half a dozen chapters in there with footnotes, where I just added an XHTML snippet (see my post).

If I need to supply these chapters as .xhtml, while the others stay as .txt, would that be enough? And if so, how would a proper .xhtml file look like? Are there standard headers and footers for ebook usage? What is absolutely needed to create such a file?

The download link for the source is https://we.tl/t-UFxsk1DLC9. It expires on 2024-10-27. The overview document contains some cursory notes about the structure of the documents. The archive inside the archive contains the folders with the chapters plus cover image.

kevinboone commented 1 day ago

Here's the problem. Your source files are a mixture of (X)HTML and plain text. For example, in places you use real tags, like <p...>, while in others you write things that look a bit like tags, like "<< Lurking>>". Similarly, you have, for example, "Guns&Steel", but the "&" has a special meaning in XHTML. You probably just want the "&" sign rendered literally, but in XHTML it should really be written "&amp;". I would guess that a verifier will give you an 'invalid entity' warning, or something like that.

EBook readers are typically based on HTML viewers, and the maintainers of HTML viewers have had decades to develop the logic to handle all the badly-formed HTML that's floating about in the world. So, for example, "<< Lurking>>" isn't a valid (X)HTML element, so far as I know, even though it has the structure of one; an HTML renderer will guess that you probably meant the literal text "<< Lurking>>", and display it as such. But it takes a colossal amount of intelligence in software to handle these kinds of formatting errors, and guess what the author intended.

Until recently, all txt2epub did was copy the user-supplied text into an XHTML file, adding the necessary header and footer boilerplate. It turned blank lines in text files into <p>...</p> paragraph breaks, and that sort of thing but, other than that, it did no significant processing. Because EBook viewers are smart, the fact that you're mixing plain text with XHTML constructs doesn't affect the display. But any documents that mix XHTML with plain text that just looks a bit like XHTML, as yours do, will fail validation.

So how can the program distinguish between 'real' XHTML, and text that just looks like XHTML, unless you use some sort of naming convention to indicate what a file contains?

A possibility, I guess, would be to recognize some sort of indicator in a text file that says: what follows is already formatted as XHTML; do not fiddle with it. Any >, <, &, etc., that appear anywhere else would be turned into valid XHTML, while any text marked as pre-formatted would just be passed through unchanged. I'm not sure what syntax would be used for this -- perhaps a line that starts with "." could be treated as literal XHTML, up to the end of the line? A line starting with "." is unlikely to appear in ordinary text.

It's no problem for me to restore the original behaviour, and simply leave the stuff that looks like XHTML untouched. Usually it will display correctly. But it will probably fail validation, and there's nothing I can do about that, unless I want to turn a 200-line program into a two million-line program.

I'm open to suggestions, but I don't have time to implement something that's smart enough to guess the author's intentions ;)

emk2203 commented 1 day ago

The contents were not written by me, but scraped from the web, so it would be difficult to make sure that the text contains no offending characters. If I want to turn another web novel into an ebook, the appearance of these characters would be unavoidable without a lot of preprocessing.

My naive assumption was that if a plain text is processed, the critical characters just get escaped, while the small subset of supported markdown gets converted. Isn't this what your program was doing?

Obviously, when dealing with some inserted XHTML code, this would need to be refined.

Something like:

"special marker --> XHTML code follows, use verbatim until an end marker"
The charm of your program for me (and surely for others) lies in the fact that it is so simple and you have complete control, for more complicated things I would use pandoc or similar. So we both share the interest of not turning this from 200 lines into significantly more. Version 0.0.5 was already quite good.

Would the above idea work for you? It's also not relevant to satisfy fussy checkers completely. I am just trying to get a robust epub version which doesn't choke whatever device one is reading this on, now or in the future.

If it is clear which characters get escaped, which are treated as markdown and how special markers for beginning and end of XHTML parts look like, the rules would also be clear and memorable for users of the program, without making the changes overly complicated for you.

If the escaping is too complicated, it would be enough to have a list of these characters somewhere in the README, so that the preprocessing can be done by the user.

kevinboone commented 1 day ago

"special marker --> XHTML code follows, use verbatim until an end marker"

I guess the end marker could just be the end of the line, but there could be a specific marker, if it's necessary to span multiple lines. But it might be easier just to repeat the beginning-of-XHTML marker on each line.

The problem with this approach is that you wouldn't be able to write

"Blah blah blah <b>BLAH!</b> blah blah"

you'd have to write

Blah blah blah .<b>BLAH!</b> blah blah

or something. I guess it would also be legitimate to say:

Blah blah blah . <b>BLAH!</b> . blah blah

In fact, it would probably be fine just to say

. Blah blah blah <b>BLAH!</b>. blah blah

because there's nothing in the text that needs any special treatment. It's only when special characters are involved that there's an issue at all.

This kind of approach is less of a problem where there tends to be a chunk of XHTML all together, than if there are bits of it embedded in lines.

Such an approach would be easy to implement. At least, it would be much easier to implement than general logic that could tell that <p> was an XHTML tag and "<< Lurking>>" (or whatever) was not.

I've gotten away without thinking much about this issue until now, because I haven't used any special characters (<, >, &...) other than in XHTML. Or, if I did, the EPUB viewer was clever enough to hide my oversights.

Anyway, I suppose the behaviour could be made switchable.

Comments welcome.

emk2203 commented 15 hours ago

I guess the end marker could just be the end of the line, but there could be a specific marker, if it's necessary to span multiple lines. But it might be easier just to repeat the beginning-of-XHTML marker on each line.

Sometimes, the XHTML could be within the line and end within the line. Example: When I converted the footnotes in the webnovel, they had the format of word^1 for the footnote reference. I expanded this to word\<sup>\<a href="#fn1" id="r1" epub:type="noteref">[1]\</a>\</sup>.

There was text after the word^1 within the line, which shouldn't be treated as XHTML.

When I typed the above two paragraphs, I had to backslash-escape all the 'smaller than' and 'greater than' signs and the square brackets, otherwise it would have shown up as word[1].

This is what I expect txt2epub to do. Take an array of special characters and escape them as the default action. The exception would be any XHTML snippet, which could be designated by 𐄁 (U+10101), the 'Aegean Word Separator Dot'. Everything after this character is taken as-is to form the code, until a second 𐄁 (U+10101) character.

Advantages:

  1. Nobody uses it. If anyone wants to use a small dot in the middle of text, they use · (U+00B7) for this.
  2. It's dead simple to memorize and type: U+10101, entered by typing Ctrl+Shift+U 10101. It can't get much simpler than that.

My proposal:

  1. Look for text between two U+10101 characters and leave it as-is.
  2. Look for text between special markdown characters and convert them accordingly.
  3. For the rest, take an array of special characters and escape every single occurrence.

For 1. and 2., you would just need to extend your existing logic a little (at least I hope so...)

  3. would need some extra logic, but I hope that just escaping the special characters from the array shouldn't be too much effort.

This would take care of the concerns you raised, and wouldn't need extravagant logic to implement an XHTML parser.

Disadvantage as far as I can see it:

  1. Texts leave the realm of pure ASCII and enter UTF-8 territory. You would need to have an option to enter arbitrary Unicode characters.
  2. You cannot just paste XHTML into the text, it needs to be preceded and ended by the markers.

In 2024, I don't see these as an issue. People write emoji on their phones; I couldn't even name a device which can't handle UTF-8. As for typing them, I doubt that anyone will use txt2epub on anything other than a computer, where Unicode characters are easy to type with shortcuts on any OS.

It's not much effort to precede and end the XHTML code with the markers before or after pasting.

I've gotten away without thinking much about this issue until now, because I haven't used any special characters (<, >, &...) other than in XHTML. Or, if I did, the EPUB viewer was clever enough to hide my oversights.

I'm coming from a different position, this is why I noticed the shortcomings. My use case are texts where I don't have control over their contents. This is different from starting from scratch and being able to avoid special characters.

Comments welcome.

I hope these ideas are helpful.

Cheers.

1. This is where the linked reference takes you, it works even here on GitHub. You can go back by clicking the number to the left. Neat.

kevinboone commented 9 hours ago

Thanks. I fear the situation might not be as simple as you think, and txt2epub was supposed to be the work of one lunch-break. A similar thing happened with epub2txt: it was supposed to be trivially simple, but people kept asking for enhancements, and it ended up quite complex.

The problem is that txt2epub is not Unicode-aware, because it currently has no reason to be. The only characters it looks for in text, like newline, are single-byte, and that byte can't appear in a multi-byte sequence. The character you're proposing is actually a four-byte UTF-8 sequence -- 0xf0 0x90 0x84 0x81. So the code would have to check this pattern at all positions in a line. And strictly speaking, it would have to ensure that it's not matching the pattern in the middle of some other multi-byte sequence -- unlikely, but not impossible.

Or, the utility would have to work internally in a fixed-length encoding like UTF-32. That's what I had to do with epub2txt in the end, because trying to work out where the boundaries between UTF-8 sequences were was a real pain.

In addition, I'm not sure whether you expect your XHTML to span multiple lines, and whether you would want the marker to begin on one line and end on another. Implementing this would turn the conversion logic from a simple line-by-line scanner into a finite-state machine with lexical categories and what-not.

I just don't have time to implement this, but I'd be very happy to incorporate any implementation you came up with ;)

If you're prepared to take the risk that your marker character might inadvertently match the middle of a different UTF-8 sequence (unlikely), and that both beginning and end instances appear in the same line, I could probably do that without too much work.
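That restricted version might look something like this in C (a hypothetical sketch, not actual txt2epub code): scan each line byte by byte, toggle a verbatim flag whenever the four-byte marker sequence appears, and escape &, < and > outside verbatim regions.

```c
#include <stdio.h>
#include <string.h>

/* UTF-8 encoding of U+10101, the proposed verbatim marker. */
#define MARKER "\xf0\x90\x84\x81"
#define MARKER_LEN 4

/* Emit one line of input: text between a pair of markers is copied
   verbatim; everything else has &, < and > escaped as entities. */
void emit_line (const char *line, FILE *out)
  {
  int verbatim = 0;
  while (*line)
    {
    if (strncmp (line, MARKER, MARKER_LEN) == 0)
      {
      verbatim = !verbatim;   /* toggle at each marker */
      line += MARKER_LEN;
      continue;
      }
    if (!verbatim && *line == '&') { fputs ("&amp;", out); line++; }
    else if (!verbatim && *line == '<') { fputs ("&lt;", out); line++; }
    else if (!verbatim && *line == '>') { fputs ("&gt;", out); line++; }
    else fputc (*line++, out);
    }
  }
```

Note the residual risk mentioned above: strncmp could in principle match those four bytes in the middle of some other multi-byte sequence, though that is very unlikely in real text.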

emk2203 commented 9 hours ago

If you're prepared to take the risk that your marker character might inadvertently match the middle of a different UTF-8 sequence (unlikely), and that both beginning and end instances appear in the same line, I could probably do that without too much work.

This sounds good. The risk of this character matching the middle of another UTF-8 sequence is negligible. And the restriction that both markers appear in the same line is also sensible. I did this instinctively when I inserted the XHTML snippets into the text; it's safe to assume that everyone could do the same, especially with a notice in the README. There's no real reason to break a line in the middle of a statement.

So if you could implement this without too much work, I would look forward to it. Would this include the escaping of the other special characters?

A big thanks for all your efforts!