Open jgm opened 3 years ago
I just ran epubcheck v4.2.6 and v5.0.0-beta-2 on an EPUB generated by Pandoc v2.19.2 for Windows x64. I got a few errors:
R:\>java -jar epub\epubcheck-4.2.6\epubcheck.jar "foo.epub"
Validating using EPUB version 3.2 rules.
ERROR(RSC-005): foo.epub/EPUB/text/ch001.xhtml(27,81): Error while parsing file: The a element must not appear inside a elements.
... (seven similar errors removed for brevity)
Check finished with errors
Messages: 0 fatals / 8 errors / 0 warnings / 0 infos
EPUBCheck completed
R:\>java -jar epub\epubcheck-5.0.0-beta-2\epubcheck.jar "foo.epub"
Validating using EPUB version 3.3 rules.
ERROR(RSC-005): foo.epub/EPUB/text/ch001.xhtml(27,81): Error while parsing file: The a element must not appear inside a elements.
... (seven similar errors removed for brevity)
Check finished with errors
Messages: 0 fatals / 8 errors / 0 warnings / 0 infos
EPUBCheck completed
The offending code looks like this:
<body epub:type="bodymatter">
<section id="indholdsfortegnelse" class="level1 TOC-Heading" data-number="1">
<h1 class="TOC-Heading" data-number="1">Indholdsfortegnelse</h1>
<p><a href="ch002.xhtml#indledning">Indledning <a href="ch002.xhtml#indledning">1</a></a></p>
...
This file was generated from a Word 2021 document with TOC using pandoc -o foo.md foo.docx. The epub appears to work on a PocketBook InkPad Lite ebook reader.
The issue seems fairly trivial to fix, but unfortunately I don't know Haskell, so I can't submit a PR.
Let me know if you need more info or if there's anything I can do to help.
It would be helpful to have the Word document (or a big enough fragment of it to produce this result).
Note: the pandoc types (unfortunately) don't prevent you from nesting a link within a link. My guess is that something in the docx is getting parsed as a link within a link (where you have the paragraph text "Indledning 1"). This might be something we can address in the docx reader. Alternatively, we could implement some logic in the EPUB writer to walk the AST and fix things like this. But we'll have a better idea when we can see the docx source.
I'm adding something to the HTML writer (which will also affect EPUB) that prevents <a> inside <a>. That should fix this particular issue.
The error occurs when using this command:
pandoc -o "Sandheden om Gud.epub" "Sandheden om Gud.docx"
It can be seen in the embedded EPUB/text/ch001.xhtml file on line 27. I'm using Sigil to inspect the output.
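For anyone who wants to find these spots without opening the EPUB in an editor, here is a rough Python sketch that looks for an <a> nested inside another <a> in every .xhtml file of the archive. The function names nested_anchor_count and scan_epub are made up for this example, and it assumes the XHTML files are well-formed XML (undefined named entities would make the parser fail). It is not a replacement for epubcheck.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace prefix that ElementTree puts in front of XHTML tag names.
XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def nested_anchor_count(xhtml):
    """Count <a> elements that have another <a> as an ancestor.

    Accepts the document as str or bytes.
    """
    root = ET.fromstring(xhtml)

    def visit(elem, inside_anchor):
        is_anchor = elem.tag in (XHTML_NS + "a", "a")
        count = 1 if (is_anchor and inside_anchor) else 0
        for child in elem:
            count += visit(child, inside_anchor or is_anchor)
        return count

    return visit(root, False)


def scan_epub(path):
    """Return {member name: nested-anchor count} for the .xhtml files
    in the EPUB that contain at least one nested anchor."""
    report = {}
    with zipfile.ZipFile(path) as epub:
        for name in epub.namelist():
            if name.endswith(".xhtml"):
                count = nested_anchor_count(epub.read(name))
                if count:
                    report[name] = count
    return report
```

Calling scan_epub("foo.epub") returns a dictionary keyed by file name, so an empty result means no nested anchors were found.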
The source file is attached below:
Thanks for the change, though :-)
OK, looks like it's the translation of the table of contents that gives the issue:
, Para
    [ Link
        ( "" , [] , [] )
        [ Str "Indledning"
        , Space
        , Link ( "" , [] , [] ) [ Str "1" ] ( "#indledning" , "" )
        ]
        ( "#indledning" , "" )
I think it's dealt with sufficiently by the change I just made. docx apparently allows a hyperlink inside hyperlink text; HTML doesn't.
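To illustrate what un-nesting means for an AST like the one above, here is a minimal Python sketch that works on pandoc's JSON representation (as produced by pandoc -t json). It is not the change made in the HTML writer, only one possible approach: a link found inside another link is replaced by its own inline content, so the outer link survives. The name flatten_links and the helper class _Splice are invented for this example, and the sketch treats everything below a link (including footnote content) as being inside it.

```python
class _Splice(list):
    """Marker for a list of inlines that should be merged into the parent list."""


def flatten_links(node, inside_link=False):
    """Return a rewritten pandoc JSON AST in which any Link nested inside
    another Link is replaced by the inner link's inline content."""
    if isinstance(node, dict):
        if node.get("t") == "Link":
            # A Link carries [attr, inlines, target] in its "c" field.
            attr, inlines, target = node["c"]
            inner = flatten_links(inlines, True)
            if inside_link:
                # Dissolve this link: keep its text, drop the link wrapper.
                return _Splice(inner)
            return {"t": "Link", "c": [attr, inner, target]}
        return {key: flatten_links(value, inside_link)
                for key, value in node.items()}
    if isinstance(node, list):
        out = []
        for item in node:
            result = flatten_links(item, inside_link)
            if isinstance(result, _Splice):
                out.extend(result)  # splice the dissolved link's inlines in place
            else:
                out.append(result)
        return out
    return node  # strings, numbers, booleans, None
```

Applied to the paragraph above, this would leave a single link to #indledning whose text is "Indledning", a space, and "1". The whole document could be handled by loading the output of pandoc -t json, passing it through flatten_links, and feeding the result back to pandoc -f json.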
Currently we have reader tests but not writer tests.
We should also hook epubcheck into the format-validation action, to ensure that the produced epubs are valid.