jgm / pandoc

Universal markup converter
https://pandoc.org

Add EPUB writer tests #7585

Open jgm opened 3 years ago

jgm commented 3 years ago

Currently we have reader tests but not writer tests.

We should also hook epubcheck into the format-validation action, to ensure that the produced epubs are valid.

archfrog commented 2 years ago

I just ran epubcheck v4.2.6 and v5.0.0-beta-2 on an EPUB generated by Pandoc v2.19.2 for Windows x64. I got a few errors:

R:\>java -jar epub\epubcheck-4.2.6\epubcheck.jar "foo.epub"
Validating using EPUB version 3.2 rules.
ERROR(RSC-005): foo.epub/EPUB/text/ch001.xhtml(27,81): Error while parsing file: The a element must not appear inside a elements.
... (seven similar errors removed for brevity)

Check finished with errors
Messages: 0 fatals / 8 errors / 0 warnings / 0 infos

EPUBCheck completed

R:\>java -jar epub\epubcheck-5.0.0-beta-2\epubcheck.jar "foo.epub"
Validating using EPUB version 3.3 rules.
ERROR(RSC-005): foo.epub/EPUB/text/ch001.xhtml(27,81): Error while parsing file: The a element must not appear inside a elements.
... (seven similar errors removed for brevity)

Check finished with errors
Messages: 0 fatals / 8 errors / 0 warnings / 0 infos

EPUBCheck completed

The offending code looks like this:

<body epub:type="bodymatter">
<section id="indholdsfortegnelse" class="level1 TOC-Heading" data-number="1">
<h1 class="TOC-Heading" data-number="1">Indholdsfortegnelse</h1>
<p><a href="ch002.xhtml#indledning">Indledning <a href="ch002.xhtml#indledning">1</a></a></p>
...

This file was generated from a Word 2021 document with TOC using pandoc -o foo.md foo.docx. The epub appears to work on a PocketBook InkPad Lite ebook reader.

The issue seems fairly trivial to fix, but unfortunately I don't know Haskell, so I can't submit a PR.

Let me know if you need more info or if there's anything I can do to help.

jgm commented 2 years ago

It would be helpful to have the Word document (or a big enough fragment of it to produce this result).

jgm commented 2 years ago

Note: the pandoc types (unfortunately) don't prevent you from nesting a link within a link. My guess is that something in the docx is getting parsed as a link within a link (where you have the paragraph text "Indledning 1"). This might be something we can address in the docx reader. Alternatively, we could implement some logic in the EPUB writer to walk the AST and fix things like this. But we'll have a better idea when we can see the docx source.
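
For illustration, here is a minimal sketch of what "walk the AST and fix things like this" could look like. It uses a simplified stand-in for pandoc's `Inline` type rather than the real `Text.Pandoc.Definition`, and the names `stripLinks`, `unnestLinks`, and `tocEntry` are hypothetical. It is not necessarily how pandoc handles the problem; it only shows the idea of keeping the outermost link and splicing the text of any inner link into it.

```haskell
-- Simplified stand-in for pandoc's inline types (an assumption for this
-- sketch): only the constructors needed for the example are modeled.
type Attr = (String, [String], [(String, String)])
type Target = (String, String) -- (URL, title)

data Inline
  = Str String
  | Space
  | Emph [Inline]
  | Link Attr [Inline] Target
  deriving (Eq, Show)

-- Remove every Link in a list of inlines, splicing the link text
-- into the surrounding list in its place.
stripLinks :: [Inline] -> [Inline]
stripLinks = concatMap go
  where
    go (Link _ xs _) = stripLinks xs
    go (Emph xs)     = [Emph (stripLinks xs)]
    go x             = [x]

-- Keep outermost links, but strip any link nested inside their text.
unnestLinks :: [Inline] -> [Inline]
unnestLinks = map go
  where
    go (Link attr xs tgt) = Link attr (stripLinks xs) tgt
    go (Emph xs)          = Emph (unnestLinks xs)
    go x                  = x

-- A link whose text contains another link, modeled on a table-of-contents
-- entry such as "Indledning 1".
tocEntry :: [Inline]
tocEntry =
  [ Link ("", [], [])
      [ Str "Indledning"
      , Space
      , Link ("", [], []) [Str "1"] ("#indledning", "")
      ]
      ("#indledning", "")
  ]
```

Applying `unnestLinks` to `tocEntry` leaves a single link whose text is `Str "Indledning"`, `Space`, `Str "1"`. With the real pandoc types, a rewrite like this would more likely be expressed through the generic traversal in `Text.Pandoc.Walk` than by hand-written recursion.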

jgm commented 2 years ago

I'm adding something to the HTML writer (which will also affect EPUB) that prevents <a> inside <a>. That should fix this particular issue.
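
A rough sketch of the writer-side idea, assuming a toy renderer rather than pandoc's actual HTML writer: the renderer tracks whether it is already inside a link and, if so, emits only the text of any inner link. The types and the names `escape`, `render`, and `toHtml` are hypothetical, and link titles and attributes are ignored to keep the example short.

```haskell
-- Toy inline type (an assumption for this sketch, not pandoc's real type).
type Target = (String, String) -- (URL, title); the title is ignored below

data Inline
  = Str String
  | Space
  | Link [Inline] Target

-- Minimal escaping for text and attribute values.
escape :: String -> String
escape = concatMap esc
  where
    esc '<' = "&lt;"
    esc '>' = "&gt;"
    esc '&' = "&amp;"
    esc '"' = "&quot;"
    esc c   = [c]

-- Render inlines; the Bool records whether we are already inside an <a>.
render :: Bool -> [Inline] -> String
render inLink = concatMap go
  where
    go (Str s) = escape s
    go Space   = " "
    go (Link xs (url, _))
      | inLink    = render True xs -- nested link: emit its text only
      | otherwise = "<a href=\"" ++ escape url ++ "\">"
                    ++ render True xs ++ "</a>"

toHtml :: [Inline] -> String
toHtml = render False
```

With this approach the AST is left untouched and only the output changes, so other writers that tolerate nested links are unaffected.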

archfrog commented 2 years ago

The error occurs when using this command:

pandoc -o "Sandheden om Gud.epub" "Sandheden om Gud.docx"

It can be seen on line 27 of the embedded EPUB/text/ch001.xhtml file. I'm using Sigil to inspect the output.

The source file is attached below:

Sandheden om Gud.docx

Thanks for the change, though :-)

jgm commented 2 years ago

OK, looks like it's the translation of the table of contents that gives the issue:

, Para
    [ Link
        ( "" , [] , [] )
        [ Str "Indledning"
        , Space
        , Link ( "" , [] , [] ) [ Str "1" ] ( "#indledning" , "" )
        ]
        ( "#indledning" , "" )

I think it's dealt with sufficiently by the change I just made. docx apparently allows a hyperlink inside hyperlink text; HTML doesn't.