jgm / pandoc

Universal markup converter
https://pandoc.org

ePub output responds to TeX raw commands #7406

Open elliottslaughter opened 3 years ago

elliottslaughter commented 3 years ago

I find it unintuitive that Pandoc's ePub output appears to respond to raw TeX commands.

Consider the following two files. I would expect them to render effectively identically when ePub output is chosen.

File `test.md`:

````
% Title
% Author

```{=tex}
\ifcsdef{mainmatter}{%
\mainmatter
}{}
```

# Chapter 1

Text.
````


File `test2.md`:

```
% Title
% Author

# Chapter 1

Text.
```


I process them with the following commands:

```sh
pandoc test.md -o test.epub --toc
pandoc test2.md -o test2.epub --toc
```


A visual inspection of the two ePub files in iBooks reveals the following differences:

  * In `test.epub`, we see that `Title` occurs one additional time between the TOC and Chapter 1 (like you might expect to see in a printed book when transitioning from frontmatter to mainmatter).
  * In `test.epub`, there are two entries in the TOC, one each for Title and Chapter 1. In `test2.epub` there is only one entry in the TOC (for Chapter 1).

You can also verify the differences by e.g. diffing the extracted `toc.ncx` from each ePub file.

```sh
mkdir test_epub
mkdir test2_epub
cd test_epub/
unzip ../test.epub
cd ../test2_epub/
unzip ../test2.epub
cd ..
diff -u test_epub/EPUB/toc.ncx test2_epub/EPUB/toc.ncx
```


Produces:

```diff
--- test_epub/EPUB/toc.ncx  2021-06-24 06:22:52.000000000 -0700
+++ test2_epub/EPUB/toc.ncx 2021-06-24 06:22:54.000000000 -0700
@@ -1,7 +1,7 @@
 <?xml version="1.0" encoding="UTF-8"?>
 <ncx version="2005-1" xmlns="http://www.daisy.org/z3986/2005/ncx/">
   <head>
-    <meta name="dtb:uid" content="urn:uuid:27eb95e7-2ae5-4939-8ab6-d971eb80df91" />
+    <meta name="dtb:uid" content="urn:uuid:9c96edeb-bc92-4f87-bcda-1acf8601b5d3" />
     <meta name="dtb:depth" content="1" />
     <meta name="dtb:totalPageCount" content="0" />
     <meta name="dtb:maxPageNumber" content="0" />
@@ -18,15 +18,9 @@
     </navPoint>
     <navPoint id="navPoint-1">
       <navLabel>
-        <text>Title</text>
-      </navLabel>
-      <content src="text/ch001.xhtml#title" />
-    </navPoint>
-    <navPoint id="navPoint-2">
-      <navLabel>
         <text>Chapter 1</text>
       </navLabel>
-      <content src="text/ch002.xhtml#chapter-1" />
+      <content src="text/ch001.xhtml#chapter-1" />
     </navPoint>
   </navMap>
 </ncx>
```

Is it possible that Pandoc is somehow applying LaTeX's `\mainmatter` logic in ePub output? Is this intended? (I'm pretty sure it's not documented.)

My normal expectation is that raw code for a given format is applied only in that format. So I'd normally expect that putting raw TeX code into a Markdown file wouldn't affect rendering to ePub at all.

This is on macOS 11.4 with Pandoc 2.14.0.3.

```
$ pandoc --version
pandoc 2.14.0.3
Compiled with pandoc-types 1.22, texmath 0.12.3, skylighting 0.10.5.2,
citeproc 0.4.0.1, ipynb 0.1.0.1
```

jgm commented 3 years ago

Pandoc does the splitting of the document into chapters before the rendering phase. In the rendering phase, no output will be produced by the raw LaTeX block, but the splitting phase still sees a block there and produces a chapter for it. We should be able to make the splitting more intelligent, though.
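
To make the ordering concrete, here is a toy model in Python. It is not pandoc's implementation: the `Block` type, `split_into_chapters`, and `render` are hypothetical names invented for this sketch, and the splitting rule (a new chapter at every header, plus one for anything before the first header) is a simplifying assumption. It only illustrates how a block that renders to nothing can still occupy a chapter when splitting happens first.

```python
from dataclasses import dataclass


@dataclass
class Block:
    kind: str             # "header", "para", or "raw"
    text: str
    raw_format: str = ""  # only meaningful when kind == "raw"


def split_into_chapters(blocks):
    """Start a chapter at every header; blocks before the first header get their own chapter."""
    chapters = []
    for block in blocks:
        if block.kind == "header" or not chapters:
            chapters.append([])
        chapters[-1].append(block)
    return chapters


def render(chapter, target="epub"):
    """Render one chapter, skipping raw blocks written for a different format."""
    return [b.text for b in chapter if b.kind != "raw" or b.raw_format == target]


# Rough stand-ins for test.md (with the raw TeX block) and test2.md (without it).
with_raw = [
    Block("raw", r"\mainmatter", "tex"),
    Block("header", "Chapter 1"),
    Block("para", "Text."),
]
without_raw = with_raw[1:]

chapters_with_raw = split_into_chapters(with_raw)
chapters_without_raw = split_into_chapters(without_raw)

print(len(chapters_with_raw), len(chapters_without_raw))
print(render(chapters_with_raw[0]))  # the extra chapter has no rendered content
```

In this model the raw block forms a leading chapter of its own even though rendering it for ePub yields nothing, which matches the extra TOC entry described in the report.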

jgm commented 3 years ago

You can work around this by using a Lua filter that removes raw TeX blocks from the AST; this happens prior to splitting.

Untested:

```lua
function RawBlock(el)
  if el.format == "tex" then return {} end
end
```
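
The same idea can be sketched in Python using only the standard library, operating on pandoc's JSON AST. The function names (`is_raw_tex_block`, `strip_raw_tex`, `filter_for_target`) are made up for illustration, and the sample document is a hand-written approximation of the AST for `test.md`, not actual pandoc output. Limiting the removal to EPUB targets and also matching the format name `latex` are added assumptions, so that LaTeX/PDF output would keep the `\mainmatter` logic.

```python
RAW_TEX_FORMATS = {"tex", "latex"}  # assumption: treat both format names as TeX


def is_raw_tex_block(node):
    """True for a JSON node shaped like {"t": "RawBlock", "c": [format, text]}."""
    return (
        isinstance(node, dict)
        and node.get("t") == "RawBlock"
        and node["c"][0].lower() in RAW_TEX_FORMATS
    )


def strip_raw_tex(node):
    """Return a copy of the JSON AST with raw TeX blocks removed at any depth."""
    if isinstance(node, list):
        return [strip_raw_tex(child) for child in node if not is_raw_tex_block(child)]
    if isinstance(node, dict):
        return {key: strip_raw_tex(value) for key, value in node.items()}
    return node


def filter_for_target(doc, target):
    """Strip raw TeX only for EPUB targets (e.g. "epub", "epub2", "epub3")."""
    return strip_raw_tex(doc) if target.startswith("epub") else doc


# Hand-written approximation of the JSON AST for test.md (illustrative only).
doc = {
    "pandoc-api-version": [1, 22],
    "meta": {},
    "blocks": [
        {"t": "RawBlock", "c": ["tex", "\\ifcsdef{mainmatter}{%\n\\mainmatter\n}{}"]},
        {"t": "Header", "c": [1, ["chapter-1", [], []],
                              [{"t": "Str", "c": "Chapter"}, {"t": "Space"}, {"t": "Str", "c": "1"}]]},
        {"t": "Para", "c": [{"t": "Str", "c": "Text."}]},
    ],
}

print([block["t"] for block in filter_for_target(doc, "epub3")["blocks"]])
print([block["t"] for block in filter_for_target(doc, "latex")["blocks"]])
```

To use this as a real JSON filter, the script would read the AST with `json.load(sys.stdin)`, take the target format from `sys.argv[1]` (pandoc passes it as the first argument to JSON filters), and write the result with `json.dump` to `sys.stdout`.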

elliottslaughter commented 3 years ago

That's basically what I did, except I wrote the filter in Haskell. In case it helps anyone else:

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Text.Pandoc.Definition
import Text.Pandoc.JSON

-- Drop raw TeX blocks; pass every other block through unchanged.
rawTeX :: Block -> [Block]
rawTeX (RawBlock (Format "tex") _) = []
rawTeX x = [x]

main :: IO ()
main = toJSONFilter rawTeX
```