TOC empty in 3.1.11 (jgm/pandoc #9295)

elliottslaughter commented 6 months ago

I believe that the fix for #9255 is not quite working in Pandoc 3.1.11. When building with 3.1.10, everything works normally. When I upgrade to 3.1.11, the TOC is empty: I have a "Contents" page but there are no entries in the TOC itself.

Everything should be identical except for the Pandoc version, which I am flipping back and forth.

Other details, not sure what matters:

I am using a custom template, but it is very similar to Pandoc's default template. Basically the only thing I replace is the title page.
I am using --pdf-engine=lualatex. I wonder if different TeX engines have different output, and maybe that's tripping up the new parsing algorithm?

My lualatex version is:

$ lualatex --version
This is LuaHBTeX, Version 1.16.0 (TeX Live 2023)
Development id: 7567

Execute  'luahbtex --credits'  for credits and version details.

There is NO warranty. Redistribution of this software is covered by
the terms of the GNU General Public License, version 2 or (at your option)
any later version. For more information about these matters, see the file
named COPYING and the LuaTeX source.

LuaTeX is Copyright 2022 Taco Hoekwater and the LuaTeX Team.

I'm not quite sure how to debug this, but if there are things I can do, let me know.

jgm commented 6 months ago

It would be helpful if you could run some experiments with (a) the default latex template, and (b) xelatex instead of lualatex. This could help rule out some possible explanations.

jgm commented 6 months ago

Note commit 3c178690e307f6f2e43d64c341712b1bf609e7fc which came in after 3.1.11 was released. It may fix the issue you're having.

elliottslaughter commented 6 months ago

Indeed, I have confirmed that building Pandoc from source at commit 87533e2d04539cf27e58e287759912f897962170 (which is newer than the one you linked) fixes the problem.

elliottslaughter commented 6 months ago

Sorry, this actually isn't fixed. I'm not sure what I did, but somehow I messed up my test in https://github.com/jgm/pandoc/issues/9295#issuecomment-1873656296.

I've been digging in to figure out what's going on and I think I know what's happening now.

Here's a minimal reproducer (note this is two files, test.md and test.json):

I produced test.json by running pandoc test.md -o test.json and then hand-modifying the Header to clear the identifier to "". This approximates something I'm doing in a filter in my custom build, where I'm building headers with nullAttr.
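
The hand edit described above could be scripted roughly as follows. This is only a sketch, not the filter from the custom build: it assumes Pandoc's JSON AST layout, in which a Header block looks like {"t": "Header", "c": [level, [identifier, classes, attributes], inlines]}, and it only visits top-level blocks. The function name clear_header_ids and the sample document are made up for the example.

```python
import json

def clear_header_ids(doc):
    """Return a copy of a Pandoc JSON document whose top-level
    Header blocks have their identifier set to ""."""
    out = json.loads(json.dumps(doc))  # deep copy via a JSON round trip
    for block in out.get("blocks", []):
        if block.get("t") == "Header":
            level, attr, inlines = block["c"]
            # attr is [identifier, classes, key-value attributes]
            block["c"] = [level, ["", attr[1], attr[2]], inlines]
    return out

# Hypothetical sample resembling the output of `pandoc test.md -t json`;
# the API version value is illustrative.
sample = {
    "pandoc-api-version": [1, 23, 1],
    "meta": {},
    "blocks": [
        {"t": "Header",
         "c": [1, ["test-chapter", [], []],
               [{"t": "Str", "c": "Test"}, {"t": "Space"},
                {"t": "Str", "c": "Chapter"}]]},
        {"t": "Para", "c": [{"t": "Str", "c": "Text."}]},
    ],
}

print(json.dumps(clear_header_ids(sample)))
```

Saving the printed JSON as test.json should give an input equivalent to the hand-edited file, with the header carrying an empty identifier.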

Now here's where things get fascinating:

$ pandoc --standalone --toc test.md -o test.md.tex
$ pandoc --standalone --toc test.json -o test.json.tex
$ diff -u test.md.tex test.json.tex
--- test.md.tex 2024-01-05 22:24:18
+++ test.json.tex   2024-01-05 22:29:22
@@ -63,7 +63,7 @@
 \setcounter{tocdepth}{3}
 \tableofcontents
 }
-\section{Test Chapter}\label{test-chapter}
+\section{Test Chapter}

 Text.

The TeX source generated from my JSON file has no \label on the section. Because the label is missing, pdflatex doesn't emit the "Rerun" warning:

$ rm -f *.toc *.log *.aux && pdflatex test.md.tex &> out.log && grep 'Rerun' out.log 
LaTeX Warning: Label(s) may have changed. Rerun to get cross-references right.
$ rm -f *.toc *.log *.aux && pdflatex test.json.tex &> out.log && grep 'Rerun' out.log

Therefore, when Pandoc attempts to generate a PDF for my json file, it doesn't see a warning, doesn't think it needs to rerun pdflatex, and doesn't end up filling the TOC.

Looking at the output log, the only item I see that you could maybe usefully look for is:

No file test.json.toc.

Maybe Pandoc needs to additionally check this message to catch missing TOCs?

Otherwise it becomes a hard requirement that anything generating Pandoc ASTs must generate the corresponding identifiers, purely for the purpose of generating labels that will trigger warnings when running pdflatex. That seems like an unintuitive requirement, and easy to get wrong.
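
To make the suggestion concrete, a log check along these lines could look like the sketch below. It is not Pandoc's actual rerun logic (Pandoc is written in Haskell, and its implementation is not shown in this thread); the function name needs_rerun and both patterns are assumptions based only on the two log messages quoted above. It also assumes the message is not split by TeX's log line wrapping.

```python
import re

# Patterns taken from the log lines quoted in this thread.
RERUN_WARNING = re.compile(r"Rerun to get")
MISSING_TOC = re.compile(r"No file \S+\.toc\.")

def needs_rerun(log_text):
    """Return True if the log contains a 'Rerun' warning or reports
    that the .toc file did not exist yet."""
    return bool(RERUN_WARNING.search(log_text) or MISSING_TOC.search(log_text))

print(needs_rerun("No file test.json.toc.\n"))
```

With this check, the first run on test.json.tex would be flagged for a rerun even though no label warning is printed.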

elliottslaughter commented 6 months ago

Actually, I think the approach suggested in my last comment (looking for No file ...) is going to be insufficient. When you have a large file with many sections, the TOC can span multiple pages, and it then takes a total of 3 runs of pdflatex to get the correct page numbers in the TOC. The second run of pdflatex generates no warnings and no "No file example.toc" message, so there is literally no way to detect this case from the log.

Here's a reproducer with a large TOC:

large.md
large.json (generated by pandoc large.md -o large.json and then hand-edited to remove Header identifiers)

As before I generate large.json.tex via:

pandoc --standalone --toc large.json -o large.json.tex

Here's the output from the first three runs of pdflatex large.json.tex:

You can see for yourself that the PDF is correct only in the 3rd run, and the log files provide no guidance as to how many runs we need.

Therefore, I think the only reliable solution is to require 3 passes of pdflatex when a TOC is being requested. We can rely on warnings in non-TOC cases, but when a TOC is involved, I think we can't get around hard-coding the number of runs.

elliottslaughter commented 6 months ago

I guess another solution could be to diff the *.toc files before and after each run of pdflatex. If the .toc file does not change, presumably we do not need to rerun pdflatex. That could potentially save 1 of the 3 runs in a subset of situations (e.g., when the book class is used, or the article class with a TOC that happens to fit on 1 page). For some users with large documents that might be a significant speedup.
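
A rough sketch of that idea, under stated assumptions: run_engine is a hypothetical callable that performs one pdflatex run, the .toc path is passed in by the caller, and the cap of 3 runs comes from the earlier comment. This is an illustration of the comparison loop, not how Pandoc implements PDF generation.

```python
import hashlib
from pathlib import Path

def toc_digest(toc_path):
    """Hash the .toc file, or return None if it does not exist yet."""
    path = Path(toc_path)
    if not path.exists():
        return None
    return hashlib.sha256(path.read_bytes()).hexdigest()

def run_until_toc_stable(run_engine, toc_path, max_runs=3):
    """Call run_engine() until the .toc file stops changing.

    Returns the number of runs performed (at most max_runs).
    """
    before = toc_digest(toc_path)
    for runs in range(1, max_runs + 1):
        run_engine()
        after = toc_digest(toc_path)
        if after == before:
            # This run read the same .toc that it wrote, so the TOC
            # typeset in the PDF matches the current layout.
            return runs
        before = after
    return max_runs

# Possible usage (assumes pdflatex is on PATH; not executed here):
#   import subprocess
#   run_until_toc_stable(
#       lambda: subprocess.run(
#           ["pdflatex", "-interaction=nonstopmode", "large.json.tex"],
#           check=True),
#       "large.json.toc")
```

A document with no TOC never produces a .toc file, so the loop stops after a single run; a TOC whose page numbers do not shift stops after two.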

jgm commented 6 months ago

Thanks for your careful analysis!
