Broken TOC links when converting multiple Markdown files to epub3

Canuck317 commented 1 year ago

I mentioned this issue in "Problem with missing TOC for h1 preceded by div /div (Issue #8996)" and was advised to submit a new issue.

For v3.0 and up, generating epub3 documents from a list of Markdown files with table of contents has resulted in broken links in the table of contents. All links give the error "Destination does not exist". This is using files and commands that continue to work fine with v2.19. Headings at the beginning of each Markdown file indicate the section name.

# Introduction

Not every captain, blah, blah, blah, text goes on

The problem is that v3.0 and up does not add the name of the xhtml file containing the destination section to the destination link. Manually adding the file name fixes the link and corrects the problem. I note as well that all documents in the list file are combined into ch001.xhtml, which is not the case with a file list in v2.19 or when different sections are in the same Markdown file.

List of Markdown files, each with a # section header:

Broken link: <a href="text/#introduction.md__introduction">Introduction</a>

Fixed link: <a href="text/ch001.xhtml#introduction.md__introduction">Introduction</a>
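
As a stopgap, the manual fix above (adding the xhtml file name before the fragment) could be scripted. This is a minimal Python sketch, not part of pandoc. It assumes every broken target lives in ch001.xhtml, as in this report; the function name `add_chapter_file` and its default argument are hypothetical.

```python
import re

# Matches href values whose path part ends in "/" right before the fragment,
# e.g. href="text/#introduction.md__introduction" (xhtml file name missing).
_BROKEN_HREF = re.compile(r'href="([^"#]*/)#([^"]+)"')


def add_chapter_file(nav_html: str, chapter_file: str = "ch001.xhtml") -> str:
    """Insert chapter_file before the fragment of every file-less href.

    Assumption: all targets are in a single chapter file, as observed
    in this report. Links that already name a file are left untouched.
    """
    return _BROKEN_HREF.sub(
        lambda m: f'href="{m.group(1)}{chapter_file}#{m.group(2)}"',
        nav_html,
    )


broken = '<a href="text/#introduction.md__introduction">Introduction</a>'
print(add_chapter_file(broken))
# prints: <a href="text/ch001.xhtml#introduction.md__introduction">Introduction</a>
```

This only patches the symptom in the generated navigation file; it does nothing about the missing chapter splitting.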

The command used for this particular submission was: pandoc -o outputFile.epub --toc -f markdown -t epub3 --file-scope $(type ./listFile.txt) --epub-metadata=./zmeta.md

We usually use a slightly customized css file and epub template, but I omitted them while troubleshooting this.

For testing purposes, I copied a few pieces of two of the files in the list into a single file. The issue did not come up. Each section was its own .xhtml file, and the TOC links were generated properly.

Single file, multiple # section headers:

Thank you.

jgm commented 1 year ago

I can't do much unless you can give me an actual test case that I can reproduce myself...

Canuck317 commented 1 year ago

> I can't do much unless you can give me an actual test case that I can reproduce myself...

I've attached files with a test case that reproduces the problem on my machine. I should probably have mentioned that I'm currently using Windows 10 PowerShell and pandoc version 3.1.6.1.

Using the list file to produce outputFile.epub, TOC links are missing xhtml filename and do not work: pandoc -o outputFile.epub --toc -f markdown -t epub3 --file-scope $(type ./listFile.txt)

Using a single file with the same text combined works fine: pandoc -o single.epub --toc -f markdown -t epub3 -i combined.md
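
To compare the two outputs without opening them in a reader, the archives can be scanned for links of the broken shape. This is a rough Python sketch; the helper name `file_less_hrefs` is hypothetical, and it assumes only that the EPUB is a zip archive containing .xhtml members. It makes no assumption about where pandoc places the navigation document.

```python
import re
import zipfile

_HREF = re.compile(r'href="([^"]*)"')


def file_less_hrefs(epub):
    """Return (member, href) pairs where href has a fragment but no file name.

    `epub` may be a path or a binary file-like object. An href counts as
    file-less when the part before "#" ends in "/", e.g. "text/#some-id".
    """
    found = []
    with zipfile.ZipFile(epub) as archive:
        for name in archive.namelist():
            if not name.endswith(".xhtml"):
                continue
            text = archive.read(name).decode("utf-8")
            for href in _HREF.findall(text):
                path, sep, _fragment = href.partition("#")
                if sep and path.endswith("/"):
                    found.append((name, href))
    return found


if __name__ == "__main__":
    # File names taken from the two commands above.
    for epub_path in ("outputFile.epub", "single.epub"):
        try:
            print(epub_path, file_less_hrefs(epub_path))
        except FileNotFoundError:
            print(epub_path, "not found")
```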

TestBook.zip

jgm commented 1 year ago

observations:

- --file-scope is playing a role here: without it, we get splitting into four chapter files and proper links.
- Omitting zmeta.md fixes the problem with the links, but we still don't get splitting.

Todo: look at interactions between file-scope (which I believe inserts divs to track source locations) and section splitting code.

Canuck317 commented 1 year ago

> - --file-scope is playing a role here: without it, we get splitting into four chapter files and proper links.
> - Omitting zmeta.md fixes the problem with the links, but we still don't get splitting.

I think it's less to do with the zmeta.md file itself and more to do with the presence of the metadata tags. I can reproduce that omitting 'zmeta.md' fixes the link problem. But if I put even just a title tag in, either at the start of Section1.md or using --metadata title="Test Title", the broken links are back.

jgm commented 1 year ago

When --file-scope is used, in T.P.App.Input, we apply a function adjustLinksAndIds to the parsed blocks produced from each file. This does the following (see #6384 for motivation):

wraps the file's blocks in a Div with an identifier derived from the file path
rewrites identifiers on elements with attributes by prepending an identifier derived from the file path (this ensures uniqueness)
rewrites links to one of the files using the new identifier with the prefixed file path
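
The three steps above can be illustrated with a toy model. This Python sketch is not pandoc's Haskell implementation. The `Header`, `Link`, and `Div` classes are hypothetical stand-ins for AST nodes, links are treated as block-level items for brevity, and the prefix rule (file name, then `__`, then the original identifier) is inferred from the `introduction.md__introduction` link shown earlier in this thread. The exact handling of link targets is an assumption.

```python
from dataclasses import dataclass, replace
from pathlib import PurePosixPath
from typing import List


@dataclass
class Header:
    level: int
    ident: str


@dataclass
class Link:
    target: str


@dataclass
class Div:
    ident: str
    blocks: list


def file_prefix(path: str) -> str:
    # Assumption: the identifier is just the file name, e.g. "introduction.md".
    return PurePosixPath(path).name


def retarget(prefix: str, target: str, input_files: List[str]) -> str:
    path, sep, fragment = target.partition("#")
    if not path and sep:
        # Link within the same file: "#foo" -> "#<this file>__foo".
        return f"#{prefix}__{fragment}"
    if path in input_files:
        # Link to another input file: point at its prefixed identifier,
        # or at its wrapper Div when there is no fragment.
        other = file_prefix(path)
        return f"#{other}__{fragment}" if fragment else f"#{other}"
    return target  # external or unrelated link, left alone


def adjust_links_and_ids(path: str, blocks: list, input_files: List[str]) -> Div:
    prefix = file_prefix(path)
    adjusted = []
    for block in blocks:
        if isinstance(block, Header) and block.ident:
            block = replace(block, ident=f"{prefix}__{block.ident}")
        elif isinstance(block, Link):
            block = replace(block, target=retarget(prefix, block.target, input_files))
        adjusted.append(block)
    # Step 1 from the list: wrap the file's blocks in a Div named after the file.
    return Div(ident=prefix, blocks=adjusted)
```

The relevant consequence is the last line: every input file arrives at the writer as a single top-level Div, with its headings nested one level down.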

These changes are applied before the AST is sent to the EPUB writer, and I believe these extra Divs are fouling up the EPUB writer's attempt to divide the document into sections.

Probably the best fix is to change the section-splitting code so that it's aware of these extra divs.
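
A toy illustration of why wrapper Divs can defeat splitting, and of what "aware of these extra divs" might mean. This Python sketch uses hypothetical `Header`, `Para`, and `Div` classes and a deliberately naive splitter; it is not the logic of makeSections or of the EPUB writer. It also unwraps every Div and discards the wrapper identifiers, which a real fix would presumably need to preserve.

```python
from dataclasses import dataclass


@dataclass
class Header:
    level: int
    ident: str


@dataclass
class Para:
    text: str


@dataclass
class Div:
    ident: str
    blocks: list


def split_naive(blocks):
    """Start a new chapter at each top-level level-1 Header; Divs are opaque."""
    chapters = []
    for block in blocks:
        starts_chapter = isinstance(block, Header) and block.level == 1
        if starts_chapter or not chapters:
            chapters.append([])
        chapters[-1].append(block)
    return chapters


def unwrap(blocks):
    """Yield blocks with Div wrappers removed (simplification: all Divs)."""
    for block in blocks:
        if isinstance(block, Div):
            yield from unwrap(block.blocks)
        else:
            yield block


def split_div_aware(blocks):
    """Look through wrapper Divs before splitting on level-1 Headers."""
    return split_naive(list(unwrap(blocks)))


doc = [
    Div("a.md", [Header(1, "a.md__one"), Para("first file")]),
    Div("b.md", [Header(1, "b.md__two"), Para("second file")]),
]
print(len(split_naive(doc)), len(split_div_aware(doc)))  # prints: 1 2
```

In this model the naive splitter puts everything into one chapter, which mirrors the report that all files ended up in ch001.xhtml.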

jgm commented 1 year ago

Unfortunately there are a few pieces of code that would need to know about these: not just makeSections (T.P.Shared) but splitIntoChunks and toTOCTree (T.P.Chunks).

jgm / pandoc

Broken TOC links when converting multiple Markdown files to epub3 #9009