jgm / pandoc

Universal markup converter
https://pandoc.org

Broken internal links when converting epub #10207

Open Enivex opened 1 month ago

Enivex commented 1 month ago

Explain the problem. Take an epub file that uses internal links, e.g. https://dieterplex.github.io/rust-ebookshelf/The%20Rust%20Programming%20Language.epub

Run pandoc -f epub -t typst '.\The Rust Programming Language.epub' --standalone -o 'trpl.typ'. The exact options or even output file type are not very important.

The resulting file includes links like #link(label("ch01-01-installation.html#troubleshooting")) (there are some other variants too). These do not work, because the label they refer to does not exist in the document. The closest match is <ch01-01-installation.html>, which refers to the entire chapter.

Pandoc version? What version of pandoc are you using, on what OS? (If it's not the latest release, please try with the latest release before reporting the issue.)

pandoc 3.4, Windows 11


A separate issue is that in order for images to work, the files have to be manually extracted from the epub and then placed correctly relative to the resulting typ file. (Not sure if I should create a separate issue for this.) (This particular issue was fixed by adding the --extract-media . option. Not entirely sure why the . is required. Without it I get Couldn't extract ePub file: Did not find end of central directory signature)

jgm commented 1 month ago

--extract-media requires an argument (file path). That is why.

jgm commented 1 month ago

I think I've fixed this issue. Note, however, that the EPUB you uploaded does not define an identifier troubleshooting in installation.html. You can make this work by adding -f html+auto_identifiers, but this seems like a bug in the EPUB.

Enivex commented 1 month ago

> I think I've fixed this issue. Note, however, that the EPUB you uploaded does not define an identifier troubleshooting in installation.html. You can make this work by adding -f html+auto_identifiers, but this seems like a bug in the EPUB.

You're right. I didn't upload it myself, I just looked for one where I was getting similar errors. Turns out this one had broken links even in the epub.

I'll try the nightly release later.

Enivex commented 1 month ago

Unfortunately the issue is not solved in the original file I was initially interested in. There are 237 missing label errors, even after trying to add auto_identifiers.

[image]

The parts after _ correspond to ids in the HTML files, but no corresponding labels are being created in the typ file.

e.g. in part0111.html: [image]

Edit: Version 3.4-nightly-2024-09-23

jgm commented 1 month ago

What do these anchors point to in the epub? You may need to unzip it and inspect the xhtml files contained therein, e.g. look in part0007.html and try to find the thing that has id pre or ack1.
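A minimal sketch of that inspection step, for anyone who prefers not to unzip by hand. It assumes the EPUB is an ordinary zip archive whose content files end in .html, .xhtml, or .htm; the function name find_id_in_epub is made up for illustration, and the regex is a rough match on id="..." attributes, not a real XHTML parser.

```python
import re
import zipfile


def find_id_in_epub(epub_path, identifier):
    """Return (member name, opening tag) pairs for elements carrying the given id."""
    # Rough match: an opening tag containing id="identifier" (single or double quotes).
    pattern = re.compile(
        r"<[^<>]*\bid\s*=\s*[\"']" + re.escape(identifier) + r"[\"'][^<>]*>"
    )
    hits = []
    with zipfile.ZipFile(epub_path) as epub:
        for name in epub.namelist():
            # Only look at the (X)HTML content files inside the archive.
            if not name.lower().endswith((".html", ".xhtml", ".htm")):
                continue
            text = epub.read(name).decode("utf-8", errors="replace")
            for match in pattern.finditer(text):
                hits.append((name, match.group(0)))
    return hits
```

Calling find_id_in_epub("book.epub", "pre") would list each file that defines that id together with the tag it sits on, which is enough to tell whether the anchor is on a heading, a p, or an a element.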

Enivex commented 1 month ago

> What do these anchors point to in the epub? You may need to unzip it and inspect the xhtml files contained therein, e.g. look in part0007.html and try to find the thing that has id pre or ack1.

ack1 corresponds to another link back to the other one: [image]

pre corresponds to a heading: [image]

jgm commented 4 weeks ago

If you can give me an epub to test with, it would really help, even if it's just stripped down to a couple of these examples (all the better).

jgm commented 4 weeks ago

There is code that should be changing these ids. At least the pre should work (on a heading). The identifier on the a href might be ignored by the typst writer.

Enivex commented 4 weeks ago

> If you can give me an epub to test with, it would really help, even if it's just stripped down to a couple of these examples (all the better).

I don't mind sending it to you for troubleshooting purposes, but I can't post it on github for obvious reasons.

Edit: Sent via email. Hopefully the large attachment doesn't cause issues.

Enivex commented 4 weeks ago

> There is code that should be changing these ids. At least the pre should work (on a heading). The identifier on the a href might be ignored by the typst writer.

I just tested with latex output instead, and that has the same issue, so it's not only the typst writer.

jgm commented 4 weeks ago

Most of the writers won't pay any attention to an identifier attribute on a Link element. (Try HTML.)

Enivex commented 4 weeks ago

> Most of the writers won't pay any attention to an identifier attribute on a Link element. (Try HTML.)

Converting to html does work, but that's not that surprising (since epub is html based).

jgm commented 4 weeks ago

> that's not that surprising (since epub is html based)

Yes, but remember that pandoc isn't just moving the HTML from EPUB to the output. It is parsing everything into an intermediate data structure and re-rendering it. If it works with HTML, that shows that the identifier on the link does get parsed and represented in the AST. So the issue is simply that the Typst (and LaTeX) writer doesn't do anything with this attribute.

Enivex commented 4 weeks ago

> > that's not that surprising (since epub is html based)

> Yes, but remember that pandoc isn't just moving the HTML from EPUB to the output. It is parsing everything into an intermediate data structure and re-rendering it. If it works with HTML, that shows that the identifier on the link does get parsed and represented in the AST. So the issue is simply that the Typst (and LaTeX) writer doesn't do anything with this attribute.

That makes sense.

jgm commented 4 weeks ago

If you want to email me the epub, I can look into it further. At least the identifier on the heading should work.

Enivex commented 4 weeks ago

> If you want to email me the epub, I can look into it further. At least the identifier on the heading should work.

I did email you the epub, as I described above. Though it may have vanished into the void because of the large (11 MB) attachment.

jgm commented 4 weeks ago

ok, found it in junk folder.

jgm commented 4 weeks ago

OK, here's one example.


error: label `<part0111.html#pre>` does not exist in the document
    ┌─ twok.typ:328:47
    │  
328 │   = <part0007.html_page14><part0007.html_page15>#link(label("part0111.html#pre"))[#strong[PRELUDE TO \

So I look in part0111.html in the epub, and here's where the anchor is:

<p class="toc1" id="pre"><a href="part0007.html#pre" class="calibre1">Prelude to the Stormlight Archive</a></p>

Pandoc doesn't put attributes on Para elements, so this identifier was lost in the parsing stage.

The other cases I've looked at are like this. Links to headings, tables, figures, divs, and spans should work fine. Anything else pandoc is going to drop, but those are the lion's share of real uses.

Probably this can be closed.
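To see how many links in a given book fall into the "dropped" category, one option is to dump pandoc's JSON AST (pandoc -f epub -t json book.epub) and compare link targets against the identifiers that survived parsing. The sketch below assumes the usual JSON layout, where each element is {"t": type, "c": content}, Header keeps its attributes in the second content field, and the other attribute-carrying elements keep them in the first. It also assumes internal link targets start with "#". The function names are illustrative, not part of pandoc.

```python
# Element types whose first content field is an Attr ([id, classes, key-values]).
# Header is handled separately because its Attr comes after the level.
ATTR_FIRST = {"Div", "Span", "Link", "Image", "Code", "CodeBlock", "Table", "Figure"}


def walk(node):
    """Yield every tagged element (a dict with a "t" key) in the JSON AST."""
    if isinstance(node, dict):
        if "t" in node:
            yield node
        for value in node.values():
            yield from walk(value)
    elif isinstance(node, list):
        for item in node:
            yield from walk(item)


def dangling_internal_links(doc):
    """Return sorted internal link targets that match no identifier in doc."""
    identifiers = set()
    targets = []
    for element in walk(doc):
        kind, content = element["t"], element.get("c")
        if kind == "Header":
            identifiers.add(content[1][0])
        elif kind in ATTR_FIRST:
            identifiers.add(content[0][0])
        if kind == "Link" and content[2][0].startswith("#"):
            targets.append(content[2][0][1:])  # drop the leading "#"
    identifiers.discard("")  # elements without an id carry an empty string
    return sorted({t for t in targets if t not in identifiers})
```

Loading the JSON with json.load and passing it to dangling_internal_links gives the list of targets that will produce "label does not exist" errors in writers that check labels, such as Typst.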

jgm commented 4 weeks ago

For your immediate purposes, it might work to use a Lua filter to remove all internal links, so you don't get errors in typst.
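A rough sketch of that idea, written against pandoc's JSON AST in Python instead of as a Lua filter (a swapped-in technique, not what was suggested verbatim). It assumes internal links are the Link elements whose target starts with "#", and it replaces each one with its own link text so the content stays readable. unlink_internal and main are illustrative names.

```python
import json
import sys


def unlink_internal(node):
    """Return a copy of a JSON AST node with internal Links replaced by their text."""
    if isinstance(node, list):
        result = []
        for item in node:
            if (
                isinstance(item, dict)
                and item.get("t") == "Link"
                and item["c"][2][0].startswith("#")
            ):
                # Link content is [attr, inlines, [url, title]];
                # splice the inlines into the parent list in place of the link.
                result.extend(unlink_internal(item["c"][1]))
            else:
                result.append(unlink_internal(item))
        return result
    if isinstance(node, dict):
        return {key: unlink_internal(value) for key, value in node.items()}
    return node


def main():
    # Reads a pandoc JSON document on stdin and writes the rewritten one to stdout.
    doc = json.load(sys.stdin.buffer)
    json.dump(unlink_internal(doc), sys.stdout)
```

With a final main() call added to the file (saved as, say, unlink.py), it could sit in a pipeline such as pandoc -f epub -t json book.epub | python unlink.py | pandoc -f json -t typst -o book.typ. Note that this drops working internal links too, not only the dangling ones.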

Enivex commented 4 weeks ago

> Anything else pandoc is going to drop, but those are the lion's share of real uses.

> Probably this can be closed.

Is there a particular reason why it can't keep them? In this case it completely breaks the TOC.

> For your immediate purposes, it might work to use a Lua filter to remove all internal links, so you don't get errors in typst.

That's probably what I'm going to end up doing short term. (Having working links is useful for navigation though.)

jgm commented 4 weeks ago

A sensible TOC has links to identifiers on headings (e.g. h2 in HTML). These should work fine in a pandoc conversion. This particular document has links all over the place -- to p elements, a elements, etc.

Pandoc has no place to put an id attribute on p, because its Para element has no slot for attributes.