jgm / pandoc

Universal markup converter
https://pandoc.org

Inline links seem broken when converting de-drm epub to org file #8470

Closed tillydray closed 6 months ago

tillydray commented 1 year ago

First off, I love pandoc, and every time I read its documentation I learn new tricks it can do :D Thank you for all your hard work!

Explain the problem

I realize this may not be a pandoc issue: several pieces of software are involved in going from a DRMed epub file to an org file, and any one of them may be causing the problem. I suspect pandoc because the epub file looks and works fine in Apple Books, Emacs, and Calibre, so I believe the input file is fine. The org file also looks fine but does not work fine, i.e., clicking on links doesn't work. So it seems to me, perhaps naively, that pandoc isn't quite creating the org file correctly.

I may be missing a command line flag or something obvious, but I spent a couple of hours reading the docs and trying to figure it out so it isn't obvious to me 😅

What Happened

In Emacs, when I press RET on a link, I get this error: `No match for custom ID: hcp-nrsvuebib-0010.xhtml#otpt`.

What Did I Expect to Happen

I expected to jump to the link's target.

Inputs

My input file is NRSVue, Holy Bible. If you need a copy to reproduce this, let me know and I can provide one. I used Calibre with the DeACSM and DeDRM plugins to remove the DRM.

Command Line Inputs

Below are various commands I used, all producing the same issue, copied and pasted from my terminal. I was grasping at straws to try to solve the problem, and read through nearly all of the man pages but didn't see anything that might help.

Minimal Output Example

```org-mode
[[#hcp-nrsvuebib-0005.xhtml#otbooks][Old Testament Table of Contents]]

--------------

[[#hcp-nrsvuebib-0010.xhtml#otpt][OLD TESTAMENT]]

--------------

[[#hcp-nrsvuebib-0011.xhtml#bk01][Genesis]]

[[#hcp-nrsvuebib-0011.xhtml#ch01001][1]] |

[[#hcp-nrsvuebib-0010.xhtml#ot][The Old Testament]]

| [[#hcp-nrsvuebib-0011.xhtml#bk01][Genesis]]       | [[#hcp-nrsvuebib-0011.xhtml#bk01][Gen]]     |

<<hcp-nrsvuebib-0010.xhtml>>

<<hcp-nrsvuebib-0010.xhtml#otpt>>

<<hcp-nrsvuebib-0010.xhtml#ot>>
[[#hcp-nrsvuebib-0005.xhtml#otbooks][The Old Testament]]

<<hcp-nrsvuebib-0011.xhtml>>

<<hcp-nrsvuebib-0011.xhtml#bk01>>

<<hcp-nrsvuebib-0011.xhtml#ch01001>>
[[#hcp-nrsvuebib-0005.xhtml#rbk01][Genesis]]

[[#hcp-nrsvuebib-0005.xhtml#rbk01][Genesis 1]]

Six Days of Creation and the Sabbath

1When God began to create[[#hcp-nrsvuebib-0013.xhtml#fn01001001-1][a]] the heavens and the earth, 2the earth was complete

<<hcp-nrsvuebib-0011.xhtml#ch01002>>
```

Software versions

pandoc 2.19.2
Compiled with pandoc-types 1.22.2.1, texmath 0.12.5.2, skylighting 0.13,
citeproc 0.8.0.1, ipynb 0.2, hslua 2.2.1
Scripting engine: Lua 5.4
User data directory: ~/.local/share/pandoc
Copyright (C) 2006-2022 John MacFarlane. Web:  https://pandoc.org
This is free software; see the source for copying conditions. There is no
warranty, not even for merchantability or fitness for a particular purpose.

jgm commented 1 year ago

I think the issue is really in the EPUB reader and it can be shown with this simple example:

% pandoc -o my.epub
# One

link to [twosub](#twosub)

# Two

ok

## twosub

ok
[WARNING] This document format requires a nonempty <title> element.
  Defaulting to '-' as the title.
  To specify a title, use 'title' in metadata or --metadata title="...".

% pandoc my.epub -t html
<p><span id="ch001.xhtml"></span></p>
<section id="ch001.xhtml#one" class="level1" data-number="1">
<h1 data-number="1">One</h1>
<p>link to <a href="#ch002.xhtml#twosub">twosub</a></p>
</section>
<p><span id="ch002.xhtml"></span></p>
<section id="ch002.xhtml#two" class="level1" data-number="2">
<h1 data-number="2">Two</h1>
<p>ok</p>
<section id="ch002.xhtml#twosub" class="level2" data-number="2.1">
<h2 data-number="2.1">twosub</h2>
<p>ok</p>
</section>
</section>

Here we get things like a reference to #ch002.xhtml#twosub. The fragment shouldn't contain the character #. I don't know if that's the only issue for org, but it may be one issue. You could try changing

[[#hcp-nrsvuebib-0005.xhtml#otbooks][Old Testament Table of Contents]]

in your org output to

[[#hcp-nrsvuebib-0005.xhtml_otbooks][Old Testament Table of Contents]]

and changing

<<hcp-nrsvuebib-0005.xhtml#otbooks>>

to

<<hcp-nrsvuebib-0005.xhtml_otbooks>>

and see if that fixes the link. That would be good for me to know.
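The manual edit suggested above can be applied to a whole file with a short script. This is only a sketch of that find-and-replace and is not part of pandoc; the function name `replace_inner_hash` is hypothetical, and it assumes every affected identifier has the `<file>.xhtml#<fragment>` shape seen in this thread.

```python
import re

# Assumption: problematic identifiers look like "<file>.xhtml#<fragment>",
# as in the examples quoted in this thread.
INNER_HASH = re.compile(r"(\.xhtml)#")


def replace_inner_hash(org_text: str) -> str:
    """Replace the '#' that directly follows '.xhtml' with '_'.

    The leading '#' of a link such as [[#file.xhtml#frag][...]] is left
    untouched; only the inner one is rewritten, in links and <<targets>> alike.
    """
    return INNER_HASH.sub(r"\1_", org_text)


if __name__ == "__main__":
    sample = (
        "[[#hcp-nrsvuebib-0005.xhtml#otbooks][Old Testament Table of Contents]]\n"
        "<<hcp-nrsvuebib-0005.xhtml#otbooks>>\n"
    )
    print(replace_inner_hash(sample), end="")
```

Because links and targets are rewritten by the same rule, they stay consistent with each other after the substitution.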

tillydray commented 1 year ago

I made the changes to the example output, but when clicking on the link I get this error: `No match for custom ID: hcp-nrsvuebib-0005.xhtml_otbooks`. In case it's relevant, I also re-generated the org file from the original epub, made the changes you suggested, and still got the same error :( As a sanity check, I've pasted the changed example output below, in case I've done it wrong somehow:

[[#hcp-nrsvuebib-0005.xhtml_otbooks][Old Testament Table of Contents]]

--------------

[[#hcp-nrsvuebib-0010.xhtml_otpt][OLD TESTAMENT]]

--------------

[[#hcp-nrsvuebib-0011.xhtml_bk01][Genesis]]

[[#hcp-nrsvuebib-0011.xhtml_ch01001][1]] |

[[#hcp-nrsvuebib-0010.xhtml_ot][The Old Testament]]

| [[#hcp-nrsvuebib-0011.xhtml_bk01][Genesis]]       | [[#hcp-nrsvuebib-0011.xhtml_bk01][Gen]]     |

<<hcp-nrsvuebib-0010.xhtml>>

<<hcp-nrsvuebib-0010.xhtml_otpt>>

<<hcp-nrsvuebib-0010.xhtml_ot>>
[[#hcp-nrsvuebib-0005.xhtml_otbooks][The Old Testament]]

<<hcp-nrsvuebib-0011.xhtml>>

<<hcp-nrsvuebib-0011.xhtml_bk01>>

<<hcp-nrsvuebib-0011.xhtml_ch01001>>
[[#hcp-nrsvuebib-0005.xhtml_rbk01][Genesis]]

[[#hcp-nrsvuebib-0005.xhtml_rbk01][Genesis 1]]

Six Days of Creation and the Sabbath

1When God began to create[[#hcp-nrsvuebib-0013.xhtml_fn01001001-1][a]] the heavens and the earth, 2the earth was complete

<<hcp-nrsvuebib-0011.xhtml_ch01002>>

tillydray commented 1 year ago

The problem is that `link to [[#ch002.xhtml#twosub][twosub]]` should be `link to [[ch002.xhtml_twosub][twosub]]`. So remove the first `#` and replace the internal `#` with `_`. Once I do that, it works as expected.
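A likely explanation, as far as I understand Org's link syntax, is that a destination starting with `#` is looked up as a `CUSTOM_ID` property, while a destination without it can match a `<<target>>`. Under that assumption, the rewrite described above can be sketched as follows. The function name `fix_org_links` and both regular expressions are illustrative only and assume the `<file>.xhtml#<fragment>` shape seen in this thread.

```python
import re

# Destination part of an internal Org link: "[[#file.xhtml#frag]".
# Assumption: destinations follow the "#<file>.xhtml#<fragment>" shape.
LINK_DEST = re.compile(r"\[\[#([^\]#]+\.xhtml)#([^\]]+)\]")
# Dedicated target: "<<file.xhtml#frag>>".
TARGET = re.compile(r"<<([^>#]+\.xhtml)#([^>]+)>>")


def fix_org_links(org_text: str) -> str:
    """Drop the leading '#' of link destinations and turn the inner '#' into '_'.

    Targets get the same inner replacement so that links and targets still match.
    Links or targets without an inner '#' are left unchanged.
    """
    text = LINK_DEST.sub(r"[[\1_\2]", org_text)
    return TARGET.sub(r"<<\1_\2>>", text)


if __name__ == "__main__":
    print(fix_org_links("link to [[#ch002.xhtml#twosub][twosub]]"))
    print(fix_org_links("<<ch002.xhtml#twosub>>"))
```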

tillydray commented 1 year ago

I did two naive find-replaces, `:%s/#hcp/hcp/g` and `:%s/xhtml#/xhtml_/g`, and that fixed some links but not all.

`Messiah,[[hcp-nrsvuebib-0137.xhtml_fn40001001-3][c]] the son of David,` is supposed to jump to `[[hcp-nrsvuebib-0136.xhtml_rfn40001001-3][c]] 1.1 Or /Jesus Christ/` but doesn't. When I reconcile their differences, it still doesn't jump. I get this error: `No match for fuzzy expression: hcp-nrsvuebib-0137.xhtml_fn40001001-3`
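The "No match for fuzzy expression" error suggests that Org found no `<<target>>` with that name. A quick way to see which remaining links are affected is to list every link destination that has no matching target in the file. This is a rough diagnostic sketch, not an Org parser; the name `unresolved_links` is hypothetical, and it only understands `[[dest][label]]` links and `<<name>>` targets.

```python
import re
from typing import List

LINK = re.compile(r"\[\[([^\]\[]+)\]\[")   # destination of "[[dest][label]]"
TARGET = re.compile(r"<<([^<>\n]+)>>")      # name inside "<<name>>"


def unresolved_links(org_text: str) -> List[str]:
    """Return link destinations that have no matching <<target>> in the text.

    A leading '#' on the destination is ignored for the comparison, and
    destinations that look like external URLs are skipped.
    """
    targets = set(TARGET.findall(org_text))
    missing: List[str] = []
    for dest in LINK.findall(org_text):
        if "://" in dest:
            continue  # external link, not an internal target
        name = dest.lstrip("#")
        if name not in targets and name not in missing:
            missing.append(name)
    return missing


if __name__ == "__main__":
    sample = (
        "[[a.xhtml_x][c]]\n"
        "<<a.xhtml_x>>\n"
        "[[b.xhtml_y][d]]\n"
    )
    print(unresolved_links(sample))
```

Any name it reports is a destination for which the file contains no `<<...>>` target at all, which would point to the target being missing or named differently in the converted output.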

jgm commented 6 months ago

I believe the issues here have been fixed by now (esp. the misplaced #). Closing this issue as stale.

Enivex commented 1 month ago

> I believe the issues here have been fixed by now (esp. the misplaced #). Closing this issue as stale.

In pandoc 3.4 I'm still getting broken links from EPUBs. When converting to a single HTML file, for example, I get links like

<p><a href="#part0100.html#c74" class="calibre1">74: GHOSTBLOOD</a></p>

that are all broken. (They are broken regardless of output format.)
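To check how widespread the pattern is in a converted HTML file, the suspicious links can be listed with Python's standard `html.parser`. This is a minimal sketch; the names `DoubleHashFinder` and `find_double_hash_hrefs` are hypothetical. It only reports `href` values that start with `#` and contain a second `#`, and it does not check whether a matching `id` exists.

```python
from html.parser import HTMLParser
from typing import List


class DoubleHashFinder(HTMLParser):
    """Collect href values of <a> tags that look like '#file#fragment'."""

    def __init__(self) -> None:
        super().__init__()
        self.suspect: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Internal link whose fragment itself contains another '#'.
            if name == "href" and value and value.startswith("#") and "#" in value[1:]:
                self.suspect.append(value)


def find_double_hash_hrefs(html_text: str) -> List[str]:
    finder = DoubleHashFinder()
    finder.feed(html_text)
    finder.close()
    return finder.suspect


if __name__ == "__main__":
    sample = '<p><a href="#part0100.html#c74" class="calibre1">74: GHOSTBLOOD</a></p>'
    print(find_double_hash_hrefs(sample))
```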

(I also posted here https://github.com/jgm/pandoc/issues/6384#issuecomment-2366784449 )

jgm commented 1 month ago

@Enivex if you have a reproducible bug, please open a new issue with full information needed to reproduce it.

Enivex commented 1 month ago

> @Enivex if you have a reproducible bug, please open a new issue with full information needed to reproduce it.

Sure, though I can't upload this particular EPUB for legal reasons. I'll try to see if I can reproduce it from some freely available one.