jgm / pandoc

Universal markup converter
https://pandoc.org
epub noterefs across files not properly converted #5531

Open alibou99 opened 5 years ago

alibou99 commented 5 years ago

PS C:\files\dev\Pandoc> pandoc --version
pandoc.exe 2.7.2
Compiled with pandoc-types 1.17.5.4, texmath 0.11.2.2, skylighting 0.7.7

Issue: when I convert EPUB files to md, docx, or html, some of the text is missing. It happens wherever there is a note call. Here is an example:

<p class="txt_courant_justif"><span epub:type="pagebreak" id="page_36" title="36"/>
Text1
<a class="apnb" epub:type="noteref" id="ap_ntb-002-1" href="p1chap2.xhtml#ntb-002">1</a>. 
Text2
<a class="apnb" epub:type="noteref" id="ap_ntb-003-1" href="p1chap2.xhtml#ntb-003">2</a>
Text3
</p>
<p class="txt_courant_justif">
Text4
</p>

In this example, Text2 and Text3 are missing from the output.

jgm commented 5 years ago

Simpler way to reproduce this:

% pandoc -f html+epub_html_exts -t native
<p class="txt_courant_justif"><span epub:type="pagebreak" id="page_36" title="36"/>
Text1
<a class="apnb" epub:type="noteref" id="ap_ntb-002-1" href="p1chap2.xhtml#ntb-002">1</a>. 
Text2
<a class="apnb" epub:type="noteref" id="ap_ntb-003-1" href="p1chap2.xhtml#ntb-003">2</a>
Text3
</p>
<p class="txt_courant_justif">
Text4
</p>
^D
[Para [Span ("page_36",[],[("type","pagebreak"),("title","36")]) [],SoftBreak,Str "Text1",SoftBreak,Str ""]
,Para [Str "Text4"]]

alibou99 commented 5 years ago

I'm new to Git, so excuse me if I don't understand everything, but what does this mean?

alibou99 commented 5 years ago

My source document is an EPUB 3.0 file, not an HTML file.

jgm commented 5 years ago

This gives a way to reproduce the underlying issue in a simpler way, without actually producing an epub (because the epub reader uses the html reader plus a special extension under the hood). It's really a "note to self" for me to diagnose this.

alibou99 commented 5 years ago

Thank you very much. I just tested the conversion with Calibre: no problem, I get the whole text. However, with Calibre the notes are not recognized as such.

jgm commented 5 years ago

Yes, pandoc is stumbling on notes that refer to another file, such as href="p1chap2.xhtml#ntb-002".
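To see which note calls in an EPUB content file have this shape, something like the following could be used. It is only a diagnostic sketch in Python using the standard library, not anything pandoc does internally; the names `NoterefScanner` and `scan_noterefs` are made up for illustration. It lists every `<a epub:type="noteref">` and flags those whose href carries a file part before the `#`.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag


class NoterefScanner(HTMLParser):
    """Collect (href, names_file) for every <a epub:type="noteref">."""

    def __init__(self):
        super().__init__()
        self.noterefs = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        types = (attr.get("epub:type") or "").split()
        if tag != "a" or "noteref" not in types:
            return
        href = attr.get("href") or ""
        file_part, _fragment = urldefrag(href)
        # An empty file part ("#fn1") is a bare fragment; a non-empty one
        # ("p1chap2.xhtml#ntb-002") is the form discussed in this issue.
        self.noterefs.append((href, file_part != ""))


def scan_noterefs(xhtml):
    """Return a list of (href, names_file) pairs found in the XHTML source."""
    scanner = NoterefScanner()
    scanner.feed(xhtml)
    scanner.close()
    return scanner.noterefs
```

For the paragraph in the original report, both noterefs would come back flagged, while a bare `href="#fn1"` would not.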

jgm commented 5 years ago

With the commit I just pushed, we now get:

[Para [Span ("page_36",[],[("type","pagebreak"),("title","36")]) [],SoftBreak,Str "Text1",SoftBreak,Link ("ap_ntb-002-1",["apnb"],[]) [Str "1"] ("p1chap2.xhtml#ntb-002",""),Str ".",SoftBreak,Str "Text2",SoftBreak,Link ("ap_ntb-003-1",["apnb"],[]) [Str "2"] ("p1chap2.xhtml#ntb-003",""),SoftBreak,Str "Text3"]
,Para [Str "Text4"]]

which is an improvement. The missing text is no longer missing. However, the noterefs are being parsed as links rather than proper noterefs, so there is still work to do.

alibou99 commented 5 years ago

I work for a non-profit organization. We prepare books for digital braille so that they can be used by blind readers. For this, our pivot format is docx or RTF. For the moment Pandoc at least does the job, but because of this missing-text problem I am reviewing the whole procedure with a view to switching to another tool. I hope we can find a quick solution.

jgm commented 5 years ago

By tonight there should be a nightly available in pandoc-nightlies; this will at least solve the missing text problem.

alibou99 commented 5 years ago

Very good news. How can I get this corrected version as quickly as possible?

alibou99 commented 5 years ago

I installed pandoc via Chocolatey.

alibou99 commented 5 years ago

This is my first post on GitHub. Is there a specific command to update Pandoc on my computer and take advantage of the fix? Thank you.

jgm commented 5 years ago

Here's a binary of the latest Windows build: https://ci.appveyor.com/project/jgm/pandoc/build/job/gy92q5at64l3e68q/artifacts

alibou99 commented 5 years ago

Thank you very much, it works very well and I'm no longer missing text. Now I'm trying to understand the footnote problem. Here are two examples. With the first, the EPUB-to-docx conversion produces a Word document that recognizes the footnotes; with the second, it does not.

example1: good one

<p class="nonindentb">Text1<a epub:type="noteref" class="noteref" id="fn-1" href="#fn1">1</a> Text2</p>

<div epub:type="footnote" id="fn1">
<p class="noindent0"><a class="link" href="#fn-1"><span style="color: #000000;">1</span></a>. Text...</p>
</div>

example2 : bad one

<p class="txt_courant_justif"><span epub:type="pagebreak" id="page_36" title="36"/>
Text1
<a class="apnb" epub:type="noteref" id="ap_ntb-003-1" href="p1chap2.xhtml#ntb-003">2</a>
Text2
</p>
<p class="txt_courant_justif">
Text4
</p>

<section class="defnotes" epub:type="footnotes">
<!--note--><aside class="ntb" epub:type="footnote" id="ntb-003">
<p class="txt_justif"><a href="p1chap2.xhtml#ap_ntb-003-1">2</a>. Text...</p></aside>
<!--note--></section></section>

I greatly appreciate your help

jgm commented 5 years ago

Yes, the problem is that pandoc currently will only pick up footnotes that are defined in the same file. In your second example the note is in a different file.
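For illustration, here is a rough Python sketch of what resolving noterefs across files could look like: index every `epub:type="footnote"` element by (file name, id), then look up each noteref href relative to the file it appears in. This is not pandoc's implementation or a proposed patch. All function and class names are hypothetical, it assumes well-formed XHTML (every tag closed), and it ignores percent-encoding in hrefs.

```python
import posixpath
from html.parser import HTMLParser
from urllib.parse import urldefrag


class FootnoteCollector(HTMLParser):
    """Gather the text of each element marked epub:type="footnote", keyed by id."""

    def __init__(self):
        super().__init__()
        self.notes = {}
        self._id = None      # id of the footnote being captured, if any
        self._depth = 0      # nesting depth inside that footnote
        self._text = []

    def handle_starttag(self, tag, attrs):
        if self._id is not None:
            self._depth += 1
            return
        attr = dict(attrs)
        types = (attr.get("epub:type") or "").split()
        if "footnote" in types and attr.get("id"):
            self._id, self._depth, self._text = attr["id"], 1, []

    def handle_endtag(self, tag):
        if self._id is None:
            return
        self._depth -= 1
        if self._depth == 0:
            # Collapse whitespace so the stored text is a single line.
            self.notes[self._id] = " ".join("".join(self._text).split())
            self._id = None

    def handle_data(self, data):
        if self._id is not None:
            self._text.append(data)


def build_note_index(files):
    """files maps file name -> XHTML source; returns {(file, id): note text}."""
    index = {}
    for name, source in files.items():
        collector = FootnoteCollector()
        collector.feed(source)
        collector.close()
        for note_id, text in collector.notes.items():
            index[(name, note_id)] = text
    return index


def resolve_noteref(index, current_file, href):
    """Return the note text for href as seen from current_file, or None."""
    target, fragment = urldefrag(href)
    if target:
        base = posixpath.dirname(current_file)
        name = posixpath.normpath(posixpath.join(base, target))
    else:
        name = current_file  # bare "#id": same document
    return index.get((name, fragment))
```

With an index built over all content files, `resolve_noteref(index, "p1chap1.xhtml", "p1chap2.xhtml#ntb-003")` would return the note text when `p1chap2.xhtml` defines `ntb-003` (here `p1chap1.xhtml` is a made-up name for the file containing the note call).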