jgm / pandoc

Universal markup converter
https://pandoc.org

Pandoc produces 0 length PDF from docx #8970

Open Moulick opened 1 year ago

Moulick commented 1 year ago

Explain the problem.

pandoc fails to convert a docx file to PDF. It outputs an empty PDF file from this particular docx. File attached: new_resume_001.docx

Pandoc version? What version of pandoc are you using, on what OS? (If it's not the latest release, please try with the latest release before reporting the issue.)

OS: MacOS 13.5 (22G74)

❯ pandoc --version
pandoc 3.1.6
Features: +server +lua
Scripting engine: Lua 5.4

I have tried with both BasicTeX and MacTeX (15 March 2023, 5.51 GB). Same result with both.

❯ pdflatex --version
pdfTeX 3.141592653-2.6-1.40.25 (TeX Live 2023)
kpathsea version 6.3.5
Copyright 2023 Han The Thanh (pdfTeX) et al.
There is NO warranty.  Redistribution of this software is
covered by the terms of both the pdfTeX copyright and
the Lesser GNU General Public License.
For more information about these matters, see the file
named COPYING and the pdfTeX source.
Primary author of pdfTeX: Han The Thanh (pdfTeX) et al.
Compiled with libpng 1.6.39; using libpng 1.6.39
Compiled with zlib 1.2.13; using zlib 1.2.13
Compiled with xpdf version 4.04

jgm commented 1 year ago

I can confirm that pandoc parses this docx into an empty document. Will need to examine it more closely to see why.

jgm commented 1 year ago

I suspect this is actually just another manifestation of #3086. In your docx, the textual content comes under <v:textbox> elements. They are also under <mc:AlternateContent> / <mc:Fallback>, so #5394 may also be relevant.

jgm commented 1 year ago

Sketch of the xml structure:

<w:p>
  <w:r>
    <mc:AlternateContent>
      <mc:Choice Requires="wps">
        <w:drawing>
      <mc:Fallback>
        <w:pict>
        ...
        <v:textbox>
          <w:txbxContent>
            <w:p>
              ...textual content here...
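
Until the reader handles this structure, one possible workaround is to pull the textbox text out of the docx directly. The following is only a rough sketch, not anything pandoc does itself: the function name `textbox_paragraphs` is made up for illustration, it assumes the document uses the transitional WordprocessingML and markup-compatibility namespaces, and it assumes the text sits under <mc:Fallback> / <w:txbxContent> as in the sketch above. It ignores all formatting, and nested textboxes would be counted more than once.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace URIs for transitional OOXML (assumed; strict-mode files use different ones).
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
MC_NS = "http://schemas.openxmlformats.org/markup-compatibility/2006"


def textbox_paragraphs(docx_source):
    """Return the plain text of each paragraph inside w:txbxContent under mc:Fallback.

    docx_source may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    # Restrict the search to mc:Fallback so that text which may also appear
    # under mc:Choice is not collected twice.
    for fallback in root.iter(f"{{{MC_NS}}}Fallback"):
        for box in fallback.iter(f"{{{W_NS}}}txbxContent"):
            for para in box.iter(f"{{{W_NS}}}p"):
                # Concatenate all w:t runs of the paragraph.
                text = "".join(t.text or "" for t in para.iter(f"{{{W_NS}}}t"))
                if text:
                    paragraphs.append(text)
    return paragraphs
```

Calling `textbox_paragraphs("new_resume_001.docx")` should return a list of strings, one per non-empty textbox paragraph, which could then be joined with blank lines and fed to pandoc as plain text or Markdown.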