jgm / pandoc

Universal markup converter
https://pandoc.org

Docx reader: Handle multi-paragraph complex fields #4678

Open Fnorxblohon opened 6 years ago

Fnorxblohon commented 6 years ago

I ran into a problem with a multi-paragraph field in a Word document. The docx reader discards the paragraph breaks inside the field, so all of the field's paragraphs run together into one.

The problem occurs with bibliographies, indices and references to locations which yield several paragraphs. These are single huge fields beginning with the headline and including all bibliography entries, each in one paragraph.

In the Office Open XML source (from which I removed some irrelevant properties and bookmarks), the long field begins with a paragraph containing the "begin" w:fldChar.

<w:p ...>
  <w:pPr>...</w:pPr>
  <w:r><w:fldChar w:fldCharType="begin"/></w:r>

Next, an instrText says that this is a BIBLIOGRAPHY. (As an aside: I am actually working with Citavi rather than Word bibliographies, but for this example I stuck to Microsoft's standard offering. Other tools produce similar markup.)

  <w:r><w:instrText>BIBLIOGRAPHY</w:instrText></w:r>

Next, the "separate" field char introduces the cached presentation:

  <w:r><w:fldChar w:fldCharType="separate"/></w:r>

Finally, there is what the user sees of this paragraph: the first reference.

  <w:r ...><w:rPr>...</w:rPr><w:t xml:space="preserve">Loeliger, J. (2009).</w:t></w:r>
  ...
</w:p>

Note how this paragraph ends without an "end" field char. More paragraphs contain more references.

Finally, there is the "end" fldChar:

<w:p ...>
  <w:r><w:rPr>...</w:rPr><w:fldChar w:fldCharType="end"/></w:r>
</w:p>
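
To make the structure above concrete, here is a minimal sketch that counts how many fields are still open at the end of each paragraph. It is written in Python with the standard library rather than in pandoc's Haskell, and it is not taken from pandoc or from the patches mentioned in this thread. The function names and the sample document are hypothetical, and the sample adds a second reference paragraph that the excerpt above only hints at.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, bound to the "w" prefix in document.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def fld_char_types(paragraph):
    """Return the w:fldCharType values found in one w:p, in document order."""
    return [fc.get(f"{{{W}}}fldCharType") for fc in paragraph.iter(f"{{{W}}}fldChar")]

def open_fields_after_each_paragraph(body):
    """For each w:p child of body, report the field nesting depth at its end."""
    depth = 0
    depths = []
    for p in body.findall(f"{{{W}}}p"):
        for kind in fld_char_types(p):
            if kind == "begin":
                depth += 1
            elif kind == "end":
                depth -= 1
        depths.append(depth)
    return depths

# Hypothetical sample modelled on the excerpt above.
SAMPLE = f"""<w:body xmlns:w="{W}">
  <w:p>
    <w:r><w:fldChar w:fldCharType="begin"/></w:r>
    <w:r><w:instrText>BIBLIOGRAPHY</w:instrText></w:r>
    <w:r><w:fldChar w:fldCharType="separate"/></w:r>
    <w:r><w:t xml:space="preserve">Loeliger, J. (2009).</w:t></w:r>
  </w:p>
  <w:p><w:r><w:t>Oram, A., &amp; Talbott, S. (1993).</w:t></w:r></w:p>
  <w:p><w:r><w:fldChar w:fldCharType="end"/></w:r></w:p>
</w:body>"""

depths = open_fields_after_each_paragraph(ET.fromstring(SAMPLE))
print(depths)
```

A non-zero depth at the end of a paragraph means the field continues into the next paragraph, which is the case a reader has to handle here.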

Currently, this results in one paragraph of the form

Para [Str "Loeliger,",Space,Str "J.",Space,Str "(2009).",Space,Emph [Str "Version",Space,Str "Control",Space,Str "with",Space,Str "Git."],Space,Str "Sebastopol:",Space,Str "O'Reilly.Oram,",Space,Str "A.,",Space,Str "&",Space,Str "Talbott,",Space,Str "S.",Space,Str "(1993).",Space,Emph [Str "Managing",Space,Str "Projects",Space,Str "with",Space,Str "make."],Space,Str "Sebastopol:",Space,Str "O'Reilly."]

According to http://officeopenxml.com/WPfields.php, this structure of fldChar elements is described in ECMA-376, 3rd Edition (June, 2011), Fundamentals and Markup Language Reference § 17.16.18. It mentions multi-paragraph fields as one use of this structure.

src/Text/Pandoc/Readers/Docx/Parse.hs handles paragraph content in lines 756ff. It processes run, bookmark, oMath, comment and hyperlink elements and apparently discards all unknown tags.

A simplistic solution would keep the content-gathering stage across paragraph ends. I have posted some crude code that implements this on pandoc-discuss in https://groups.google.com/forum/#!topic/pandoc-discuss/ItyWCB9HgKI. That triggers a bug in handling complex fields without cached representations, https://groups.google.com/forum/#!topic/pandoc-discuss/k4CbdmNDsew.
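
The idea of keeping the field state across paragraph ends can be sketched as follows. This is an illustration in Python, not the posted code and not pandoc's implementation. The helper names (`q`, `paragraphs_with_field_results`) and the sample input are hypothetical, and the sketch only collects plain text, ignoring formatting.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def q(name):
    """Qualify a local name with the WordprocessingML namespace."""
    return f"{{{W}}}{name}"

def paragraphs_with_field_results(body):
    """Collect visible text per paragraph, keeping field state across paragraphs."""
    stack = []  # one entry per open field: "instr" or "result"
    out = []
    for p in body.findall(q("p")):
        pieces = []
        for run in p.iter(q("r")):
            for child in run:
                if child.tag == q("fldChar"):
                    kind = child.get(q("fldCharType"))
                    if kind == "begin":
                        stack.append("instr")
                    elif kind == "separate" and stack:
                        stack[-1] = "result"
                    elif kind == "end" and stack:
                        stack.pop()
                elif child.tag == q("t"):
                    # Text is visible outside any field, or when every
                    # enclosing field is in its cached-result part.
                    if all(state == "result" for state in stack):
                        pieces.append(child.text or "")
        text = "".join(pieces)
        if text:  # skip paragraphs that only carry field characters
            out.append(text)
    return out

# Hypothetical sample: one field spanning three paragraphs.
SAMPLE = f"""<w:body xmlns:w="{W}">
  <w:p>
    <w:r><w:fldChar w:fldCharType="begin"/></w:r>
    <w:r><w:instrText>BIBLIOGRAPHY</w:instrText></w:r>
    <w:r><w:fldChar w:fldCharType="separate"/></w:r>
    <w:r><w:t xml:space="preserve">Loeliger, J. (2009).</w:t></w:r>
  </w:p>
  <w:p><w:r><w:t>Oram, A., &amp; Talbott, S. (1993).</w:t></w:r></w:p>
  <w:p><w:r><w:fldChar w:fldCharType="end"/></w:r></w:p>
</w:body>"""

for line in paragraphs_with_field_results(ET.fromstring(SAMPLE)):
    print(line)
```

Because the stack lives outside the per-paragraph loop, each reference stays in its own paragraph instead of being merged into the paragraph that opened the field. A field that has no "separate" character never reaches the "result" state in this sketch, so such fields would need separate treatment.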

Here is a small test case: complex-fields-bibliography.docx. The expected result is:

[Para [Str "Make:",Space,Str "(",Space,Str "(Oram",Space,Str "&",Space,Str "Talbott,",Space,Str "1993)),",Space,Str "git:",Space,Str "(",Space,Str "(Loeliger,",Space,Str "2009))"]
,Header 1 ("literaturverzeichnis",[],[]) [Str "Literaturverzeichnis"]
,Para [Str "Loeliger,",Space,Str "J.",Space,Str "(2009).",Space,Emph [Str "Version",Space,Str "Control",Space,Str "with",Space,Str "Git."],Space,Str "Sebastopol:",Space,Str "O'Reilly."]
,Para [Str "Oram,",Space,Str "A.,",Space,Str "&",Space,Str "Talbott,",Space,Str "S.",Space,Str "(1993).",Space,Emph [Str "Managing",Space,Str "Projects",Space,Str "with",Space,Str "make."],Space,Str "Sebastopol:",Space,Str "O'Reilly."]]

I am looking at the current pandoc HEAD, where I have only added the attached file as a test case. Its version is:

pandoc 2.2.1
Compiled with pandoc-types 1.17.4.2, texmath 0.11, skylighting 0.7.0.2
Default user data directory: /home/fxg/.pandoc
Copyright (C) 2006-2018 John MacFarlane

Fnorxblohon commented 6 years ago

I have a solution for this in two patches: the first introduces the test cases, the second implements the solution.

0001-Docx-reader-Added-Tests-for-long-complex-fields.txt
0002-Docx-reader-Handling-of-long-complex-fields.txt

jkr commented 6 years ago

Thanks for getting to the bottom of this. I'm at a conference right now, and this is a pretty thorny, spec-heavy part of the code, so I might not be able to get to it until I get back at the beginning of next week. But I'll definitely read through it (and your emails) ASAP.

joshco commented 1 year ago

Was this ever added? I think I'm seeing the same issue. I'm using citations with docx output and trying to get a multi-line bibliography entry by using the CSL "display" attribute values left-margin, right-inline, and block. It renders correctly in HTML, but in docx everything is run together on one line.