jgm / pandoc

Universal markup converter
https://pandoc.org

pandoc doesn't preserve order of XML elements in settings.xml #9264

Closed (edwintorok closed this 10 months ago)

edwintorok commented 10 months ago

Explain the problem.

Checked with docx-validator, the default reference doc now validates, including its settings.xml:

$ pandoc --print-default-data-file=reference.docx >|reference.docx
$ ./validate reference.docx
./tmp/document-pretty.xml validates
DOCUMENT
No entities in internal subset
No entities in external subset
./tmp/styles-pretty.xml validates
./tmp/settings-pretty.xml validates

However a newly created empty document does not:

$ touch test.md
$ pandoc test.md -o test.docx --reference-doc reference.docx
$ ./validate test.docx
./tmp/settings-pretty.xml:11: element zoom: Schemas validity error : Element '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}zoom': This element is not expected. Expected is one of ( {http://schemas.openxmlformats.org/wordprocessingml/2006/main}doNotIncludeSubdocsInStats, {http://schemas.openxmlformats.org/wordprocessingml/2006/main}doNotAutoCompressPictures, {http://schemas.openxmlformats.org/wordprocessingml/2006/main}forceUpgrade, {http://schemas.openxmlformats.org/wordprocessingml/2006/main}captions, {http://schemas.openxmlformats.org/wordprocessingml/2006/main}readModeInkLockDown, {http://schemas.openxmlformats.org/wordprocessingml/2006/main}smartTagType, {http://schemas.openxmlformats.org/schemaLibrary/2006/main}schemaLibrary, {http://schemas.openxmlformats.org/wordprocessingml/2006/main}shapeDefaults, {http://schemas.openxmlformats.org/wordprocessingml/2006/main}doNotEmbedSmartTags, {http://schemas.openxmlformats.org/wordprocessingml/2006/main}decimalSymbol ).
./tmp/settings-pretty.xml fails to validate

It looks like the settings got reordered and the 'zoom' tag is now in the wrong place:

<w:settings xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:sl="http://schemas.openxmlformats.org/schemaLibrary/2006/main" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w10="urn:schemas-microsoft-com:office:word">
  <w:stylePaneFormatFilter w:val="0004"/>
  <w:footnotePr>
    <w:footnote w:id="-1"/>
    <w:footnote w:id="0"/>
  </w:footnotePr>
  <w:rsids>
  </w:rsids>
  <w:clrSchemeMapping w:accent1="accent1" w:accent2="accent2" w:accent3="accent3" w:accent4="accent4" w:accent5="accent5" w:accent6="accent6" w:bg1="light1" w:bg2="light2" w:followedHyperlink="followedHyperlink" w:hyperlink="hyperlink" w:t1="dark1" w:t2="dark2"/>
  <w:zoom w:percent="100"/>

This is probably due to this code in Writer/Docx.hs:

settingsEntry <- copyChildren refArchive distArchive settingsPath epochtime settingsList

The order of elements in settings.xml can be seen in wml.xsd
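The kind of ordering violation reported above can be detected without the full schema. The following is a minimal Python sketch (not pandoc's Haskell code), assuming a short, illustrative subset of the `settings` child sequence; `SETTINGS_ORDER` and `out_of_order` are hypothetical names, and the complete sequence in wml.xsd is much longer.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Assumption: abbreviated subset of the child sequence for w:settings.
# The real list in wml.xsd contains many more elements.
SETTINGS_ORDER = ["zoom", "stylePaneFormatFilter", "footnotePr", "rsids", "clrSchemeMapping"]


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def out_of_order(settings_xml, order=SETTINGS_ORDER):
    """Return local names of children that appear after a later-ranked sibling."""
    rank = {name: i for i, name in enumerate(order)}
    misplaced = []
    highest = -1
    for child in ET.fromstring(settings_xml):
        name = local_name(child.tag)
        if name not in rank:
            continue  # elements outside the subset are ignored in this sketch
        if rank[name] < highest:
            misplaced.append(name)
        else:
            highest = rank[name]
    return misplaced


SAMPLE = f"""<w:settings xmlns:w="{W}">
  <w:stylePaneFormatFilter w:val="0004"/>
  <w:rsids/>
  <w:clrSchemeMapping w:bg1="light1"/>
  <w:zoom w:percent="100"/>
</w:settings>"""

print(out_of_order(SAMPLE))  # ['zoom']
```

This only flags elements that come too late relative to the list. It is no substitute for validating against the .xsd.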

Pandoc version?

I've built it from latest main:

$ git describe --always
5875de3f8
$ pandoc --version
pandoc 3.1.11
Features: +server +lua
Scripting engine: Lua 5.4
User data directory: /var/home/edwin/.local/share/pandoc
Copyright (C) 2006-2023 John MacFarlane. Web: https://pandoc.org
This is free software; see the source for copying conditions. There is no
warranty, not even for merchantability or fitness for a particular purpose.

jgm commented 10 months ago

One sensible approach might be to put all the elements that can go in settings.xml, in order, in the archive reference.docx. Then we could simply update the ones that are found in the user's reference.docx, leaving the order. But to do this I'd have to know what default values to give all these settings.

jgm commented 10 months ago

I guess it's just as easy to embed the ordered list of element names in the Haskell code... [EDIT:] There is such a list already, settingsList, it's just not complete or correctly ordered!
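The idea of keeping an ordered list of element names and sorting the copied children by it can be sketched as follows. This is a Python illustration, not the Haskell in Writer/Docx.hs; `SCHEMA_ORDER` is an abbreviated, assumed subset of the real sequence, and placing unknown elements last is an assumption made only for this sketch.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Assumption: abbreviated subset of the settings child sequence from wml.xsd.
SCHEMA_ORDER = ["zoom", "stylePaneFormatFilter", "footnotePr", "rsids", "clrSchemeMapping"]


def reorder_children(root, order=SCHEMA_ORDER):
    """Sort the direct children of `root` by their position in `order`."""
    rank = {f"{{{W}}}{name}": i for i, name in enumerate(order)}
    children = list(root)
    # list.sort is stable, so elements missing from `order` (ranked last)
    # keep their original relative order.
    children.sort(key=lambda el: rank.get(el.tag, len(order)))
    root[:] = children
    return root


root = ET.fromstring(
    f"""<w:settings xmlns:w="{W}">
  <w:stylePaneFormatFilter w:val="0004"/>
  <w:footnotePr/>
  <w:rsids/>
  <w:clrSchemeMapping w:bg1="light1"/>
  <w:zoom w:percent="100"/>
</w:settings>"""
)
reorder_children(root)
print([el.tag.rsplit("}", 1)[-1] for el in root])
# ['zoom', 'stylePaneFormatFilter', 'footnotePr', 'rsids', 'clrSchemeMapping']
```

Sorting by a fixed list avoids having to choose default values for every possible setting, which was the open question with the template-based approach.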

edwintorok commented 10 months ago

I wrote a script that attempts to fix up the docx. A small test document now validates successfully according to the .xsd, but not yet according to the .NET tool. In particular, the list of tags for settings is here if it helps. I'll try to create issues or send PRs for the other things I found when I find some time.

jgm commented 10 months ago

I think I have it working now, but further testing always welcome!

jgm commented 10 months ago

I ran docx-validator on all the golden tests in test/docx/golden. These were the failures:

lists.docx

./tmp/document-pretty.xml:129: element pStyle: Schemas validity error : Element '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}pStyle': This element is not expected. Expected is one of ( {http://schemas.openxmlformats.org/wordprocessingml/2006/main}suppressLineNumbers, {http://schemas.openxmlformats.org/wordprocessingml/2006/main}pBdr, {http://schemas.openxmlformats.org/wordprocessingml/2006/main}shd, {http://schemas.openxmlformats.org/wordprocessingml/2006/main}tabs, {http://schemas.openxmlformats.org/wordprocessingml/2006/main}suppressAutoHyphens, {http://schemas.openxmlformats.org/wordprocessingml/2006/main}kinsoku, {http://schemas.openxmlformats.org/wordprocessingml/2006/main}wordWrap, {http://schemas.openxmlformats.org/wordprocessingml/2006/main}overflowPunct, {http://schemas.openxmlformats.org/wordprocessingml/2006/main}topLinePunct, {http://schemas.openxmlformats.org/wordprocessingml/2006/main}autoSpaceDE ).

lists_div_bullets.docx

./tmp/document-pretty.xml:33: element pStyle: Schemas validity error : Element '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}pStyle': This element is not expected. Expected is one of ( {http://schemas.openxmlformats.org/wordprocessingml/2006/main}suppressLineNumbers, {http://schemas.openxmlformats.org/wordprocessingml/2006/main}pBdr, {http://schemas.openxmlformats.org/wordprocessingml/2006/main}shd, {http://schemas.openxmlformats.org/wordprocessingml/2006/main}tabs, {http://schemas.openxmlformats.org/wordprocessingml/2006/main}suppressAutoHyphens, {http://schemas.openxmlformats.org/wordprocessingml/2006/main}kinsoku, {http://schemas.openxmlformats.org/wordprocessingml/2006/main}wordWrap, {http://schemas.openxmlformats.org/wordprocessingml/2006/main}overflowPunct, {http://schemas.openxmlformats.org/wordprocessingml/2006/main}topLinePunct, {http://schemas.openxmlformats.org/wordprocessingml/2006/main}autoSpaceDE ).
./tmp/document-pretty.xml:45: element pStyle: Schemas validity error : Element '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}pStyle': This element is not expected. Expected is one of ( {http://schemas.openxmlformats.org/wordprocessingml/2006/main}suppressLineNumbers, {http://schemas.openxmlformats.org/wordprocessingml/2006/main}pBdr, {http://schemas.openxmlformats.org/wordprocessingml/2006/main}shd, {http://schemas.openxmlformats.org/wordprocessingml/2006/main}tabs, {http://schemas.openxmlformats.org/wordprocessingml/2006/main}suppressAutoHyphens, {http://schemas.openxmlformats.org/wordprocessingml/2006/main}kinsoku, {http://schemas.openxmlformats.org/wordprocessingml/2006/main}wordWrap, {http://schemas.openxmlformats.org/wordprocessingml/2006/main}overflowPunct, {http://schemas.openxmlformats.org/wordprocessingml/2006/main}topLinePunct, {http://schemas.openxmlformats.org/wordprocessingml/2006/main}autoSpaceDE ).

lists_multiple_initial.docx

./tmp/document-pretty.xml:10: element pStyle: Schemas validity error : Element '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}pStyle': This element is not expected. Expected is one of ( {http://schemas.openxmlformats.org/wordprocessingml/2006/main}suppressLineNumbers, {http://schemas.openxmlformats.org/wordprocessingml/2006/main}pBdr, {http://schemas.openxmlformats.org/wordprocessingml/2006/main}shd, {http://schemas.openxmlformats.org/wordprocessingml/2006/main}tabs, {http://schemas.openxmlformats.org/wordprocessingml/2006/main}suppressAutoHyphens, {http://schemas.openxmlformats.org/wordprocessingml/2006/main}kinsoku, {http://schemas.openxmlformats.org/wordprocessingml/2006/main}wordWrap, {http://schemas.openxmlformats.org/wordprocessingml/2006/main}overflowPunct, {http://schemas.openxmlformats.org/wordprocessingml/2006/main}topLinePunct, {http://schemas.openxmlformats.org/wordprocessingml/2006/main}autoSpaceDE ).
./tmp/document-pretty.xml:41: element pStyle: Schemas validity error : Element '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}pStyle': This element is not expected. Expected is one of ( {http://schemas.openxmlformats.org/wordprocessingml/2006/main}suppressLineNumbers, {http://schemas.openxmlformats.org/wordprocessingml/2006/main}pBdr, {http://schemas.openxmlformats.org/wordprocessingml/2006/main}shd, {http://schemas.openxmlformats.org/wordprocessingml/2006/main}tabs, {http://schemas.openxmlformats.org/wordprocessingml/2006/main}suppressAutoHyphens, {http://schemas.openxmlformats.org/wordprocessingml/2006/main}kinsoku, {http://schemas.openxmlformats.org/wordprocessingml/2006/main}wordWrap, {http://schemas.openxmlformats.org/wordprocessingml/2006/main}overflowPunct, {http://schemas.openxmlformats.org/wordprocessingml/2006/main}topLinePunct, {http://schemas.openxmlformats.org/wordprocessingml/2006/main}autoSpaceDE ).

table_one_row.docx

./tmp/document-pretty.xml:9: element jc: Schemas validity error : Element '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}jc': This element is not expected. Expected is one of ( {http://schemas.openxmlformats.org/wordprocessingml/2006/main}tblCaption, {http://schemas.openxmlformats.org/wordprocessingml/2006/main}tblDescription, {http://schemas.openxmlformats.org/wordprocessingml/2006/main}tblPrChange ).

tables-default-widths.docx

./tmp/document-pretty.xml:18: element jc: Schemas validity error : Element '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}jc': This element is not expected. Expected is one of ( {http://schemas.openxmlformats.org/wordprocessingml/2006/main}tblCaption, {http://schemas.openxmlformats.org/wordprocessingml/2006/main}tblDescription, {http://schemas.openxmlformats.org/wordprocessingml/2006/main}tblPrChange ).
./tmp/document-pretty.xml:236: element jc: Schemas validity error : Element '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}jc': This element is not expected. Expected is one of ( {http://schemas.openxmlformats.org/wordprocessingml/2006/main}tblCaption, {http://schemas.openxmlformats.org/wordprocessingml/2006/main}tblDescription, {http://schemas.openxmlformats.org/wordprocessingml/2006/main}tblPrChange ).
./tmp/document-pretty.xml:301: element jc: Schemas validity error : Element '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}jc': This element is not expected. Expected is one of ( {http://schemas.openxmlformats.org/wordprocessingml/2006/main}tblCaption, {http://schemas.openxmlformats.org/wordprocessingml/2006/main}tblDescription, {http://schemas.openxmlformats.org/wordprocessingml/2006/main}tblPrChange ).

tables.docx

./tmp/document-pretty.xml:18: element jc: Schemas validity error : Element '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}jc': This element is not expected. Expected is one of ( {http://schemas.openxmlformats.org/wordprocessingml/2006/main}tblCaption, {http://schemas.openxmlformats.org/wordprocessingml/2006/main}tblDescription, {http://schemas.openxmlformats.org/wordprocessingml/2006/main}tblPrChange ).
./tmp/document-pretty.xml:237: element jc: Schemas validity error : Element '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}jc': This element is not expected. Expected is one of ( {http://schemas.openxmlformats.org/wordprocessingml/2006/main}tblCaption, {http://schemas.openxmlformats.org/wordprocessingml/2006/main}tblDescription, {http://schemas.openxmlformats.org/wordprocessingml/2006/main}tblPrChange ).
./tmp/document-pretty.xml:303: element jc: Schemas validity error : Element '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}jc': This element is not expected. Expected is one of ( {http://schemas.openxmlformats.org/wordprocessingml/2006/main}tblCaption, {http://schemas.openxmlformats.org/wordprocessingml/2006/main}tblDescription, {http://schemas.openxmlformats.org/wordprocessingml/2006/main}tblPrChange ).
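With this many repeated messages, it can help to count failures per offending element. Below is a minimal Python sketch, assuming the validator output keeps the `file:line: element NAME: Schemas validity error` shape shown above; `summarize` is a hypothetical helper, and the sample lines are shortened copies of three messages from this comment.

```python
import re
from collections import Counter

# Matches the leading "file:line: element NAME: Schemas validity error" part.
ERROR_RE = re.compile(
    r"^(?P<file>[^:]+):(?P<line>\d+): element (?P<element>[\w.-]+): Schemas validity error"
)


def summarize(output):
    """Count schema validity errors per offending element name."""
    counts = Counter()
    for line in output.splitlines():
        match = ERROR_RE.match(line)
        if match:
            counts[match.group("element")] += 1
    return counts


NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
# Shortened: the "Expected is one of (...)" tail of each message is omitted.
SAMPLE = "\n".join([
    f"./tmp/document-pretty.xml:33: element pStyle: Schemas validity error : Element '{NS}pStyle': This element is not expected.",
    f"./tmp/document-pretty.xml:45: element pStyle: Schemas validity error : Element '{NS}pStyle': This element is not expected.",
    f"./tmp/document-pretty.xml:9: element jc: Schemas validity error : Element '{NS}jc': This element is not expected.",
])

print(dict(summarize(SAMPLE)))  # {'pStyle': 2, 'jc': 1}
```

Applied to the full output above, every failure falls under one of two elements: `pStyle` in the list tests and `jc` in the table tests.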

jgm commented 7 months ago

@edwintorok Was endnotePr left out of the list for a reason? EDIT: I see now that endnotePr and footnotePr are both problematic, because they may depend on the endnotes.xml or footnotes.xml in the reference docx, which isn't copied over.
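The dependency described here can be checked mechanically. The following is a minimal Python sketch, assuming that each `w:footnote` under `w:footnotePr` in settings.xml refers by `w:id` to a `w:footnote` in footnotes.xml; `dangling_footnote_refs` is a hypothetical helper and the sample documents are made up for illustration.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}
ID = f"{{{W}}}id"  # fully qualified name of the w:id attribute


def dangling_footnote_refs(settings_xml, footnotes_xml):
    """Return footnote ids referenced in settings.xml but absent from footnotes.xml."""
    settings = ET.fromstring(settings_xml)
    footnotes = ET.fromstring(footnotes_xml)
    defined = {fn.get(ID) for fn in footnotes.findall("w:footnote", NS)}
    referenced = [fn.get(ID) for fn in settings.findall("w:footnotePr/w:footnote", NS)]
    return [ref for ref in referenced if ref not in defined]


SETTINGS = f"""<w:settings xmlns:w="{W}">
  <w:footnotePr>
    <w:footnote w:id="-1"/>
    <w:footnote w:id="0"/>
  </w:footnotePr>
</w:settings>"""

# Hypothetical footnotes.xml that defines only one of the two referenced ids.
FOOTNOTES = f"""<w:footnotes xmlns:w="{W}">
  <w:footnote w:id="0"/>
</w:footnotes>"""

print(dangling_footnote_refs(SETTINGS, FOOTNOTES))  # ['-1']
```

The same check would apply to `endnotePr` and endnotes.xml with the element names swapped.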

edwintorok commented 7 months ago

I don't see endnotePr anywhere in pandoc currently, and as you say tag order isn't the only thing missing in order to support it. The OOXML validator in the CI should be pretty good at picking up missing references though once you start using it.