jgm / pandoc

Universal markup converter

https://pandoc.org

DOCX writer - support unnumbered with reference doc #8824

Open svenboulanger opened 1 year ago

svenboulanger commented 1 year ago

Unnumbered sections in DOCX are supported in combination with the --number-sections flag, but not when using a reference document that is set up to number sections. To compare:

The XML exported for a regular (numbered) section:

<w:p>
    <w:pPr>
        <w:pStyle w:val="Heading1"/>
    </w:pPr>
    <w:bookmarkStart w:id="1" w:name="bookmark"/>
    <w:r>
        <w:t>Header contents</w:t>
    </w:r>
</w:p>

When using --number-sections, pandoc adds the following XML:

<w:p>
    <w:pPr>
        <w:pStyle w:val="Heading1"/>
        <w:numPr>
            <w:ilvl w:val="0"/>
            <w:numId w:val="0"/>
        </w:numPr>
    </w:pPr>
    <w:bookmarkStart w:id="1" w:name="bookmark"/>
    <w:r>
        <w:t>Header contents</w:t>
    </w:r>
</w:p>

[edit] I discovered that the XML shown here isn't actually from pandoc, but the issue is still the same (the <w:numPr> tags).

I propose to include the extra XML (the <w:numPr> part) for unnumbered sections regardless of the --number-sections flag, as this allows unnumbered sections even when the reference document handles the section numbering.

svenboulanger commented 1 year ago

Note that at the moment I am using a Lua filter that adds the XML manually through RawBlocks:

local counter = 0

-- Escape characters that are special in XML.
local function escape_xml(text)
    local entities = {['&'] = '&amp;', ['<'] = '&lt;', ['>'] = '&gt;', ['"'] = '&quot;'}
    return (text:gsub('[&<>"]', entities))
end

function Header(header)
    -- Numbered headers are left to the default writer.
    if not header.classes:includes('unnumbered') then
        return header
    end

    counter = counter + 1

    -- Replace with the expected OOXML format; numId 0 disables numbering.
    -- The paragraph style follows the header level instead of always using Heading1.
    local result = '<w:p>'
    result = result .. '<w:pPr><w:pStyle w:val="Heading' .. header.level .. '"/><w:numPr><w:ilvl w:val="0"/><w:numId w:val="0"/></w:numPr></w:pPr>'
    result = result .. '<w:bookmarkStart w:id="b' .. counter .. '" w:name="' .. escape_xml(header.identifier) .. '"/>'
    result = result .. '<w:r><w:t>' .. escape_xml(pandoc.utils.stringify(header.content)) .. '</w:t></w:r>'
    result = result .. '</w:p>'
    return pandoc.RawBlock('openxml', result)
end

tarleb commented 1 year ago

That's a good workaround. It would also be possible to generate the default XML for the Header with the docx_blocks function below; it would then be enough to insert the numbering code before the closing </w:pPr>.

local zip = require 'pandoc.zip'

-- Render blocks with the docx writer and return the XML inside <w:body>.
function docx_blocks (blocks)
  local docx = pandoc.write(pandoc.Pandoc(blocks), 'docx')
  local document_entry = zip.Archive(docx).entries:find_if(
    function (entry) return entry.path == 'word/document.xml' end
  )
  -- Keep only the body content; the parentheses drop gsub's match count.
  return (document_entry:contents()
    :gsub('.*<w:body>(.*)<w:sectPr ?%/></w:body>.*', '%1'))
end
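
A rough sketch of how the two pieces might fit together, assuming docx_blocks from above is defined in the same filter file. The names NUM_PR and add_numbering_override are made up for illustration, and this has not been checked against every heading variant. Since each header is rendered as its own document, bookmark IDs may repeat across headers.

```lua
-- Numbering override, copied from the XML earlier in this thread.
local NUM_PR = '<w:numPr><w:ilvl w:val="0"/><w:numId w:val="0"/></w:numPr>'

-- Hypothetical helper: insert NUM_PR before the first closing </w:pPr>.
local function add_numbering_override(xml)
  -- Only the first match is replaced (limit 1); the outer parentheses
  -- drop gsub's second return value (the match count).
  return (xml:gsub('</w:pPr>', function (close) return NUM_PR .. close end, 1))
end

function Header(header)
  if not header.classes:includes('unnumbered') then
    return nil  -- returning nil keeps numbered headers unchanged
  end
  -- docx_blocks is assumed to be the function defined above.
  local xml = docx_blocks({header})
  return pandoc.RawBlock('openxml', add_numbering_override(xml))
end
```

This keeps whatever else the writer generates for the header (style, bookmarks, inline formatting) and only adds the numbering override.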

svenboulanger commented 1 year ago

Thanks, I'll try that. It feels a bit inefficient to start zipping/unzipping blocks for every unnumbered header though.

Also, thanks for reminding me that I can zip/unzip files from Lua directly. A while back I was racking my brain over how to replace the values of custom fields (they were used in the reference document header). I fixed it using Python, but I prefer Lua as it comes with pandoc anyway.

tarleb commented 1 year ago

> It feels a bit inefficient to start zipping/unzipping blocks for every unnumbered header though.

It's definitely a bit heavy-handed, and I'd like to offer ooxml as a target format some day.