jgm / pandoc

Universal markup converter
https://pandoc.org

DOCX writer - bookmark locations mess up when referencing #8825

Open svenboulanger opened 1 year ago

svenboulanger commented 1 year ago

I'm not sure whether this should have been filed as an enhancement instead, but here goes. I found that pandoc writes this XML for headers:

<w:bookmarkStart w:id="20" w:name="header-name" />
<w:p>
    <w:pPr>
        <w:pStyle w:val="Heading1" />
    </w:pPr>
    <w:r>
        <w:t xml:space="preserve">Header Name</w:t>
    </w:r>
</w:p>
<!-- Other paragraphs and stuff -->
<w:bookmarkEnd w:id="20" />

The problem is that <w:bookmarkStart> and <w:bookmarkEnd> are outside the header paragraph. Consider a reference to that bookmark, i.e. this XML:

<w:r>
    <w:fldChar w:fldCharType="begin"/>
</w:r>
<w:r>
    <w:instrText xml:space="preserve"> REF header-name \h </w:instrText>
</w:r>
<w:r>
    <w:fldChar w:fldCharType="separate"/>
</w:r>
<w:r>
    <w:t>Header Name</w:t>
</w:r>
<w:r>
    <w:fldChar w:fldCharType="end"/>
</w:r>

When these reference fields are updated in Word, the entire content between the bookmark tags is copied instead of just the header text, which messes up the document. The cause is that the bookmark tags sit outside the header paragraph in the XML. I have also noticed other issues: for example, inserting a header before a labeled header messes up references, because the new header gets spliced in right after the <w:bookmarkStart> tag and thereby becomes the target of all references.

Note that this probably only matters to people who want to add cross-references from pandoc. If you create a cross-reference from within Word, Word automatically creates a second pair of bookmark tags that are in fact inside the header paragraph.
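For those who do want to emit such cross-references from pandoc, the field runs shown above could be generated by a small helper. This is only a minimal sketch in plain Lua: the names `xml_escape` and `ref_field` are made up for illustration and are not part of pandoc, and the output simply mirrors the REF field XML quoted earlier.

```lua
-- Hypothetical helpers (not part of pandoc): build the runs of a Word REF field.

-- Escape the characters that are special in XML text.
local function xml_escape(s)
  local map = { ['&'] = '&amp;', ['<'] = '&lt;', ['>'] = '&gt;', ['"'] = '&quot;' }
  return (s:gsub('[&<>"]', map))
end

-- `bookmark` is the bookmark name to reference (e.g. a header identifier);
-- `placeholder` is the text shown until Word updates the field.
local function ref_field(bookmark, placeholder)
  return table.concat({
    '<w:r><w:fldChar w:fldCharType="begin"/></w:r>',
    '<w:r><w:instrText xml:space="preserve"> REF '
      .. xml_escape(bookmark) .. ' \\h </w:instrText></w:r>',
    '<w:r><w:fldChar w:fldCharType="separate"/></w:r>',
    '<w:r><w:t xml:space="preserve">' .. xml_escape(placeholder) .. '</w:t></w:r>',
    '<w:r><w:fldChar w:fldCharType="end"/></w:r>',
  })
end

print(ref_field('header-name', 'Header Name'))
```

Inside a Lua filter, the returned string could be wrapped as `pandoc.RawInline('openxml', ref_field(...))` so that the docx writer passes it through unchanged.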

I propose to instead generate the following XML, which is closer to what Word exports by itself:

<w:p>
    <w:pPr>
        <w:pStyle w:val="Heading1" />
    </w:pPr>
    <w:bookmarkStart w:id="20" w:name="header-name" />
    <w:r>
        <w:t xml:space="preserve">Header Name</w:t>
    </w:r>
    <w:bookmarkEnd w:id="20" />
</w:p>
<!-- Other paragraphs and stuff -->

svenboulanger commented 1 year ago

For those interested, I'm currently working around this by running the following Lua filter as the last filter (part of it was taken from this issue). It simply replaces every header with its OOXML.

-- This filter will replace all headers by their OOXML as it otherwise interferes with referencing
local counter = 0
local current_index = { 0, 0, 0, 0, 0, 0, 0, 0, 0 }
local zip = require 'pandoc.zip'

function inlines_to_ooxml(inlines)
    local docx = pandoc.write(pandoc.Pandoc(pandoc.Para(inlines)), 'docx')
    local document_entry = zip.Archive(docx).entries:find_if(
      function (entry) return entry.path == 'word/document.xml' end
    )

    -- Extract the paragraph contents
    local text = document_entry:contents():gsub('.*<w:p>(.*)</w:p>.*', '%1')

    -- Also remove paragraph styling
    text = text:gsub('<w:pPr>.*</w:pPr>', '')
    return text
end

function Header(header)
    local is_numbered = true
    local index = nil
    for _, value in ipairs(header.attr.classes) do
        if value == 'unnumbered' then is_numbered = false end
    end

    local xml = { '<w:p>' }

    -- Styling
    table.insert(xml, '<w:pPr>')
    table.insert(xml, '<w:pStyle w:val="Heading' .. header.level .. '" />')
    if not is_numbered then
        table.insert(xml, '<w:numPr><w:ilvl w:val="0"/><w:numId w:val="0"/></w:numPr>')
    end
    table.insert(xml, '</w:pPr>')

    -- Add section numbering if applicable
    if PANDOC_WRITER_OPTIONS.number_sections and is_numbered then
        -- Increment the header index and add it
        current_index[header.level] = current_index[header.level] + 1
        for i = header.level+1,#current_index do current_index[i] = 0 end
        index = ''
        for i = 1,header.level do
            if i > 1 then index = index .. '.' end
            index = index .. current_index[i]
        end
        table.insert(xml, '<w:r><w:rPr><w:rStyle w:val="SectionNumber" /></w:rPr><w:t xml:space="preserve">' .. index .. '</w:t></w:r><w:r><w:tab /></w:r>')
    end

    -- Start of bookmarks (the identifier is an empty string when the header has none)
    local has_id = header.attr.identifier ~= nil and header.attr.identifier ~= ''
    if has_id then
        counter = counter + 1
        table.insert(xml, '<w:bookmarkStart w:id="h' .. counter .. '" w:name="' .. header.attr.identifier .. '" />')
    end

    -- Header contents
    table.insert(xml, inlines_to_ooxml(header.content))

    -- End of bookmarks
    if has_id then
        table.insert(xml, '<w:bookmarkEnd w:id="h' .. counter .. '" />')
    end

    table.insert(xml, '</w:p>')
    return pandoc.RawBlock('openxml', table.concat(xml, ''))
end

tarleb commented 1 year ago

So if I understand correctly, then the problem is that the docx writer treats heading IDs as identifiers for the whole section? It seems sensible to change that, but I'm not sure if there could be unintended consequences.

svenboulanger commented 1 year ago

That is correct.

From what I can find, bookmarks can be placed anywhere in the document (the source I found follows the convention Word uses).

The current output doesn't strictly violate the OpenXML format (it is not a syntax error), but it doesn't play nice when combined with OOXML referencing. I think the main problem I'm having is described here:

If the text marked by the bookmark contains a paragraph mark, the text preceding the REF field assumes the formatting of the paragraph in the bookmark.

Because of the way pandoc's writer is implemented, the bookmark always contains a (header) paragraph. Even if I create a correct REF field instruction pointing to a pandoc-generated header identifier, updating the fields in Word itself produces weird results (i.e. the two are not compatible).
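To check whether a given word/document.xml is affected, one could scan it for bookmarks whose range contains a paragraph mark. The following is a rough sketch in plain Lua under some assumptions: it uses string patterns instead of a real XML parser, so it assumes that comments and attribute values contain no `>` characters, and the function name `bookmarks_spanning_paragraphs` is invented for this example.

```lua
-- Hypothetical diagnostic (not part of pandoc): return the names of bookmarks
-- whose range contains a paragraph mark, i.e. a paragraph ends between
-- <w:bookmarkStart> and the matching <w:bookmarkEnd>.
local function bookmarks_spanning_paragraphs(xml)
  local open = {}     -- bookmark id -> name, for bookmarks that are currently open
  local seen = {}     -- names already reported
  local flagged = {}  -- result list
  for closing, name, attrs in xml:gmatch('<(/?)([%w:]+)([^>]*)>') do
    if name == 'w:bookmarkStart' then
      local id = attrs:match('w:id="([^"]*)"')
      local bname = attrs:match('w:name="([^"]*)"')
      if id and bname then open[id] = bname end
    elseif name == 'w:bookmarkEnd' then
      local id = attrs:match('w:id="([^"]*)"')
      if id then open[id] = nil end
    elseif name == 'w:p' and (closing == '/' or attrs:match('/%s*$')) then
      -- A paragraph ends here: every bookmark that is still open
      -- contains this paragraph's mark.
      for _, bname in pairs(open) do
        if not seen[bname] then
          seen[bname] = true
          table.insert(flagged, bname)
        end
      end
    end
  end
  table.sort(flagged)
  return flagged
end

-- Bookmark wrapped around the header paragraph, as in the current output.
local current = '<w:bookmarkStart w:id="20" w:name="header-name" />'
  .. '<w:p><w:pPr><w:pStyle w:val="Heading1" /></w:pPr>'
  .. '<w:r><w:t>Header Name</w:t></w:r></w:p>'
  .. '<w:bookmarkEnd w:id="20" />'

-- Bookmark inside the header paragraph, as proposed.
local proposed = '<w:p><w:pPr><w:pStyle w:val="Heading1" /></w:pPr>'
  .. '<w:bookmarkStart w:id="20" w:name="header-name" />'
  .. '<w:r><w:t>Header Name</w:t></w:r>'
  .. '<w:bookmarkEnd w:id="20" /></w:p>'

print(table.concat(bookmarks_spanning_paragraphs(current), ', '))
print(#bookmarks_spanning_paragraphs(proposed))
```

With the first sample the bookmark is reported because the header paragraph closes while the bookmark is still open; with the second sample nothing is reported because the bookmark ends before the paragraph does.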

jgm commented 1 year ago

So is the desired output something like this?

<w:p>
    <w:pPr>
        <w:pStyle w:val="Heading1" />
    </w:pPr>
    <w:bookmarkStart w:id="20" w:name="header-name" />
    <w:r>
        <w:t xml:space="preserve">Header Name</w:t>
    </w:r>
    <w:bookmarkEnd w:id="20" />
</w:p>
<!-- Other paragraphs and stuff -->

svenboulanger commented 1 year ago

That is correct. You might also want to take a look at this issue since I guess it targets the same code.