tomduck / pandoc-eqnos

A pandoc filter for numbering equations and equation references.
GNU General Public License v3.0

Make valid .docx file with equation numbering #64

Open nOkuda opened 2 years ago

nOkuda commented 2 years ago

Resolves #62

At the least, I've been able to produce a .docx file that opens in Google Docs. My solution was to remove what appear to be extraneous tags.

nOkuda commented 2 years ago

Here are the notes I took while diagnosing the error:

I am using pandoc 1.16.1, pandoc-eqnos 2.5.0, and pandocfilters 1.5.0.

Here's the file I used: test.md.

Here's the command I used: pandoc --filter pandoc-eqnos -o test.docx test.md.

Here's the output: test.docx.

When attempting to open the .docx file in LibreOffice Writer, the error is reported to be on line 1, character 739 of the word/document.xml entity in the .docx file. Unzipping test.docx and opening up word/document.xml shows that this location is the start of an end tag </w:p>.
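A minimal sketch of one way to inspect that location without unzipping by hand. It assumes the reported character position is a 1-based offset into a single-line word/document.xml. The helper name `xml_context` and the tiny stand-in archive are hypothetical, used only so the example runs without the real test.docx.

```python
import io
import zipfile


def xml_context(docx, column, width=60, member="word/document.xml"):
    """Return (before, after) text around a 1-based character column."""
    with zipfile.ZipFile(docx) as archive:
        text = archive.read(member).decode("utf-8")
    index = column - 1  # assumption: the reported position is 1-based
    return text[max(0, index - width):index], text[index:index + width]


# Stand-in archive (hypothetical content) so the sketch is self-contained.
sample = io.BytesIO()
with zipfile.ZipFile(sample, "w") as archive:
    archive.writestr("word/document.xml", "<w:p><w:r><w:t></w:p>")

before, after = xml_context(sample, column=16)
print("before:", before)
print("after: ", after)
```

With the real file, the call would presumably be `xml_context("test.docx", column=739)`.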

Immediately prior to this end tag, I see <w:bookmarkStart w:id="0" w:name="eq:a"/><w:r><w:t>. There are two things of note here. One is that this segment looks like it comes from this line of code (line 216 in pandoc_eqnos.py at the time of writing). The other is that both <w:r> and <w:t> at the end of this segment lack their own closing tags before the </w:p> tag that the error message reported.

However, there are matching </w:t> and </w:r> end tags starting at character 1484. This particular segment, which appears as </w:t></w:r><w:bookmarkEnd w:id="0"/>, looks like this line of code (line 220 in pandoc_eqnos.py at the time of writing).
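The mismatch described above can be reproduced with a reduced fragment. This is a sketch under assumptions: the two strings below are hand-written stand-ins for the relevant parts of word/document.xml, not the actual file contents, and `first_xml_error` is a hypothetical helper. It shows that an XML parser rejects <w:r><w:t> left open across a </w:p>, even though matching end tags appear in a later paragraph.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace, declared so the w: prefix resolves.
NS = 'xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"'


def first_xml_error(xml_text):
    """Return the (line, column) of the first parse error, or None if well-formed."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        return err.position
    return None


# Stand-in for the broken output: <w:r><w:t> opened in one paragraph,
# closed in another.
broken = (
    f"<w:body {NS}>"
    '<w:p><w:bookmarkStart w:id="0" w:name="eq:a"/><w:r><w:t></w:p>'
    '<w:p></w:t></w:r><w:bookmarkEnd w:id="0"/></w:p>'
    "</w:body>"
)

# Same fragment with the run/text wrappers removed.
fixed = (
    f"<w:body {NS}>"
    '<w:p><w:bookmarkStart w:id="0" w:name="eq:a"/></w:p>'
    '<w:p><w:bookmarkEnd w:id="0"/></w:p>'
    "</w:body>"
)

broken_error = first_xml_error(broken)
fixed_error = first_xml_error(fixed)
print("broken:", broken_error)
print("fixed: ", fixed_error)
```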

Another thing I've noticed, though it might not be significant, is that both of the segments I've commented on immediately follow the sequence <w:pPr><w:pStyle w:val="FirstParagraph" /></w:pPr>. If I'm interpreting this correctly, this is applying a first paragraph style to some span within the document. I'm not sure why the first paragraph styling is being applied at the end of the document, but maybe that's an expected behavior in .docx files.

My initial guess as to why the error is occurring has to do with how the results of _add_markup interact with the final json data structure that gets passed on to pandoc. Perhaps instead of returning bookmarkstart, AttrMath(*value), and bookmarkend as a list, as they are here (line 221 in pandoc_eqnos.py at the time of writing), they should be precombined into their own pandoc AST node.

I never figured out how to precombine the list, so I decided to take the even easier route of removing the <w:r><w:t> and the </w:t></w:r> from the earlier referenced lines of the code. I was at first reluctant to try this, since I was concerned about the bookmark tags. But on closer inspection, and after running into the example in the BookmarkStart class, I realized that they are self-closing, which means that they don't need to be in any specific relationship with the BookmarkEnd class. (I wonder if that means that the w:id attribute might need to be updated with different values for each equation referenced; a future fix to pandoc-eqnos, I suppose.)
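As a rough sketch of the possible follow-up mentioned in the parenthetical, bookmark markup without the <w:r><w:t> wrappers could be generated with a distinct w:id per equation. This is not the code in pandoc_eqnos.py: the function name `bookmark_pair` and the module-level counter are hypothetical, and the sketch rests on the assumption that a bookmark's start and end elements are paired through their shared w:id.

```python
from itertools import count
from xml.sax.saxutils import quoteattr

# Hypothetical counter handing out one id per bookmarked equation.
_bookmark_ids = count()


def bookmark_pair(name):
    """Return (start, end) OpenXML bookmark tags sharing a fresh w:id."""
    bookmark_id = next(_bookmark_ids)
    # quoteattr escapes the label and adds the surrounding quotes.
    start = f'<w:bookmarkStart w:id="{bookmark_id}" w:name={quoteattr(name)}/>'
    end = f'<w:bookmarkEnd w:id="{bookmark_id}"/>'
    return start, end


first = bookmark_pair("eq:a")
second = bookmark_pair("eq:b")
print(first)
print(second)
```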

To generate the .docx file with the modified code, I adapted the piping instructions on the Pandoc filter page, which led to the following command: pandoc -s test.md -t json | python3 <path_to_modified>/pandoc_eqnos.py docx | pandoc -s -f json -o newtest.docx. This yielded newtest.docx. I was happy to see that LibreOffice Writer opened the file without reporting an error, but I was unsatisfied with the formatting. Looking at the word/document.xml in newtest.docx, I noticed that the formatting information for the equation was there, so I uploaded the file to Google Docs and verified that the equation was formatted more pleasantly there.

ianhbell commented 2 years ago

Any chance of merging this at some point?