lierdakil / pandoc-crossref

Pandoc filter for cross-references
https://lierdakil.github.io/pandoc-crossref/
GNU General Public License v2.0

numbersections destroys tables in docx #223

Open tolot27 opened 5 years ago

tolot27 commented 5 years ago

Consider this simple markdown:

---
numbersections: true
...

# Table test

  Right     Left     Center     Default
-------     ------ ----------   -------
     12     12        12            12
    123     123       123          123
      1     1          1             1

Table:  Demonstration of simple table syntax.

Converting it to docx using pandoc --filter pandoc-crossref --output=numbersections.docx numbersections.md creates a corrupt docx file which can be repaired by Word. But afterwards the table header is misaligned and Word shows a dialog identifying the table as the cause of the error.

It does not matter whether the document contains a section header. Even with numbersections: false, the docx gets corrupted.

I've tested it with pandoc 2.7.1 and the latest pandoc-crossref v0.3.4.0 as well as with v0.4.0.0-alpha5. I do not know when this error first occurred. I'm sure it worked last November. But if I test it now, it fails with every pandoc version down to 2.2. Strange.

tolot27 commented 5 years ago

Btw: Switching to numberSections does not produce a corrupt docx, but it double-numbers sections if a docx template already contains a numbered heading style.

tolot27 commented 5 years ago

I have to correct myself. All (?) versions of pandoc-crossref, even the newest builds, produce correct output up to pandoc 2.5. As soon as I switch to pandoc 2.6, the docx output gets corrupted.

lierdakil commented 5 years ago

Hi. Thanks for your report.

In the particular example you've provided in the OP, numbersections doesn't do anything... For one, as you noticed, the case is incorrect, and yes, metadata is case-sensitive. For two, it shouldn't do anything anyway in this example, since both chapterDepth and sectionsDepth are 0 by default. Finally, pandoc-crossref shouldn't actually do anything at all in this case since there isn't anything for it to do.

So. This doesn't seem to have anything to do with pandoc-crossref itself (or at least not with the code as it is)

Also, I wasn't able to reproduce this with

on Linux. This might be OS-specific.

One very important thing to remember though. Pandoc-crossref releases are tied to pandoc releases. If you use pandoc-crossref that was built against pandoc 2.5 with pandoc 2.7, weird stuff will happen. If you're using prebuilt binaries, I try to be rather explicit about what version of pandoc those are built with. Use v0.3.4.0c binaries with pandoc 2.6, and v0.3.4.0d with pandoc 2.7.

v0.4.0.0-alpha5 is built against pandoc 2.6, since this is ye olden preview release. Which is additionally broken in places (as should be expected from an alpha).

So, check if you're using proper binaries (run pandoc-crossref --version and it should tell you what it was built against). Also, check if pandoc-crossref is at all relevant here by trying to convert your Markdown to docx without pandoc-crossref (i.e. pandoc -o output.docx input.md). If pandoc-crossref is indeed relevant here, and you are using proper binaries, please share your OS type and version, and we'll try to debug further.
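The version pairing above can be written down as a small lookup. This is only a sketch: the table contains just the pairings named in this thread, the function names are made up for illustration, and comparing only the first two version components (so that 2.7.1 counts as 2.7) is an assumption.

```python
# Pairings stated in this thread; other releases are deliberately not listed.
BUILT_AGAINST = {
    "v0.3.4.0c": "2.6",
    "v0.3.4.0d": "2.7",
    "v0.4.0.0-alpha5": "2.6",
}


def major_minor(version: str) -> str:
    """Reduce a version such as '2.7.1' to its first two components, '2.7'."""
    return ".".join(version.split(".")[:2])


def is_matching_pair(crossref_release: str, pandoc_version: str) -> bool:
    """True only if the release is in the table and targets this pandoc version."""
    expected = BUILT_AGAINST.get(crossref_release)
    return expected is not None and expected == major_minor(pandoc_version)
```

For any release not in the table, the authoritative answer is still whatever pandoc-crossref --version reports.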

tolot27 commented 5 years ago

Okay, I made some further tests and created a version matrix:

[image: version matrix of pandoc-crossref builds tested against pandoc versions]

Most tests were done on Windows 10 x64 and Windows 8.1 x64. The three cells with a bold frame were also tested with Ubuntu 16.04.5 LTS running in the Linux Subsystem of Windows. dcc9965 is a Linux-only build from the any-prefix branch. Gray cells mean that pandoc-crossref was built against the corresponding pandoc version. + means that the docx is not corrupt; a red - means that the resulting docx was corrupted. And indeed, as soon as I convert the source file without pandoc-crossref, a valid docx is produced. That was one of my first tests. It is quite interesting that commit 449d443 works with all pandoc versions, at least with respect to numbersections.

Your test entry point should be v0.3.4.0d commit 9c61b7a, because both the Linux and the Windows versions produce a corrupt docx.

BTW: I've carefully verified the commit numbers and build versions for every test case.

lierdakil commented 5 years ago

Thank you for the thorough investigation. I was able to reproduce using the v0.3.4.0d binary on Windows (not sure why I wasn't able to repro earlier, there's a chance that I used the correct case for numberSections subconsciously -- well, that, or my environment got royally messed up due to working on v0.3 and v0.4 simultaneously). Long story short, you've uncovered a Pandoc bug, apparently.

Minimal testcase:

  1. create file input.md:

    ---
    test: 1
    Test: 1
    ...
  2. Run pandoc input.md -o test.docx

  3. Open test.docx. Observe "document is corrupted" message.
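The first two steps can be scripted, for instance to repeat the test across several pandoc installations. A minimal sketch: the helper names write_testcase and convert are invented here, and the script only produces the file, so step 3 (opening it in Word) remains manual.

```python
import shutil
import subprocess
from pathlib import Path
from typing import Optional

# The metadata block from step 1: two keys that differ only in case.
TESTCASE = "---\ntest: 1\nTest: 1\n...\n"


def write_testcase(directory: Path) -> Path:
    """Write input.md into `directory` and return its path."""
    path = directory / "input.md"
    path.write_text(TESTCASE, encoding="utf-8")
    return path


def convert(source: Path) -> Optional[Path]:
    """Run `pandoc input.md -o test.docx`; return None if pandoc is not on PATH."""
    if shutil.which("pandoc") is None:
        return None
    target = source.with_name("test.docx")
    subprocess.run(["pandoc", str(source), "-o", str(target)], check=True)
    return target
```

Calling convert(write_testcase(Path("."))) leaves test.docx next to input.md in the current directory.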

So, what's going on here?

As far as I can tell, this line from pandoc 2.6 release notes is pointing to the culprit:

Support custom properties (jgm/pandoc#3024, jgm/pandoc#5252, Agustín Martín Barbero). Also supports additional core properties: subject, lang, category, description.

Well, this and the fact that you've misspelled the case for numberSections here.

Looking at the raw OOXML, the main difference is in custom.xml.

Before pandoc-crossref:

<Properties xmlns="http://schemas.openxmlformats.org/officeDocument/2006/custom-properties"
xmlns:vt="http://schemas.openxmlformats.org/officeDocument/2006/docPropsVTypes">
  <property fmtid="{D5CDD505-2E9C-101B-9397-08002B2CF9AE}" pid="2"
  name="numbersections">
    <vt:lpwstr>True</vt:lpwstr>
  </property>
</Properties>

After pandoc-crossref:

<Properties xmlns="http://schemas.openxmlformats.org/officeDocument/2006/custom-properties"
xmlns:vt="http://schemas.openxmlformats.org/officeDocument/2006/docPropsVTypes">
  <property fmtid="{D5CDD505-2E9C-101B-9397-08002B2CF9AE}" pid="2"
  name="numbersections">
    <vt:lpwstr>True</vt:lpwstr>
  </property>
  <property fmtid="{D5CDD505-2E9C-101B-9397-08002B2CF9AE}" pid="3"
  name="numberSections">
    <vt:lpwstr>False</vt:lpwstr>
  </property>
  <!-- a lot more properties -->
</Properties>

So, pandoc-crossref adds all unspecified pandoc-crossref-related metadata fields into the document (it could in theory scrub those, but it doesn't at first glance seem like a good idea, since having all metadata set is convenient for documentation and interoperability with other filters). Pandoc then (since v2.6) blindly sticks that metadata into docx. And docx seems to handle two metadata fields with the name differing only in case very poorly.
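To check whether a given docx is affected, one can look for custom properties whose names differ only in case. The sketch below uses only the Python standard library; the function names are illustrative, and it assumes the custom properties part is stored at docProps/custom.xml inside the archive.

```python
import xml.etree.ElementTree as ET
import zipfile
from collections import defaultdict

# Namespace of the <Properties>/<property> elements shown above.
CUSTOM_NS = "{http://schemas.openxmlformats.org/officeDocument/2006/custom-properties}"


def case_collisions(custom_xml):
    """Return groups of property names that differ only in letter case."""
    root = ET.fromstring(custom_xml)
    groups = defaultdict(list)
    for prop in root.iter(CUSTOM_NS + "property"):
        name = prop.get("name", "")
        groups[name.lower()].append(name)
    return [names for names in groups.values() if len(set(names)) > 1]


def docx_case_collisions(docx_path):
    """Read docProps/custom.xml from a .docx (a zip archive) and check it.

    Raises KeyError if the archive has no custom properties part.
    """
    with zipfile.ZipFile(docx_path) as archive:
        return case_collisions(archive.read("docProps/custom.xml"))
```

For the "after pandoc-crossref" file above, the result contains one group with numbersections and numberSections; for the "before" file it is empty.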

82473ca2e4d39dbd5b3db4eb56fd605518bdaba4 and newer (which includes 449d443 and dcc9965) work because those don't have numberSections metadata field per se (it's possible to construct arbitrary object numbering schemes using templates and custom prefixes, so this option was removed in 82473ca2e4d39dbd5b3db4eb56fd605518bdaba4 as superfluous; some equivalent to quickly set-up everything to "just add section numbers" will be added at some point later, but pandoc-crossref won't add numberSections field if it's not there already)

So. The short-term solution: don't misspell metadata settings :stuck_out_tongue_closed_eyes: (sorry if this sounds somewhat grating, but I don't have any better ideas) The long-term solution: this should be reported upstream and fixed... somehow? I'm not entirely sure how exactly, probably not gracefully (e.g. discard "superfluous" metadata fields)

Not sure if you'd prefer to report upstream yourself, or have me do that. Let me know how you'd like to proceed.

P.S. Forgot to address this one:

Btw: Switching to numberSections does not produce corrupt docx but double numbers sections if a docx template already contains a numbered heading style.

This is expected. numberSections just sticks the section number into section headers as plain text. So if Word then adds its own numbers, pandoc-crossref is not aware of those (and indeed cannot be aware of those). numberSections is mostly intended for use with formats that can't number sections on their own (or when you can't be bothered with setting that up). The motivating example was IIRC HTML: I appreciate it's possible to fudge section numbering using CSS counters; most times I can't be bothered though, and not all software the document might be opened in supports those anyway.
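To illustrate what "section number as plain text" means, here is a toy sketch. It is not pandoc-crossref's actual code (which is written in Haskell); the function name and the (level, title) input shape are made up for the example.

```python
def number_headings(headings):
    """Prefix each (level, title) pair with a dotted section number."""
    counters = []  # counters[i] is the current number at heading level i + 1
    numbered = []
    for level, title in headings:
        # Drop counters of deeper levels, pad any skipped levels with zeros.
        counters = counters[:level]
        counters += [0] * (level - len(counters))
        counters[-1] += 1
        number = ".".join(str(n) for n in counters)
        numbered.append((level, f"{number} {title}"))
    return numbered
```

Because the number becomes part of the heading text itself, a Word heading style that numbers headings on its own would put a second number in front of it, which is the double numbering described above.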

tolot27 commented 5 years ago

Thanks for investigating this.

I don't see the point that I misspelled numbersections (used the incorrect case). The pandoc doc uses the lowercase version. I don't want to use the numberSections of pandoc-crossref, at least not till now.

Anyway, jgm/pandoc#5252 seems to be the root cause, as you already mentioned. I'll report the bug upstream soon.

tolot27 commented 5 years ago

Is it possible to enhance the v0.3.4.x branch in such a way that options not set explicitly are not stored in docx?

lierdakil commented 5 years ago

I don't see the point that I misspelled numbersections (used the incorrect case). The pandoc doc uses the lowercase version. I don't want to use the numberSections of pandoc-crossref, at least not till now.

Ah, so that's what you were trying to do. Right. Sorry. The similarity in names threw me off, for one. Also, it's not the conventional method to get pandoc to number sections, as far as I am aware.

Metadata fields and template variables are two rather different beasts (although pandoc did at some point start pulling variable values from metadata). If you want to tell pandoc to number sections, the conventional way to do so is to use --number-sections command line option I believe. Setting numbersections in metadata kinda works, but as far as I can tell, only for TeX, and only due to a particular cascade of coincidences. You could also try to only set the template variable using --variable=numbersections command line option, which should produce equivalent results to setting numbersections metadata field with TeX.

Is it possible to enhance the v0.3.4.x branch in such a way that options not set explicitly are not stored in docx?

"Stored in docx part" is entirely pandoc's responsibility, pandoc-crossref can't in any way influence that. The only thing it can do is to scrub its own options out of metadata before returning the document to pandoc. Frankly, that wouldn't be much of an enhancement, would it? I mean, sure, it works around a particularly unfortunate issue, but it also removes functionality that is more or less useful...

I'll try to figure something out. Can't promise when I can get to it though, spare time has been rather scarce lately.
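Until something lands in pandoc-crossref or pandoc itself, the scrubbing can be done by a separate JSON filter that runs after pandoc-crossref. The following is a sketch of such a workaround, not part of pandoc-crossref: the file name scrub_meta.py is hypothetical, and only numberSections is removed because the full list of fields pandoc-crossref adds is not given in this thread.

```python
import json
import sys


def scrub_meta(document, keys):
    """Return a copy of a pandoc JSON document without the given metadata keys."""
    meta = {k: v for k, v in document.get("meta", {}).items() if k not in keys}
    return {**document, "meta": meta}


def main():
    # A JSON filter receives the document on stdin and writes the result to stdout.
    document = json.load(sys.stdin)
    json.dump(scrub_meta(document, {"numberSections"}), sys.stdout)
```

Saved as scrub_meta.py with a call to main() added as the last line, it would be used as pandoc --filter pandoc-crossref --filter ./scrub_meta.py -o output.docx input.md. Filters run in command-line order, so pandoc-crossref has already read its options by the time the field is dropped.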

tolot27 commented 5 years ago

Is it possible to enhance the v0.3.4.x branch in such a way that options not set explicitly are not stored in docx?

"Stored in docx part" is entirely pandoc's responsibility, pandoc-crossref can't in any way influence that. The only thing it can do is to scrub its own options out of metadata before returning the document to pandoc. Frankly, that wouldn't be much of an enhancement, would it? I mean, sure, it works around a particularly unfortunate issue, but it also removes functionality that is more or less useful...

I was just talking about the removal of numberSections, which I did not set.

I'll also try to switch from numbersections to numberSections and will see how it further affects the output (after adapting my reference.docx).

lierdakil commented 5 years ago

"numbersections" set in metadata doesn't do anything for docx output (well, except apparently breaking it, that is). The only format it affects, off the top of my head, is LaTeX (perhaps also beamer). So this sentence looks somewhat confused to me:

I'll also try to switch from numbersections to numberSections and will see how it further affects the output (after adapting my reference.docx).