jgm / pandoc

Universal markup converter
https://pandoc.org

docx full template support #5268

Closed brainchild0 closed 5 years ago

brainchild0 commented 5 years ago

I acquired a DOCX file from a public repository, intended for use with Pandoc as a reference document. In the document, I find headers with $-delimited variable names for substitution, and the author name preceded by the text "By ". When I use this file as a reference document, the headers are copied but with no variable substitution, and the "By" text is removed from the author line.

Reading the documentation, I learned that Pandoc does not support full templates for DOCX file generation, and that instead, only the styles are used from a reference document, with the content ignored.

In fact, this statement appears only partially true, as I observed that the headers also were copied from the reference document to the result. Putting aside the question of why the public repository supplied a reference document that works incorrectly, I am currently puzzled by the state of support for DOCX in Pandoc. Support is greater than the documented behavior that only the styles are used from the reference document, but less than full template support. Meanwhile, some time ago, Microsoft moved to the open, Zip-archive, XML-based solution for file formats, represented in the DOCX format in the case of Word, ending the period when third-party vendors were required to cope with an obtuse, closed format when attempting to interoperate with Microsoft applications.

Needless to say, the current state of support for DOCX file generation is very confusing and ambiguous for the new user of Pandoc, whereas support for MS Office files can be added to any software now by way of the open standards from which the new file formats are specified, and by navigating and manipulating the XML tree in whatever way is necessary for a particular operation.

Given these considerations, it seems quite feasible to implement full support for templates for generation of DOCX output, and I would like to request the feature.

jgm commented 5 years ago

brainchild0 notifications@github.com writes:

case of Word, ending the period when third-party vendors were required to cope with an obtuse, closed format when attempting to interoperate with Microsoft applications.

The format is open, but still obtuse, as those of us who have had to work with it can attest!

(Note to @jkr - a long time ago you had the ambition of creating a separate library defining a native Haskell representation of docx plus code to convert this to and from docx format. I think that is a great idea: it would make pandoc's docx handling more robust and maintainable. Cf. the small ipynb library I created for ipynb readers and writers.)

We should better document that headers and footers (and a few other things) carry over from the reference.docx.

As for the idea of having reference.docx work like normal pandoc templates, where variables are interpolated and control structures marked, that's not too easy because of the convoluted way in which text gets represented in docx. In ODT, one can use the opendocument template, but we have nothing corresponding in docx.

brainchild0 commented 5 years ago

I understand the obstacles, but a few notes:

  1. The XML schema may be obtuse, but practically speaking, a template processor is not necessarily completely aware of all the nuances of the format. It is the job of the target application to support all the features of the format, whereas a template processor has the separate job of inserting the user-supplied content into the template.
  2. If a third-party FOSS C++ solution is available for DOCX processing, and you are willing and able to link to it (though I don't know the limitation for linking Haskell build objects to C++), you could consider it.
  3. Knowing only the high-level details of DOCX and ODT, I am still struggling to understand the severe problem in the DOCX case. Both formats are ZIP archives populated principally with XML documents, with support for a special case of a single flat XML document instead of the archive. The flat document is the basis of the existing ODT template, and I wonder whether such is also possible for DOCX.
  4. A further idea, which would improve usability for a lay user, is a different approach for templates. Suppose that the user created a template by opening the word processor and building a document, inserting variable names and control sequences directly in the formatted text. As the user is supplying formatting details via the word processor, the job of the template engine is to identify regions of text that when represented as unformatted text, can be processed by the template engine. The simple example is that if a paragraph appears in bold typesetting "My name is $author$", the effect is that typesetting format and literal text is unchanged with only the variable substitution occurring. The processor must then have some, but not full, awareness of the target format, but creating a template is simplified because knowledge of how to synthesize a valid file in the target format, from constituent sequences of plain text, is not needed to develop a template. That is, the processor is tweaking an existing file in the target format, not building one from scratch.
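
Points 3 and 4 above can be illustrated with a minimal sketch. It assumes the variable sits entirely inside one text node, and it builds a toy archive holding only `word/document.xml`; a real DOCX needs further parts (content types, relationships), so this is not a valid Word file. The helper names `make_archive` and `fill` are hypothetical and are not part of Pandoc.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W)

# A deliberately tiny stand-in for word/document.xml: one bold run.
DOCUMENT_XML = (
    f'<w:document xmlns:w="{W}"><w:body><w:p><w:r>'
    '<w:rPr><w:b/></w:rPr><w:t>My name is $author$</w:t>'
    '</w:r></w:p></w:body></w:document>'
)

def make_archive(document_xml: str) -> bytes:
    """Pack one XML part into a ZIP, the way a DOCX container stores it."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", document_xml)
    return buf.getvalue()

def fill(archive: bytes, variables: dict) -> str:
    """Substitute $name$ inside each w:t node; formatting nodes are untouched."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    for t in root.iter(f"{{{W}}}t"):
        for name, value in variables.items():
            t.text = (t.text or "").replace(f"${name}$", value)
    return ET.tostring(root, encoding="unicode")

result = fill(make_archive(DOCUMENT_XML), {"author": "Ada"})
print(result)
```

The bold run property survives because only the text node is edited. The sketch breaks down as soon as Word splits the variable across several runs.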

jgm commented 5 years ago

Here's a discussion from pandoc-discuss which might give you a better idea of the issues that arise:

https://groups.google.com/d/msg/pandoc-discuss/ASPvqikz69E/80i86W6cBQAJ

Ophir Lifshitz's suggestion of pruning empty paragraphs might deal with the issue I noted. But note that there are many uglier cases that can arise, when people try to add various kinds of formatting. It might be worth trying this out.

The other issue with allowing user-included content in the body of the reference.docx is that it's very easy to get a corrupt docx this way. Some bits of docx content require things to be a certain way in other parts of the docx container, so it can be tricky combining a user-contributed document with the bits generated by pandoc.
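
One kind of cross-part dependency can be sketched as follows. This is only an illustration of the general point, not a claim about which dependency caused the corruption mentioned here. It assumes the usual OOXML packaging convention that `r:`-namespaced attributes in the body must match an `Id` declared in the relationships part. The function name `dangling_ids` is hypothetical.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
PKG = "http://schemas.openxmlformats.org/package/2006/relationships"

def dangling_ids(document_xml: str, rels_xml: str) -> set:
    """Return relationship ids used in the body but not declared in the rels part."""
    declared = {
        rel.get("Id")
        for rel in ET.fromstring(rels_xml).iter(f"{{{PKG}}}Relationship")
    }
    used = set()
    for element in ET.fromstring(document_xml).iter():
        for attr, value in element.attrib.items():
            if attr.startswith(f"{{{R}}}"):  # any r:-namespaced attribute
                used.add(value)
    return used - declared

# Body content copied from a user document, still pointing at rId7 ...
DOC = (
    f'<w:document xmlns:w="{W}" xmlns:r="{R}"><w:body><w:p>'
    '<w:hyperlink r:id="rId7"><w:r><w:t>link</w:t></w:r></w:hyperlink>'
    '</w:p></w:body></w:document>'
)
# ... combined with a generated relationships part that only knows rId1.
RELS = (
    f'<Relationships xmlns="{PKG}">'
    f'<Relationship Id="rId1" Type="{R}/styles" Target="styles.xml"/>'
    '</Relationships>'
)

missing = dangling_ids(DOC, RELS)
print(missing)
```

A check like this could flag body content that is unsafe to carry over, though it covers only one of the ways the parts of the container depend on each other.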

brainchild0 commented 5 years ago

Ophir Lifshitz's suggestion of pruning empty paragraphs might deal with the issue I noted.

I think that suggestion is largely the same as mine, that the template is a document that the target application can create and recognize as valid, but that the user recognizes as lacking actual content, instead having only the control sequences used by the template engine.

The main obstacle seems to be the way that the OOXML format of MS Word represents the body text not as a contiguous sequence of plain text but as a tree of three levels, namely blocks, runs, and visual features (e.g. text). Annoying indeed, but perhaps workable with a little imagination.
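
A small sketch of why the run structure matters, assuming a paragraph in which the word processor has split `$author$` across two runs (for instance because part of it was italicized). The XML is hand-written for illustration.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# One paragraph, two runs; the variable straddles the run boundary.
PARAGRAPH = (
    f'<w:p xmlns:w="{W}">'
    '<w:r><w:t>By $au</w:t></w:r>'
    '<w:r><w:rPr><w:i/></w:rPr><w:t>thor$</w:t></w:r>'
    '</w:p>'
)

paragraph = ET.fromstring(PARAGRAPH)
run_texts = [t.text or "" for t in paragraph.iter(f"{{{W}}}t")]

# Searching run by run misses the variable ...
found_in_single_run = any("$author$" in text for text in run_texts)
# ... while the concatenated paragraph text contains it.
found_in_paragraph = "$author$" in "".join(run_texts)

print(run_texts, found_in_single_run, found_in_paragraph)
```

A template processor would therefore have to match against the joined text and then decide which run's formatting the substituted value inherits.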

Consider that the basic template engine will scan a sequence of text, copying the input to output, until it recognizes a character or character range that it can interpolate or otherwise process specially. Similarly, in the OOXML case, the engine would need to navigate the XML tree following a depth-first traversal, passing nodes directly to output unless they are identified as template elements. If so, then they might be converted into plain text and processed by the template engine. Or, if it is possible to pass a token sequence to the template engine, then it may be unnecessary to incur the penalty of reconstituting the plain text representation only to be tokenized again. In either case, however, the DFS of the XML tree is the generalization of the familiar sequential scan of a character buffer. So the solution requires additional logic, but still, I think, follows an algorithm that can be characterized in a straightforward way following known patterns.
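
The traversal described above might look roughly like this. It is a sketch over a generic XML tree rather than real OOXML, and `render` and `interpolate` are hypothetical names. Every node is copied to the output as it is visited, and only text that matches the `$name$` pattern is rewritten.

```python
import re
import xml.etree.ElementTree as ET

VAR = re.compile(r"\$(\w+)\$")

def interpolate(text, variables):
    """Replace known $name$ variables; leave unknown ones as they are."""
    if text is None:
        return None
    return VAR.sub(lambda m: variables.get(m.group(1), m.group(0)), text)

def render(node, variables):
    """Depth-first copy of the tree, interpolating text along the way."""
    out = ET.Element(node.tag, dict(node.attrib))
    out.text = interpolate(node.text, variables)
    out.tail = interpolate(node.tail, variables)
    for child in node:
        out.append(render(child, variables))
    return out

template = ET.fromstring(
    "<doc><title>$title$</title>"
    "<p>By <b>$author$</b>, unchanged $missing$</p></doc>"
)
rendered = ET.tostring(
    render(template, {"title": "Notes", "author": "Ada"}), encoding="unicode"
)
print(rendered)
```

The sequential scan of a character buffer reappears here as the per-node call to `interpolate`; the recursion only decides the order in which buffers are visited.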

(Note: Ideally, when XML documents undergo template processing, the template elements are themselves XML elements. XML parsing occurs first, then the parse result as a single tree is processed by the template engine. Nodes are either of the type of the target document or of the type used by the template engine to determine the logic of the template. The template engine then views nodes as nodes, not as text sequences that happen to be resolved as nodes during final XML processing. The result is a new XML tree that can be serialized to the target file. Such an approach enables the template constructs to integrate smoothly with the logical structure of the target document. )

jgm commented 5 years ago

brainchild0 notifications@github.com writes:

Consider that the basic template engine will scan a sequence of text, copying the input to output, until it recognizes a character or character range that it can interpolate or otherwise process specially. Similarly, in the OOXML case, the engine would need to navigate the XML tree following a depth-first traversal, passing nodes directly to output unless they are identified as template elements.

Yes, but note that a single node may contain BOTH template control material and regular content. This makes things messier.

Is it possible? Of course. But messy.
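
To make the mixed case concrete, here is a sketch that splits the text of a single node into literal and variable tokens. The `$name$` syntax follows the thread; the function `tokenize` is a hypothetical helper.

```python
import re

TOKEN = re.compile(r"\$(\w+)\$")

def tokenize(text):
    """Split one node's text into ("text", ...) and ("var", ...) tokens."""
    tokens = []
    position = 0
    for match in TOKEN.finditer(text):
        if match.start() > position:
            tokens.append(("text", text[position:match.start()]))
        tokens.append(("var", match.group(1)))
        position = match.end()
    if position < len(text):
        tokens.append(("text", text[position:]))
    return tokens

tokens = tokenize("By $author$ on $date$.")
print(tokens)
```

So a node cannot simply be classified as "template" or "content"; its text has to be tokenized, and the result put back into the node with the original formatting.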

brainchild0 commented 5 years ago

Yes, I agree, but when a file is processed as a single sequence of characters, that sequence contains numerous separate control tokens as well as regular content. Tokenizing a character sequence is not the part that is new in the case we are discussing, but rather it is separating the tokenization of the control elements from the semantic processing of their functions and from the type of data elements on which those functions operate. While I also agree that redesigning an engine to move the semantic operations outside the syntactic parsing is no small task, I hesitate to say, though you know the deeper issues better than I, that "messy" necessarily describes the end result, which could rather be described as cleaner and more modular.

Eliminating the need for tokenization of the text of each node could be achieved by moving the control information to dedicated XML node types. Separating logical structure from serialized representation is a driving force behind XML, so it is no accident that many XML-based template languages (e.g. XSLT) represent if, for, case, and other block types as nodes, of types defined by the language, with the conditional or loop content carried in the children of these nodes, such that the XML parser, without even knowing it, effectively creates an abstract tree ready for semantic processing. Such a solution may be inappropriate for Pandoc (or may in fact be appropriate), but it is worth understanding how these issues have been addressed in other contexts.
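
A rough sketch of control structures as nodes. The vocabulary (`tpl-if`, `tpl-for`, and a `tpl-text` attribute) is invented for illustration and belongs neither to XSLT nor to Pandoc. The sketch also ignores mixed text around control nodes, which a real engine would have to handle.

```python
import xml.etree.ElementTree as ET

def expand(node, ctx):
    """Evaluate one template node and return the list of output nodes."""
    if node.tag == "tpl-if":
        children = list(node) if ctx.get(node.get("test")) else []
        return [out for child in children for out in expand(child, ctx)]
    if node.tag == "tpl-for":
        result = []
        for item in ctx.get(node.get("each"), []):
            scope = {**ctx, node.get("as"): item}  # bind the loop variable
            for child in node:
                result.extend(expand(child, scope))
        return result
    # Ordinary node: copy it, filling its text from the context if asked to.
    attrs = {k: v for k, v in node.attrib.items() if k != "tpl-text"}
    copy = ET.Element(node.tag, attrs)
    if "tpl-text" in node.attrib:
        copy.text = str(ctx.get(node.get("tpl-text"), ""))
    else:
        copy.text = node.text
    for child in node:
        copy.extend(expand(child, ctx))
    return [copy]

TEMPLATE = ET.fromstring(
    '<doc>'
    '<tpl-if test="title"><h1 tpl-text="title"/></tpl-if>'
    '<tpl-for each="authors" as="a"><p tpl-text="a"/></tpl-for>'
    '<tpl-if test="draft"><p>DRAFT</p></tpl-if>'
    '</doc>'
)
context = {"title": "Notes", "authors": ["Ada", "Grace"]}
output = ET.tostring(expand(TEMPLATE, context)[0], encoding="unicode")
print(output)
```

No text is tokenized at any point: the XML parser has already separated control structure from content, which is the property argued for above.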

Developing further on the thoughts that have occurred to me during this exchange, I am inclined to explain that while studying the Pandoc template system, alongside considering my own use cases, and realizing through investigation that many users have similar use cases, I begin to notice that a more modular approach to specifying, organizing, and processing templates might ultimately be what the community needs to utilize Pandoc to its full potential.

In the current mode of operation, output is generated from two sources of input: user-supplied content, including text and metadata, and a template particular to the output format. The template serves as a specification for how to build a document of the target type, so as a corollary, building such a template requires understanding of the document type. Users are free to use included templates, or, if they are inadequate, to build their own. But since included templates also tend to be limited in flexibility with respect to visual formatting considerations, users are faced with the dilemma of accepting the particular visual effect from the document, or engaging the details of the target document type to build a new template with the desired visual effect.

From a user standpoint, two separate concerns are conflated, one being the details of the file format, which is of little interest to the end user and is largely the reason for using a format conversion tool, and the other being the apparent visual formatting details, which are a central focus for most end users. This obstacle is made more profound by the observation that having success with particular visual effects in converting to a particular format gets the user no closer to having success with the same visual effects in a different file format. Even if an objective of Pandoc is to support arbitrary conversion between any two supported file types within a single application, the current situation often feels to the user like n^2 - n separate applications within one distribution, n being the number of supported formats, such that each possible ordered pairing of distinct formats corresponds to a different set of behaviors.

I think what might be considered, if it is possible to contemplate how the software might evolve in successive generations, is a multi-tiered template system. If the user could create a file format-agnostic representation of visual formatting, to be called a visual template, which could be supplied to the application along with input text and the name of the output format, and if the application could then find the representation of the file type within the distribution, to be called a file type template, to which could be applied the input text and visual template, then users might find much more power from the application while also having a better experience.

Indeed, reading the discussion you referenced earlier, the original poster had a similar need to mine. He wanted to build a DOCX file with specific header text, subject to variable interpolation, from a Markdown source. Ideally, he would be able to build a lightweight visual template that specified header text, and other visual details of the header such as size and whether to include it on the first page. Or he might simply want, as I did, to specify that the author name is preceded by the word "By". He could then convert to DOCX, ODT, or PDF/LaTeX with equal ease. Currently, achieving such an effect even with one target format at a time tends to be very difficult.

A close analogy might be modern compilers, which feature frontend extensions for language support, and backend modules for target architecture support, such that adding support for either a new language or new architecture involves adding only one new component. The total complexity for all combinations is the sum not the product of the number of languages and architectures.
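
The arithmetic behind the analogy, as a small sketch. The format list is arbitrary and chosen only for illustration.

```python
from itertools import permutations

formats = ["markdown", "docx", "odt", "latex", "html"]
n = len(formats)

# One behavior per ordered pair of distinct formats: n^2 - n.
ordered_pairs = list(permutations(formats, 2))
direct_converters = len(ordered_pairs)

# One frontend plus one backend per format: 2n.
hub_components = 2 * n

print(direct_converters, hub_components)
```

With five formats that is 20 pairwise behaviors against 10 components, and the gap widens quadratically as formats are added.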

Conversion will always be hard, and idiosyncratic behaviors, documented and undocumented, will always appear in particular cases. So the dream of effortless, lossless conversion between any two types will always remain elusive, but I wonder whether it is feasible to consider new options for moving forward.

jgm commented 5 years ago

brainchild0 notifications@github.com writes:

If the user could create a file format-agnostic representation of visual formatting, to be called a visual template, which could be supplied to the application along with input text and the name of the output format, and if the application could then find the representation of the file type within the distribution, to be called a file type template, to which could be applied the input text and visual template, then users might find much more power from the application while also having a better experience.

This would amount to a new pandoc-like system that tried to represent, not document structure, but visual formatting, in a neutral way that could be converted to various formats. ("I want the title in 16-pt font, small caps, with 24 pt space between it and the author; both centered; author preceded by 'By', with a box around the whole thing.") A huge task, given the many kinds of visual formatting distinctions people might want to make, and the huge variety of ways of achieving these in different target formats. If you want to design and code such a thing, more power to you, but I consider it out of scope for pandoc.

Indeed, reading the discussion you referenced earlier, the original poster had a similar need to mine. He wanted to build a DOCX file with specific header text, subject to variable interpolation, from a MarkDown source.

This can already be done, now that the metadata fields are available in docx as properties, which can be referred to as fields in a docx header.

Indeed, if we carried over not just the header and footer, but the body text from the reference.docx, then the same trick could be done there. And that would be much cleaner than implementing a template language.

The reason we don't carry over the body text from the reference.docx is that this tended to cause corruption for the reasons given above. But perhaps there is a way around this by carefully limiting what is brought over.

brainchild0 commented 5 years ago

This would amount to a new pandoc-like system that tried to represent, not document structure, but visual formatting, in a neutral way that could be converted to various formats. ("I want the title in 16-pt font, small caps, with 24 pt space between it and the author; both centered; author preceded by 'By', with a box around the whole thing.") A huge task, given the many kinds of visual formatting distinctions people might want to make, and the huge variety of ways of achieving these in different target formats. If you want to design and code such a thing, more power to you, but I consider it out of scope for pandoc.

Yes, I never meant to represent the suggestion as anything less than a huge task that might be considered for major design iterations in the future. I would suggest though that the high number of possible distinctions "people might want to make" ought not to be a deterrent. If a core template engine can be generalized even to provide support for a few visual features, with contributions to the file format templates from the community, then support can expand through the development of the templates with enhancements to the engine occurring as needed. Separating concerns with a pluggable architecture means that each pluggable contribution has more value and requires less breadth of expertise to create, which in turn means a much higher number of external contributions. I also acknowledge, as I stated, that not every possible invocation needs to work seamlessly. The target would be a way to give the user more control of visual format with less understanding of file format, not to guarantee that every conversion gives limitless flexibility without ever experiencing a hiccup.

Indeed, reading the discussion you referenced earlier, the original poster had a similar need to mine. He wanted to build a DOCX file with specific header text, subject to variable interpolation, from a MarkDown source. This can already be done, now that the metadata fields are available in docx as properties, which can be referred to as fields in a docx header. Indeed, if we carried over not just the header and footer, but the body text from the reference.docx, then the same trick could be done there. And that would be much cleaner than implementing a template language. The reason we don't carry over the body text from the reference.docx is that this tended to cause corruption for the reasons given above. But perhaps there is a way around this by carefully limiting what is brought over.

Great. Is there an example available for the docx properties included in the header?

jgm commented 5 years ago

Great. Is there an example available for the docx properties included in the header?

I don't have one; I don't use docx unless I have to. But see #3034 for example; there are comments there discussing such uses.

I take it that Word has a standard way to insert the contents of a custom property as a field in a header.
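
For readers looking for a starting point, here is a hedged sketch of the two pieces involved. It assumes the metadata ends up in `docProps/custom.xml` in the usual OOXML custom-properties shape, and that the header refers to it through a Word `DOCPROPERTY` field. The property name `department` and both helper functions are made up for illustration. What Pandoc actually writes should be checked against its documentation and against #3034.

```python
import xml.etree.ElementTree as ET

CP = "http://schemas.openxmlformats.org/officeDocument/2006/custom-properties"
VT = "http://schemas.openxmlformats.org/officeDocument/2006/docPropsVTypes"
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hand-written example of a custom-properties part with one string property.
CUSTOM_XML = (
    f'<Properties xmlns="{CP}" xmlns:vt="{VT}">'
    '<property fmtid="{D5CDD505-2E9C-101B-9397-08002B2CF9AE}" pid="2" '
    'name="department"><vt:lpwstr>Research</vt:lpwstr></property>'
    '</Properties>'
)

def custom_properties(xml_text: str) -> dict:
    """Collect the string-valued custom properties by name."""
    props = {}
    for prop in ET.fromstring(xml_text).iter(f"{{{CP}}}property"):
        value = prop.find(f"{{{VT}}}lpwstr")
        if value is not None:
            props[prop.get("name")] = value.text
    return props

def header_field(name: str) -> str:
    """A simple field that asks Word to display the named document property."""
    return (
        f'<w:fldSimple xmlns:w="{W}" w:instr=" DOCPROPERTY {name} ">'
        f'<w:r><w:t>{name}</w:t></w:r></w:fldSimple>'
    )

props = custom_properties(CUSTOM_XML)
field = header_field("department")
print(props)
print(field)
```

The text inside the field is only a cached placeholder; Word replaces it with the property value when the field is updated. In Word itself, a field of this kind is normally inserted through the field dialog in the header of the reference document rather than by editing XML.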

jgm commented 5 years ago

Note: a better forum for general discussion like this is the pandoc-discuss mailing list. We like to keep the bug tracker to specific bug or enhancement requests. So I'll close this.