jgm / pandoc

Universal markup converter
https://pandoc.org

Document with comment on Title puts comment in `<title>` when using standalone #9838

Closed: StephanMeijer closed this issue 3 weeks ago

StephanMeijer commented 4 weeks ago

Explain the problem.

A document with a comment on the title puts the comment's content into `$pagetitle$` and `$title$`.


Reproduction, example 1

Input document: input.docx

Command: pandoc -s --to=html --track-changes=all --template=tpl.html input.docx

Template (HTML) - `tpl.html`

```
$if(pagetitle)$
$endif$
$if(title)$
$endif$
```

Output (HTML)

```html
XxxxxxxxTitle of document" />
```

Reproduction, example 2

Input document: input.docx

Command: pandoc -s --to=html --track-changes=all ./input.docx

Output (HTML)

```html
XxxxxxxxTitle of document

XxxxxxxxTitle of document

Some text.

Heading of document

Some other text.

```

Pandoc version?

Issue on Pandoc 3.2 and Pandoc 3.1.8.

pandoc 3.2
Features: +server +lua
Scripting engine: Lua 5.4
User data directory: /Users/steve/.local/share/pandoc
Copyright (C) 2006-2024 John MacFarlane. Web: https://pandoc.org
This is free software; see the source for copying conditions. There is no
warranty, not even for merchantability or fitness for a particular purpose.

StephanMeijer commented 4 weeks ago

Same behaviour for $subtitle$.

jgm commented 3 weeks ago

We downshift inline content to plain text when generating pagetitle, using stringify. Since the comment is contained in a native Span, it will be converted. We could modify stringify to ignore Spans with class comment-start, I suppose, but I'm a bit reluctant to do that, because someone might use such a class for their own purposes, outside the context of docx.

Suggested workaround: specify pagetitle directly with -V.

StephanMeijer commented 3 weeks ago

> We downshift inline content to plain text when generating pagetitle, using stringify. Since the comment is contained in a native Span, it will be converted. We could modify stringify to ignore Spans with class comment-start, I suppose, but I'm a bit reluctant to do that, because someone might use such a class for their own purposes, outside the context of docx.
>
> Suggested workaround: specify pagetitle directly with -V.

Using the workaround would be impossible in our situation.

We do have a workaround ourselves (post-processing the HTML), but I could see this issue being potentially problematic for other users.

Would there be any long-term solution, such as calling functions in the template to filter out these things?

jgm commented 3 weeks ago

You can use a Lua filter to filter out Spans with comment-start class in the metadata, if you don't need them to be present in the HTML (but maybe you do).
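A minimal sketch of this idea, written as a Python JSON filter rather than the Lua filter suggested above. It assumes that comments arrive as Spans with class `comment-start` (named above) or `comment-end` (an assumption), and that pandoc's JSON AST encodes a Span as `{"t": "Span", "c": [[id, classes, attributes], inlines]}`. The function names are illustrative. Only the metadata is touched, so comments in the document body are kept.

```python
import json
import sys

# Classes assumed to mark docx comments (comment-end is an assumption).
COMMENT_CLASSES = {"comment-start", "comment-end"}


def is_comment_span(node):
    """Return True if node is a Span carrying one of the comment classes."""
    if not (isinstance(node, dict) and node.get("t") == "Span"):
        return False
    _identifier, classes, _attributes = node["c"][0]
    return bool(COMMENT_CLASSES.intersection(classes))


def strip_comment_spans(node):
    """Recursively copy a JSON AST fragment, dropping comment Spans from lists."""
    if isinstance(node, list):
        return [strip_comment_spans(child) for child in node
                if not is_comment_span(child)]
    if isinstance(node, dict):
        return {key: strip_comment_spans(value) for key, value in node.items()}
    return node


def filter_document(doc):
    """Return a copy of the document with comment Spans removed from metadata only."""
    result = dict(doc)
    result["meta"] = strip_comment_spans(doc.get("meta", {}))
    return result


def main():
    """Read a pandoc JSON document on stdin and write the filtered one to stdout."""
    json.dump(filter_document(json.load(sys.stdin)), sys.stdout)
```

To use it as a filter, add a call to `main()` at the end of the file and pass the script to pandoc with `--filter`. Because the title metadata itself is rewritten, the comment also disappears from the rendered title block, which may not suit every workflow.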

jgm commented 3 weeks ago

I don't understand "impossible," though. Surely it's possible. For example, you can render the document using a template that just contains $pagetitle$, but without --track-changes. Put this in an environment variable and then pass it into your next pandoc invocation (with --track-changes) as the value of -V pagetitle=$PAGETITLE$. That's all easily scriptable.

StephanMeijer commented 3 weeks ago

> You can use a Lua filter to filter out Spans with comment-start class in the metadata, if you don't need them to be present in the HTML (but maybe you do).

We do need to have them in the HTML. Today we switched from a meta value to the `<title>` tag, which allows us to handle this in our post-processor. Stripping the Spans with a Pandoc filter would indeed also be an option for other users (we are moving away from Pandoc filters for other reasons).

So we are set for now, we can work around it.

> I don't understand "impossible," though. Surely it's possible. For example, you can render the document using a template that just contains $pagetitle$, but without --track-changes. Put this in an environment variable and then pass it into your next pandoc invocation (with --track-changes) as the value of -V pagetitle=$PAGETITLE$. That's all easily scriptable.

That would require us to be aware of the title of the document beforehand, correct?

StephanMeijer commented 3 weeks ago

P.S.: I narrowed down the issue at hand, as this also occurs without templates when using standalone mode.

jgm commented 3 weeks ago

> That would require us to be aware of the title of the document beforehand, correct?

No, because you can extract it from the document by writing the document using -s --template=titleonly.html where titleonly.html is just $pagetitle$.
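The two-pass approach described above could be scripted roughly as follows. This is a sketch under stated assumptions, not a tested pipeline: the function names and the injectable `run` parameter are illustrative, and `pandoc` is assumed to be on the PATH. Only flags already used in this thread appear (`-s`, `--to=html`, `--template`, `--track-changes=all`, `-V`). It overrides `pagetitle` only; `$title$` and `$subtitle$` are not changed.

```python
import subprocess
import tempfile
from pathlib import Path


def extract_pagetitle(docx_path, run=subprocess.run):
    """First pass: render only $pagetitle$, without --track-changes."""
    with tempfile.TemporaryDirectory() as tmp:
        # Template that contains nothing but the pagetitle variable.
        template = Path(tmp) / "titleonly.html"
        template.write_text("$pagetitle$\n", encoding="utf-8")
        result = run(
            ["pandoc", "-s", "--to=html", f"--template={template}", str(docx_path)],
            capture_output=True, text=True, check=True,
        )
    return result.stdout.strip()


def render_with_comments(docx_path, pagetitle, run=subprocess.run):
    """Second pass: keep tracked changes, but set pagetitle explicitly with -V."""
    result = run(
        ["pandoc", "-s", "--to=html", "--track-changes=all",
         "-V", f"pagetitle={pagetitle}", str(docx_path)],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def convert(docx_path, run=subprocess.run):
    """Run both passes and return the HTML produced by the second one."""
    pagetitle = extract_pagetitle(docx_path, run)
    return render_with_comments(docx_path, pagetitle, run)
```

Calling `convert("input.docx")` would run both pandoc invocations in sequence. The `run` parameter exists only so the command lines can be inspected without invoking pandoc.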

StephanMeijer commented 3 weeks ago

> > That would require us to be aware of the title of the document beforehand, correct?
>
> No, because you can extract it from the document by writing the document using -s --template=titleonly.html where titleonly.html is just $pagetitle$.

I'm not sure how it would actually help us. But anyway, for now we are set. Thanks for your help!