jgm / pandoc

Universal markup converter
https://pandoc.org

Plain handled more like Para than as plain text in some output formats #7262

Open badumont opened 3 years ago

badumont commented 3 years ago

In order to set custom styles in ODT/Opendocument format, I tried to achieve the following output with a Lua filter:

<text:p text:style-name="CustomStyle">Some words.</text:p> 

I built a Plain object containing the desired tag as RawInlines like this:

function Para(para)
  para.c:insert(1, pandoc.RawInline('opendocument',
                 '<text:p text:style-name="CustomStyle">'))
  para.c:insert(pandoc.RawInline('opendocument', '</text:p>'))
  return pandoc.Plain(para.c)
end

But I end up with the following result:

pandoc -t opendocument -L test.lua <<< 'Some words.'

<text:p text:style-name="Text_20_body"><text:p text:style-name="CustomStyle">Some
words.</text:p></text:p>

So, although Plain is defined as "plain text, not a paragraph", the OpenDocument writer handles it like a Para object, which prevents wrapping its content inside arbitrary tags. The same applies to DOCX: after adapting the RawInlines in test.lua accordingly, we get:

    <w:p>
      <w:pPr>
        <w:pStyle w:val="FirstParagraph"/>
      </w:pPr>
      <w:p>
        <w:pPr>
          <w:pStyle w:val="CustomStyle"/>
        </w:pPr>
        <w:r>
          <w:t xml:space="preserve">Some words.</w:t>
        </w:r>
      </w:p>
    </w:p>

Is this on purpose? I acknowledge that the advantage of this behavior is that it prevents the creation of invalid output. But since, to my knowledge, Plain objects can only be created in filters, one could assume that users know why they use it instead of Para, and not outputting it as a fully formatted paragraph would allow for much greater flexibility.

Also, it seems to be inconsistent across formats. The following:

[Plain [RawInline (Format "tex") "{\\bf ",Str "Some",Space,Str "words.",RawInline (Format "tex") "}"]
,Plain [RawInline (Format "tex") "{\\bf ",Str "Others.",RawInline (Format "tex") "}"]]

renders in LaTeX output:

{\bf Some words.}

{\bf Others.}

and in ConTeXt:

{\bf Some words.}
{\bf Others.}

jgm commented 3 years ago

Well, Plain is a block-level element. So some kind of block-level tag is needed in openxml, or you just have invalid openxml.

Here two design goals collide: we want Plain to be something short of a paragraph, but we also want a Plain to render validly.

kjambunathan commented 3 years ago

Here two design goals collide: we want Plain to be something short of a paragraph, but we also want a Plain to render validly.

Well, there are two sorts of users:

  1. the regular users, those who type out their documents and then use Pandoc for export.
  2. there are library writers, for example Emacs Orgmode exporter

People in set (1) want a well-formed document with all the bells and whistles like styles.xml, meta.xml, etc.

People in set (2), the library writers, are interested in using Pandoc as a citation processor. These are precisely the people whom the very recent citation executable (culled out from pandoc-citeproc) targets. People in set (2) are interested in the citation aspects of Pandoc, and use it as a CLI tool or a library to generate text fragments (i.e., inlines as opposed to a paragraph).

I.e., people in set (2) are interested in well-formed inlines only and are NOT interested in well-formed documents.

badumont commented 3 years ago

Well, there are two sorts of users:

  1. the regular users, those who type out their documents and then use Pandoc for export.
  2. there are library writers, for example Emacs Orgmode exporter

Perhaps surprisingly, my request was aimed more at type 1 users, of whom I am one; more precisely, at those who use filters to extend Pandoc's export capabilities when exporting their documents. They want to end up with a well-formed document, but may need to build arbitrary block elements with raw code. That is what I wrongly expected Plain elements to be for: something like a mere list of inlines that one could wrap in whatever code one wants in the target format (like tags in OOXML with custom properties).

kjambunathan commented 3 years ago

So, although Plain is defined as "plain text, not a paragraph", the OpenDocument writer handles it like a Para object, which prevents wrapping its content inside arbitrary tags.

Also, it seems to be inconsistent across formats.

Why not introduce a new value constructor Plain' (Plain Prime) that achieves the desired result? This new value constructor could be secret and undocumented.

jgm commented 3 years ago

Another solution to this general problem would be to provide functions like writeInlineOpenXML or writeInlineOpenDocument with signature

:: PandocMonad m => WriterOptions -> [Inline] -> m Text

badumont commented 3 years ago

On Tuesday 18 May 2021 at 08:43:55AM, John MacFarlane wrote:

Another solution to this general problem would be to provide functions like writeInlineOpenXML or writeInlineOpenDocument with signature

:: PandocMonad m => WriterOptions -> [Inline] -> m Text

So that one could do something like this?

pandoc.RawBlock('openxml', '<w:p>' .. writeInlineOpenXML(inlines) .. '</w:p>')

If I understand correctly, that would be great, indeed!

jgm commented 3 years ago

Yes that's the idea.

jgm commented 3 years ago

We would only need these functions in a few special cases where the current rendering of Plain has to include p tags (opendocument, openxml, others?).

badumont commented 3 years ago

Theoretically in all XML/SGML formats, I guess.

badumont commented 3 years ago

The more I think about it, the more I see how much power such writeInline<Something> functions would give. They would make it possible to manipulate the resulting string, for instance to change or add attributes. This would really help to extend Pandoc's capabilities with XML-based formats.
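As a rough illustration of that kind of string manipulation, here is a minimal sketch in plain Lua. It assumes the rendered inlines are already available as a string; the `rendered` value and the `set_style` helper are made up for this example, since no writeInline* function is exposed to filters.

```lua
-- Hypothetical stand-in for what a writeInline*-style function might return.
local rendered = '<text:span text:style-name="T1">Some words.</text:span>'

-- Replace the value of the first text:style-name attribute in an XML string.
-- "%-" escapes the hyphen, which is a quantifier in Lua patterns.
-- Assumes `style` contains no "%" character (special in gsub replacements).
local function set_style(xml, style)
  local replaced = xml:gsub('(text:style%-name=")[^"]*(")',
                            '%1' .. style .. '%2', 1)
  return replaced
end

print(set_style(rendered, 'CustomStyle'))
```

The result could then be wrapped in a block-level tag and passed to pandoc.RawBlock.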

jgm commented 3 years ago

The current DocBook writer shows another way we might go: it does render Plain as just a sequence of inlines.

% pandoc -f native -t docbook
Plain [Str "hi"]
hi

It avoids generating invalid XML by using a plainToPara function to convert Plain to Para in lists and Divs (which are the contexts in which Plain usually appears in content parsed by the markdown reader). This means, though, that invalid XML could be produced from manually constructed Pandoc structures, so it's not absolutely reliable.
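The promotion step described above can be sketched as a Lua filter. This only illustrates the transformation and is not the DocBook writer's Haskell code; the names plain_to_para and promote_items are chosen for this example, and the sketch covers only Divs and lists.

```lua
-- Turn a Plain block into a Para; leave every other block untouched.
local function plain_to_para(block)
  if block.t == 'Plain' then
    return pandoc.Para(block.content)
  end
  return block
end

-- Promote the direct children of a Div.
function Div(div)
  div.content = div.content:map(plain_to_para)
  return div
end

-- Promote the blocks of every item in a bullet or ordered list.
local function promote_items(list)
  list.content = list.content:map(function(item)
    return item:map(plain_to_para)
  end)
  return list
end

BulletList = promote_items
OrderedList = promote_items
```

A Plain that appears outside these contexts is left alone, which is why manually constructed documents can still yield invalid XML.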

One possibility would be to change all the XML-based writers so they work this way:

This would avoid the need to export a new function, though the behavior is a bit complex.

badumont commented 3 years ago

It seems to me that this wouldn't allow building arbitrary blocks (i.e. XML elements) around the Plain's content. Or maybe you could replace "if standalone is false" by "if there are no surrounding blocks or two surrounding RawBlocks or standalone is false"?

I also found the "writeInline" way satisfying on a conceptual level, since we would have an inline inside a RawBlock element, and not three RawBlocks building one XML element. But I fully understand that implementing every great idea one can think of would turn the developers' work and Pandoc itself into a nightmare...

jgm commented 3 years ago

If you're using this function in a program, then I don't really see the difference between

   inlinedoc <- writeInlineOpenDocument opts inlines

and

  inlinedoc <- writeOpenDocument opts (Pandoc nullMeta [Plain inlines])

They do the same work, no? And if you're not using this in a program, then how exactly would you be taking advantage of writeInlineOpenDocument?

badumont commented 3 years ago

So, from a filter, it would be possible to make a system call (for instance through pandoc.pipe) in order to pass this code to GHC using Pandoc's API? If so, I don't have any objection.

jgm commented 3 years ago

Right now filters have access to read but not write. It looks like what you want to do is to insert a Plain element into the AST and have it render as plain inlines; that's not something the writeInline* functions would allow you to do.

I'd like to think more about the writers that currently render Plain with paragraph tags, and see if we can't come up with an alternative approach that will be compatible with the sort of thing you're trying to do.

badumont commented 3 years ago

Thank you!

jgm commented 3 years ago

Some notes on the current treatment of Plain in XML-based writers:

So the question is whether we could move to a model like DocBook's for the others. We'd have to be very sure that we do the plainToPara transformation in every context where a Plain might be generated by our readers.

The test suite shows Plain occurring in

Also it appears as the result of parsing HTML without surrounding <p> tags (e.g. command test 4877). This could appear just about anywhere, e.g.

% pandoc -f html -t native
<blockquote>hi</blockquote>
[BlockQuote
 [Plain [Str "hi"]]]

Test case 3510 involves org:

% pandoc -f org -t native
Text

#+include: "command/3510-subdoc.org"

#+INCLUDE: "command/3510-src.hs" src haskell
#+INCLUDE: "command/3510-export.latex" export latex

More text
^D
[Para [Str "Text"]
,Header 1 ("subsection",[],[]) [Str "Subsection"]
,Para [Str "Included",Space,Str "text"]
,Plain [Str "Lorem",Space,Str "ipsum."]
,CodeBlock ("",["haskell"],[]) "putStrLn outString\n"
,RawBlock (Format "latex") "\\emph{Hello}"
,Para [Str "More",Space,Str "text"]]

cat test/command/yaml-with-chomp.md

% pandoc -s -t native
---
ml: |-
    TEST

    BLOCK
...
^D
Pandoc (Meta {unMeta = fromList [("ml",MetaBlocks [Para [Str "TEST"],Plain [Str "BLOCK"]])]})
[]

I'm not sure I see a good way to separate the Plains that will need to have paragraph tags added to produce valid HTML and the ones that won't.

bpj commented 2 years ago

@jgm so the main difference between Para and Plain is that Plain avoids added whitespace in lists and tables? Is that true of TeX output formats as well? In particular, will there be whitespace between a RawBlock and a Plain in LaTeX output?

jgm commented 2 years ago

@bpj, you can test it yourself:

 % pandoc -t latex -f native
[Plain [Str "hi"], Plain [Str "hi"]]
hi

hi

The LaTeX writer is a bit different from others; it always inserts blank lines between block-level elements. This is sometimes undesirable, I know (#7111).

jdutant commented 2 years ago

I faced the same issue writing a Lua filter for JATS output. The new pandoc.write function helps a lot, but doesn't cover all use cases.

I want to convert some native Divs into JATS statement elements:

<statement>
<label> inlines </label>
<title> inlines </title>
</statement>

Note that the label and title elements can't be wrapped within <p> tags and cannot contain <p> tags. The following Lua code:

function Pandoc(doc)
  -- pandoc.List:new expects a table of elements
  local inlines = pandoc.List:new{pandoc.Str('Some label text')}
  inlines:insert(1, pandoc.RawInline('jats', '<label>'))
  inlines:insert(pandoc.RawInline('jats', '</label>'))
  doc.blocks:insert(pandoc.Plain(inlines))
  return doc
end

Generates:

<p><label>Some label text</label></p>

A label or title may contain special elements, e.g. citations, so it shouldn't simply be stringified and inserted as a RawBlock. A better approach is to use pandoc.write. We need to pass to pandoc.write at least the original cite method and the document's metadata (for bibliography info and perhaps other settings). But we shouldn't pass all of PANDOC_WRITER_OPTIONS because (in Pandoc v2.17 and 2.18 at least) this will generate a full standalone output if Pandoc was called in standalone mode.

-- assuming doc.meta contains the document metadata
-- and label_inlines contains the label's inlines
function write_to_jats(inlines)
    local result, mini_doc
    local options = pandoc.WriterOptions({
                cite_method = PANDOC_WRITER_OPTIONS.cite_method
        })
    mini_doc = pandoc.Pandoc(pandoc.Plain(inlines), doc.meta)
    result = pandoc.write(mini_doc, 'jats', options)
    return result:match('^<p>(.*)</p>$') or result or '' -- safely remove <p> tags
end

doc.blocks:insert(pandoc.RawBlock('jats', '<label>'..write_to_jats(label_inlines)..'</label>'))

There's still a limitation, however: if the inlines needed to be processed by another filter down the line (in my use case, pandoc-crossref), they're lost.

jgm commented 2 years ago

I see the problem -- but JATS doesn't have a block-level container corresponding to Plain, so we have to treat it like Para or we'll get invalid JATS in other contexts.