jgm / pandoc

Universal markup converter
https://pandoc.org
Other

Comment AST element? #1926

Open jondo opened 9 years ago

jondo commented 9 years ago

I am using Pandoc 1.13.0.1 to convert Markdown to LaTeX. My Markdown documents contain comments of the form <!-- some comment -->, that are currently stripped by Pandoc.

It would be useful to create LaTeX comments instead: % some comment.

rgaiacs commented 9 years ago

Hi @jondo,

thanks for the feedback.

Pandoc's internal AST doesn't have a representation for comments. For example, in HTML

$ pandoc -f html -t json <<EOF
<!-- Foo -->
EOF
[{"unMeta":{}},[]]

and in LaTeX

$ pandoc -f latex -t json <<EOF
% Foo
EOF
[{"unMeta":{}},[]]

Because of this, there is no way to preserve comments right now.

jgm commented 9 years ago

Exactly. Comments are just represented as raw HTML, so they don't appear in non-HTML formats. In principle, pandoc could add a native Comment element, but this would be quite an involved change. (Every reader and writer would need to be modified.) And even if we did this, we'd get complaints if we made it the default to convert comments to the target format, since some people may be relying on this not happening.

mhkeller commented 9 years ago

I just came across this since I'm looking for similar functionality. To address one of your points, a flag like --preserve-comments could keep it hidden for most users. My use case is I want to convert markdown files to word docs so people can more easily share them among non-markdown users.

If at all possible, this would be a valuable feature.

ghost commented 9 years ago

I am also after preservation of comments across formats. My current interest is in having latex comments converted to comments in odt output. Issue #1561 raises the related point of annotations, but I'm not sure how that could be applied to tex formats.

philbarresi commented 9 years ago

As for:

In principle, pandoc could add a native Comment element, but this would be quite an involved change. (Every reader and writer would need to be modified.)

and:

To address one of your points, a flag like --preserve-comments could keep it hidden for most users

If someone can point me in the right direction, I can take a crack at these; comments would be a massive bonus for me, personally.

jgm commented 9 years ago

You could write a pandoc filter that passes your HTML comments through to latex as latex comments.

Something like

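import Text.Pandoc.JSON

-- Run the transformation as a pandoc JSON filter.
main :: IO ()
main = toJSONFilter commentHtmlToTeX

-- Rewrite raw HTML comment blocks as LaTeX comment lines; leave other blocks alone.
-- (Matches on String; newer pandoc-types releases use Text instead.)
commentHtmlToTeX :: Block -> Block
commentHtmlToTeX (RawBlock (Format "html") ('<':'!':'-':'-':xs)) =
  RawBlock (Format "latex")
     (unlines $ map ("% "++) $ lines $ take (length xs - 3) xs)
commentHtmlToTeX x = x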

This would allow you to pass through comments without any change in pandoc itself.
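The string handling in that filter assumes the raw block ends exactly in `-->`. As a minimal sketch (not part of pandoc), the conversion could be isolated into pure functions that check both delimiters first. The names `commentBody`, `toTeXComment`, and `htmlCommentToTeX` are hypothetical, and only `base` is assumed:

```haskell
import Data.Char (isSpace)
import Data.List (dropWhileEnd, isPrefixOf, isSuffixOf)

-- Strip leading and trailing whitespace.
trim :: String -> String
trim = dropWhileEnd isSpace . dropWhile isSpace

-- Return the text between "<!--" and "-->", or Nothing if the
-- (trimmed) input does not start and end with those delimiters.
-- Hypothetical helper, not a pandoc function.
commentBody :: String -> Maybe String
commentBody raw
  | "<!--" `isPrefixOf` s && "-->" `isSuffixOf` s && length s >= 7
      = Just (trim (take (length s - 7) (drop 4 s)))
  | otherwise = Nothing
  where
    s = trim raw

-- Prefix every line with "% " so LaTeX treats it as a comment.
toTeXComment :: String -> String
toTeXComment = unlines . map ("% " ++) . lines

-- Combine the two steps; Nothing means "leave the block untouched".
htmlCommentToTeX :: String -> Maybe String
htmlCommentToTeX = fmap toTeXComment . commentBody
```

In a filter, the `RawBlock (Format "html")` case would call `htmlCommentToTeX` on the raw text and fall back to the original block on `Nothing`. Note that this sketch does not detect several comments packed into one raw block.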

bamcdougall commented 9 years ago

JGM--That is a clever solution!

jrk commented 7 years ago

(Transferring/merging from #3187 as requested.)

I realize most uses of Pandoc are one-way and display-format-oriented, but it is such a rich transformation system that it can be very valuable to capture all available information where possible in readers, rather than dropping or flattening it during reading.

Pandoc readers already do this to a very large degree, and even for comments, the Markdown reader reads comments as raw HTML blocks which can be suppressed by default when going to other targets. The LaTeX reader, however, does not seem to have a way to preserve comments.

It would be useful for (in my case) LaTeX <==> (extended) Markdown round tripping to be able to capture comments when reading LaTeX. I'm glad to handle them in my own desired way using my own filter scripts, but even that is not currently possible since they're elided on reading. It's not obvious why they could not also be parsed as raw TeX strings, starting with %, just as Markdown/HTML comments are raw HTML nodes internally enclosed by the text <!---->.


In short, as a useful half-step still well short of supporting comments as a general new node type, it would be valuable for the LaTeX reader/writer, in particular, to support comments as Raw TeX blocks as Markdown/HTML do with raw HTML comments.

jgm commented 7 years ago

+++ Jonathan Ragan-Kelley [Dec 04 16 17:19 ]:

reading. It's not obvious why they could not also be parsed as raw TeX strings, starting with %, just as Markdown/HTML comments are raw HTML nodes internally enclosed by the text <!---->.

Unfortunately, that's not going to work by itself, because raw tex gets rendered in Markdown, where the % will be interpreted as a % sign.

The best we could do would be to have "comment" environments (\begin{comment}...\end{comment}) included as raw LaTeX (at least when --parse-raw is specified), instead of just omitted. I don't know if that would help for you.

ickc commented 7 years ago

Note that the comment environment requires the verbatim package which pandoc is not currently depending on.

jrk commented 7 years ago

For my uses, at least, I don't care that RawTeX passes through to Markdown, since I'm mostly interested in going this direction with my own filters in the pipeline. I'm sure I'm in the minority, but I think attention should be paid to uses of Pandoc as a semantic parsing and transformation engine, not just a black-box converter which must always give the desired output directly using only its own internal processes and defaults.

And unfortunately the comment environment isn't sufficient, since I am round-tripping with standard LaTeX written by others.

ickc commented 7 years ago

@jrk

... not just a black-box converter which must always give the desired output directly using only its own internal processes and defaults.

I don't entirely understand this statement, but it seems that "its own internal processes" refers to the pandoc AST, and "defaults" or being a "black-box converter" refers to customizability. The former is a design choice: however the AST is changed and improved, what pandoc can do will always be implied and limited by the AST. "Hacking" beyond what the AST allows is the job of pre/post-processors and filters. As for customizability, while pandoc already has a sea of command-line options (so it is neither a black box nor limited to "defaults"), there will always be situations where that customizability is not enough.

... since I am round-tripping with standard LaTeX written by others.

It sounds like you're using pandoc beyond what it is designed for. But I'm curious: have you round-tripped with pandoc successfully in some cases? You sound like you're already relying on this behavior from pandoc with success.

A quote from the manual:

Because pandoc’s intermediate representation of a document is less expressive than many of the formats it converts between, one should not expect perfect conversions between every format and every other. Pandoc attempts to preserve the structural elements of a document, but not formatting details such as margin size. And some document elements, such as complex tables, may not fit into pandoc’s simple document model. While conversions from pandoc’s Markdown to all formats aspire to be perfect, conversions from formats more expressive than pandoc’s Markdown can be expected to be lossy.

Going back to the comment issue, the "pandoc way" to do it is to make an "AST change" which defines a new comment element. (By the way, should this issue have the "AST Change" label?) So getting comments to work across formats is not unachievable (albeit difficult). But your expectation of pandoc in general (if I understand you correctly above) is a mission impossible. I have also pushed pandoc beyond what it is designed for, with success in some cases, but we're pretty much on our own (and pandoc-discuss) in this case.

jdittrich commented 6 years ago

I would find an AST representation of comments also very useful since comments are an element that is present in many text-based (e.g. HTML) and WYSIWYG (e.g. docx) formats.

We would need to consider if we would/could support anchor+selection style of comments (which is usually visible as highlighted text, e.g. in Word)

naught101 commented 6 years ago

I'd also love comments to be useful. I often use them to structure documents visually (e.g. paragraph titles in comments), and make it easier to read over. It would be useful to still have those when converting from markdown to latex, for instance.

Presumably having them in the AST would also make it easier to write filters that would e.g. convert HTML comments into word-doc comments, which would be useful when sharing with coauthors sometimes (I currently use the todonotes latex package, but that obviously doesn't work when converting to docx or similar).

fgasperij commented 4 years ago

You could write a pandoc filter [...]. This would allow you to pass through comments without any change in pandoc itself.

@jgm Could you please explain to me how that is a possible workaround if there's no AST representation for comments and the filters, AFAIU, work on Pandoc's AST?

My best guess is that, most of the time, there's a one-to-one match from RawBlocks to the elements of the input that have no corresponding output, such as HTML comments, since you have to identify them in order to exclude them. So a filter can detect them by checking their prefix. If this is the case, I think it's super useful to know, and I wonder why you chose not to include this fact in the docs (at least I wasn't able to find it).

jgm commented 4 years ago

The filter matches on a RawBlock and emits a RawBlock, yes.
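To make that flow concrete, here is a toy model using only `base`. The `Block` type, `render`, and `retag` below are simplified, hypothetical stand-ins, not the real pandoc-types definitions or pandoc's writers. The assumption being illustrated is that a writer emits a raw block only when its format matches the output format, so retagging the block is what lets the comment survive:

```haskell
-- Toy stand-in for pandoc's Block type (NOT the real pandoc-types API).
data Block
  = Para String
  | RawBlock String String   -- format name, raw text
  deriving (Eq, Show)

-- Toy writer: a raw block is kept only if its format matches the target.
render :: String -> [Block] -> [String]
render target = concatMap go
  where
    go (Para t) = [t]
    go (RawBlock fmt raw)
      | fmt == target = [raw]
      | otherwise     = []

-- Filter step: match a raw HTML block that starts with "<!--" and emit
-- a raw LaTeX block holding the same text as "% " comment lines.
retag :: Block -> Block
retag (RawBlock "html" ('<':'!':'-':'-':xs))
  | length xs >= 3 =
      RawBlock "latex"
        (unlines (map ("% " ++) (lines (take (length xs - 3) xs))))
retag b = b
```

With `doc = [Para "Hello", RawBlock "html" "<!-- note -->"]`, `render "latex" doc` drops the comment, while `render "latex" (map retag doc)` keeps it as a `%` line. No dedicated comment element is involved at any point.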