jgm / pandoc

Universal markup converter
https://pandoc.org

Enhancement: Hide metadata header in markdown #7183

Open sdbbs opened 3 years ago

sdbbs commented 3 years ago

I would like to propose, as an enhancement, the same approach taken in "Hide metadata header in markdown" (Issue #527, mwouts/jupytext):

Starting from Jupytext 1.6.0, the metadata header in Jupytext Markdown notebooks will look like this:

<!--
jupyter:
  jupytext:
    text_representation:
      extension: .md
      format_name: markdown
      format_version: '1.2'
      jupytext_version: 1.4.2
  kernelspec:
    display_name: Python 3
    language: python
    name: python3
-->

and thus the metadata will be hidden on GitHub and also when the .md file is rendered as HTML.

In other words: allow the opening (<!--) and closing (-->) tags of an HTML comment to be used in place of the default "three-dashes" (---) lines that mark the start and end of a YAML header block in Pandoc Markdown. That way, the header would still be interpreted by Pandoc while being fully hidden from typical automatic online Markdown-to-HTML parsers (such as GitHub's).

Alternatively, allow the very first line of a Pandoc Markdown document to open an HTML comment, with the starting --- of the YAML header on the second line of the Markdown text file. That way, most of the code that parses the YAML header could probably be kept (including the starting and stopping ---), while still hiding the YAML header from online Markdown parsers.
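To make the first proposal concrete, here is a minimal Python sketch of the rewrite it implies: a leading `---` header is re-delimited as an HTML comment. This is not pandoc functionality; the function name `wrap_header_in_comment` is hypothetical, and the sketch assumes the header sits at the very top of the file and is closed by `---` or `...`, the two closing lines pandoc accepts.

```python
def wrap_header_in_comment(text: str) -> str:
    """Rewrite a leading '---' ... '---' YAML block as '<!--' ... '-->'.

    Hypothetical helper for illustration only; pandoc has no such option.
    """
    lines = text.split("\n")
    if lines[0].strip() != "---":
        return text  # no YAML header at the top; leave the document alone
    for i in range(1, len(lines)):
        # Pandoc accepts either '---' or '...' as the closing line.
        if lines[i].strip() in ("---", "..."):
            body = lines[1:i]
            if any("-->" in line for line in body):
                # An HTML comment ends at the first '-->', so such a header
                # cannot be hidden this way without being altered.
                raise ValueError("header contains '-->'")
            return "\n".join(["<!--", *body, "-->", *lines[i + 1:]])
    return text  # unterminated header; leave unchanged


if __name__ == "__main__":
    print(wrap_header_in_comment("---\ntitle: Demo\n---\n\n# Heading\n"))
```

The `-->` check reflects one practical limit of the idea: any YAML value containing that sequence would close the comment early in other parsers.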

jgm commented 3 years ago

You can do -t gfm-yaml_metadata_block and the metadata block will be omitted.

jgm commented 3 years ago

Or, if you want the metadata in an HTML comment, here's another trick you can already do: create a template ipynb.markdown as follows:

<!--
$meta-json$
-->
$body$

Then

pandoc my.ipynb --template ipynb.markdown -t gfm-yaml_metadata_block
<!--
{"jupyter":{"jupytext":{"text_representation":{"format_version":"1.2","jupytext_version":"1.4.2","extension":".md","format_name":"markdown"}},"kernelspec":{"display_name":"Python 3","name":"python3","language":"python"}}}
-->
my doc

jgm commented 3 years ago

By the way, I kind of like the idea of putting metadata inside an HTML comment. I suggested exactly this in 2011 on the markdown-discuss mailing list.

In principle, we could create a new extension, yaml_metadata_in_html_comment, that enables this (for both input and output). But I'm reluctant to add to the gratuitous proliferation of syntax extensions.

sdbbs commented 3 years ago

Hi @jgm,

Many thanks for the feedback - and sorry I could not respond earlier!

You can do -t gfm-yaml_metadata_block and the metadata block will be omitted.

I was not aware of that option. However, I think it only helps if it is pandoc creating the HTML; what I want to do instead is use a Markdown file otherwise intended for pandoc in an automatic online Markdown->HTML parser, such as GitHub's.

Here is an example: I have an .md file that is intended as a source for pandoc, with the intended pandoc output being PDF via LaTeX. However, I also keep this file in git, and in my online repository, I use https://github.com/gitbucket/gitbucket as a web interface to my git repositories.

When I access GitBucket, and try to open this .md file, I get something like this:

[Screenshot: gitbucket_pandoc_md]

In other words, the Markdown-to-HTML parser of GitBucket did not recognize the YAML header block, and started interpreting everything inside it as Markdown. Specifically, I have a line in the header:

# lines starting with # are YAML-level comments!

... and indeed, pandoc interprets this fine as a comment inside the YAML header. However, GitBucket's Markdown parser interpreted it as plain Markdown, that is, it interpreted it as a heading.

So, if we could alternatively use say <!--- and ---> (note, three dashes!) as opening and closing of a YAML header block in a Markdown file in pandoc, then:

Or, if you want the metadata in an HTML comment, here's another trick you can already do: create a template ipynb.markdown as follows:

Thanks, but that seems to be specific to Jupyter notebooks. I haven't really tried it, but it does not look to me like it would help with my use case (I want to keep a YAML header block in the .md file, while hiding it from other Markdown parsers).

In principle, we could create a new extension, yaml_metadata_in_html_comment, that enables this (for both input and output). But I'm reluctant to add to the gratuitous proliferation of syntax extensions.

I guess that a new extension would help my use case personally - however, I see your point with "gratuitous proliferation", and I agree with it... So, maybe my suggestion above is worth considering:

... and all this "built-in" pandoc (i.e. without enabling an extension) -- and all other Markdown parsers would see an HTML comment here instead, and thus not process the text content of the YAML block.

alerque commented 3 years ago

With all due respect I think the onus should be on your other parser to support YAML meta data, not on Pandoc to hide it. If it doesn't need to do anything with it all they need to do is spot the standard YAML separators and discard the block. This is a very standard extension to Markdown and used by many many parsers. If you need to support something less featured then some kind of build step that exports the variant you need should be considered par for the course.
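As a rough illustration of the "spot the standard YAML separators and discard the block" step, or of a build step that exports a header-free variant, here is a minimal Python sketch. The name `strip_yaml_header` is hypothetical, and the sketch makes a simplifying assumption: the header starts on the first line with `---` and ends at the next `---` or `...` line. A real parser would need to handle more cases, for example a document that legitimately begins with a `---` thematic break.

```python
def strip_yaml_header(text: str) -> str:
    """Drop a leading YAML metadata block and return the remaining Markdown.

    Hypothetical helper for illustration; not taken from any real parser.
    """
    lines = text.split("\n")
    if lines[0].strip() != "---":
        return text  # nothing that looks like a header; keep everything
    for i in range(1, len(lines)):
        if lines[i].strip() in ("---", "..."):
            # Keep what follows the closing line, minus leading blank lines.
            return "\n".join(lines[i + 1:]).lstrip("\n")
    return text  # opening '---' without a closing line; keep everything


if __name__ == "__main__":
    source = "---\ntitle: Demo\n# a YAML comment\n---\n\n# Real heading\n"
    print(strip_yaml_header(source))
```

With this in place, a YAML-level comment line starting with `#` never reaches the Markdown renderer, so it cannot be mistaken for a heading.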

Hidden behind a non-default option flag I couldn't actually object to this being a "feature", but both the proliferation of options and the proliferation of format variants seems like a bad thing to me.

tarleb commented 3 years ago

Playing my broken "Lua filter" record again: if all else fails, here's a filter to make pandoc work with the syntax proposed by @sdbbs:

-- file: yaml-in-html-comments.lua
local meta

function RawBlock (raw)
  if raw.format == 'html' and raw.text:match '%<%!%-%-%-' then
    local yaml = raw.text:gsub('^<!%-%-%-', '---'):gsub('%-%-%->$', '---')
    meta = pandoc.read(yaml, 'markdown+yaml_metadata_block').meta
  end
end

-- set as document's metadata; could also do a merge instead (if necessary).
function Meta (_) return meta end

Use with pandoc --lua-filter=yaml-in-html-comments.lua ....
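For readers who would rather pre-process the file than use a filter, the same delimiter rewrite can be sketched outside pandoc. The Python below is an assumption-laden illustration, not part of pandoc or of the filter above: the name `comment_header_to_yaml` is hypothetical, and it only handles a `<!---` ... `--->` block at the very start of the text.

```python
import re


def comment_header_to_yaml(text: str) -> str:
    """Turn a leading '<!---' ... '--->' block into a '---' ... '---' block.

    Hypothetical pre-processing helper; mirrors the string rewrite done in
    the Lua filter, but as plain text manipulation before pandoc runs.
    """
    match = re.match(r"<!---\n(.*?)\n--->[ \t]*(\n|$)", text, flags=re.DOTALL)
    if match is None:
        return text  # no comment-delimited header at the top
    return "---\n" + match.group(1) + "\n---\n" + text[match.end():]


if __name__ == "__main__":
    print(comment_header_to_yaml("<!---\ntitle: Demo\n--->\n\nBody\n"))
```

The converted text could then be passed to pandoc on standard input, since pandoc reads from stdin when no input file is given.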

tarleb commented 3 years ago

I think the Lua filter solution should work well enough, so I'm closing this. Please reopen if the proposed solution proves to be insufficient.

jgm commented 3 years ago

I'd like to keep this open for further consideration.

sdbbs commented 3 years ago

Thanks all for the comments:

With all due respect I think the onus should be on your other parser to support YAML meta data, not on Pandoc to hide it.

Yes, I should have mentioned, that I didn't easily decide to post this, because it obviously would increase the work/support load on the pandoc project - which as a happy user otherwise, I'd like to avoid.

This is a very standard extension to Markdown and used by many many parsers.

OK, I was not aware of this, thanks for mentioning it.

However, GitBucket's parser at least does not support it (yet); and my thinking was: if other platforms advertise simply "Markdown", and I tried to ask them for this enhancement (i.e. add code in their parsers that would ignore YAML headers), they could always point to the original Markdown spec https://daringfireball.net/projects/markdown/ and say that there is no mention of --- or YAML headers there.

Hidden behind a non-default option flag I couldn't actually object to this being a "feature", but both the proliferation of options and the proliferation of format variants seems like a bad thing to me.

Fully agree there.

But now that I have seen the Lua filter in https://github.com/jgm/pandoc/issues/7183#issuecomment-821777277, I actually think I could live with it, since I use Lua filters in my workflow anyway; so I guess that particular Lua filter solves my problem.

alerque commented 3 years ago

These days the Common Mark project is a much better place to point projects toward if you want them to have interoperable Markdown than the original Daring Fireball post, but you do have a point: as widespread as YAML meta data is (used by many publishing platforms, static site generators, even Markdown note taking applications!) it is still an extension to Markdown, not part of Markdown itself. Even CommonMark thinks of it that way. The Pandoc flavor includes it by default, but having a way to wrap the extra data so that any CommonMark-compatible parser would not break would be an interesting extension.

sdbbs commented 3 years ago

Thanks, @alerque :

These days the Common Mark project is a much better place to point projects toward if you want them to have interoperable Markdown

Thanks, good to know this!

Btw, I just found something going against my suggestion of <!--- (triple dash) as alternative for opening tag for YAML:

https://stackoverflow.com/questions/4823468/comments-in-markdown

I use standard HTML tags, like

<!---
your comment goes here
and here
-->

Note the triple dash. The advantage is that it works with pandoc when generating TeX or HTML output. More information is available on the pandoc-discuss group.

I'm not sure if this is still applicable, though: I tried <!--- vs <!-- on multiline text (that is, text containing \n line breaks) in my doc in pandoc 2.13, and they both seemed to work fine. But in any case, there is a historical precedent of using <!--- for something else.

alerque commented 3 years ago

Triple dashes being treated differently was probably a bug. HTML comments are a nightmare to parse. Did you know -- is a field separator in comments? Yes, comments have fields. And they get parsed for other things too. Some browsers overload them, some servers use them as preprocessing hints, and so on. They are minefields. In any case I don't think triple dashes are a good way to overload comments.