Open amueller opened 8 years ago
+++ Andreas Mueller [Feb 15 16 11:50 ]:
It would be really really great if there was an option to parse html inside markdown. I (and many other people) use jupyter notebook to write complex documents. These are written in markdown, rendered in html and converted to other formats using pandoc.
However, when converting tables or image tags or any other html included in the markdown to, say, latex, all the formatting is lost. That makes it really hard to create complex documents, as markdown is not a very powerful layout language (there is no standard way to specify image sizes, tables can't have colspan or rowspan, etc.).
The difficulty is that in many of these cases, the Pandoc data structure (the intermediate format that pandoc uses in all its conversions) shares the expressive limitations of Markdown. (After all, it started as a way to represent Markdown.) It doesn't currently allow colspans or rowspans.
Note that there IS now a standard way in pandoc's Markdown to specify image sizes. (As of 1.16.x)
I'm having a hard time thinking of things that can be expressed in HTML and can be represented in the Pandoc AST, but can't be expressed in pandoc's Markdown.
but pandoc can convert html to latex, right? I was operating under the assumption that that would "work", i.e. rowspan and colspan would be preserved. But it looks like the table formatting is actually discarded. Is that right?
I saw the addition of the image size. That's great, though the trouble is that it is pandoc specific, so any other Markdown renderer will likely not interpret it correctly.
All conversions go:
source format -> Pandoc structure -> target format
This is how we can do hundreds of conversions without writing hundreds of converters -- we just need to write a parser from each format to a Pandoc structure, and a writer from a Pandoc structure to each format.
So all conversions are limited by the expressive limitations of the Pandoc structure. (See the beginning of the README where this is made explicit.)
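The hub-and-spoke design described above can be sketched in a few lines. This is only an illustration, not pandoc's actual code (pandoc is written in Haskell): the `READERS` and `WRITERS` tables, the toy document structure, and the `convert` function are all hypothetical names made up for this example.

```python
# Toy illustration of "source format -> intermediate structure -> target format".
# All names here are hypothetical and only mimic the idea, not pandoc's API.

def read_plain(text):
    # Intermediate structure: a list of paragraphs, each a list of words.
    return [para.split() for para in text.split("\n\n") if para.strip()]

def write_html(doc):
    return "\n".join("<p>" + " ".join(words) + "</p>" for words in doc)

def write_latex(doc):
    return "\n\n".join(" ".join(words) for words in doc)

READERS = {"plain": read_plain}
WRITERS = {"html": write_html, "latex": write_latex}

def convert(text, source, target):
    # One reader plus one writer per format covers every source/target pair.
    # Anything the intermediate structure cannot hold is lost at this point.
    return WRITERS[target](READERS[source](text))
```

In this toy version the intermediate structure only stores words, so details such as repeated spaces disappear in every conversion, in the same way colspans and rowspans disappear when the Pandoc structure has no place for them.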
Yeah that makes total sense. Are there any plans to enrich the pandoc internal format?
Yes, but every change is a lot of work, because all the readers and writers need to be updated (as well as just about every other part of the project).
With just a couple exceptions, we've limited the AST to what is representable in Pandoc's Markdown. Since we don't have a good way of doing colspans and rowspans in tables in Markdown, we haven't added that. But it might be worth doing so anyway.
Is this open somewhere else, or closed due to lack of interest and/or level of difficulty?
Is the general issue of parsing inline HTML not addressable? I've taken a stab at just passing RawInline blocks through pandoc --to json in a filter, and it seems to work for image tags, though I've no idea where it will fall down.
My super-simplistic filter:
#!/usr/bin/env python
"""
pandoc filter for handling inline HTML in markdown

Seems to work for simple image tags, at least
"""
import json

from pandocfilters import toJSONFilter
from nbconvert.utils.pandoc import pandoc


def inline_html(key, value, format, meta):
    # Only raw inline elements are re-parsed; returning None keeps everything else.
    if key != 'RawInline':
        return
    raw_format, raw_html = value
    # Run the raw text back through pandoc, using its own format as the reader.
    new_json = pandoc(raw_html, raw_format, 'json')
    new_data = json.loads(new_json)
    return new_data['blocks']


if __name__ == '__main__':
    toJSONFilter(inline_html)
This replaces any RawInline block with the result of passing it through pandoc with the associated reader. It seems to do what I want for simple cases (image tags), but the RawInline blocks are segmented in such a way that nontrivial bits of HTML (i.e. more than a single tag) will probably not be handled correctly.
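One likely place for this to fall down: the JSON that pandoc returns holds block elements (such as `Para` or `Plain`), while a `RawInline` sits in a list of inlines. A minimal sketch of unwrapping the inline content first, assuming the `{"t": ..., "c": ...}` JSON layout; `blocks_to_inlines` is a hypothetical helper, not part of pandocfilters:

```python
def blocks_to_inlines(blocks):
    """Collect the inline children of Para/Plain blocks; return None otherwise.

    Returning None from a pandocfilters action leaves the element unchanged,
    so anything that is not plain inline content stays as raw HTML.
    """
    inlines = []
    for block in blocks:
        if block.get("t") not in ("Para", "Plain"):
            return None
        inlines.extend(block["c"])
    # An empty result would delete the element, so keep the raw HTML instead.
    return inlines or None
```

In the filter above, `return blocks_to_inlines(new_data['blocks'])` would take the place of `return new_data['blocks']`.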
The problem isn't that this can't be done, technically.
It's more a question whether this would be desirable.
After all, in Markdown you're supposed to be able to pass through raw HTML. But the result of (writeHtml . readHtml) in pandoc might be different from the original. Pandoc can throw away information, e.g. attributes. So authors might find that they wrote one thing as raw HTML but got something else in the HTML output.
This is why it would be problematic to parse raw HTML inside the Markdown reader. (Note also that the reader doesn't know what the output format will be, and so it can't implement the rule "leave it alone if we're going to HTML.")
I suppose one could implement this as a filter run in pandoc.hs (for non-HTML output) or even in individual writers. However, some people take advantage of the present behavior (where raw HTML simply won't appear, e.g., in LaTeX output) by putting raw HTML and raw LaTeX side by side, knowing that only one will appear in the html or latex output.
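Unlike the reader, a filter does receive the output format, so the "leave it alone if we're going to HTML" rule could live there. A rough sketch, assuming the standard pandocfilters action signature; `make_action`, `HTML_TARGETS`, and the injected `reparse` callable are hypothetical names for illustration:

```python
# Output formats for which raw HTML should pass through untouched.
# This particular list is an assumption made for the example.
HTML_TARGETS = {"html", "html5", "revealjs", "epub"}

def make_action(reparse):
    """Build a pandocfilters action; `reparse` maps an HTML string to a list of inlines."""
    def action(key, value, fmt, meta):
        if key != "RawInline":
            return None
        raw_format, raw_text = value
        if raw_format != "html" or fmt in HTML_TARGETS:
            return None  # keep the raw element exactly as written
        return reparse(raw_text)
    return action
```

This still changes the behavior some authors rely on: raw HTML placed next to raw LaTeX would now also show up in LaTeX output, so it would have to be opt-in.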
I find just throwing away content of a supported input type pretty odd behavior. While it makes sense for html output, as you pointed out, the reader doesn't know the output and it doesn't make a lot of sense for any other output format.
Using the implicit dropping of some content seems like a pretty bad way to support different mime-types. I'm not sure if there's something in markdown that would enable declaring raw types like in ReST.
I really don't like this particular feature of Markdown, and I find RST better designed in this respect.
However, it is part of the Markdown syntax description that you can insert raw HTML, and it's passed through unchanged to the (HTML) target. So we have to respect that, I think.
Ran into this as well. Discovered that a <table> inside markdown is completely stripped of tags and therefore ends up looking very wrong (man target). I think it would be nice to be able to opt in to parsing the HTML.
I'll reopen this for further consideration.
I'd like this as well to support parsing inline HTML in Commonmark.
For instance, it would be great to be able to parse
inline <a href="path">**markdown**</a> link
as
inline [**markdown**](path) link
It'd also be great to have this work with --file-scope, such that running
pandoc -f commonmark+attributes --file-scope one.md two.md
with two.md an empty file (to force --file-scope to take effect) and one.md containing the following
# Header {#heading}
inline <div id="heading2"></div>
inline <a href="#heading">**markdown**</a> link
inline [**markdown**](#heading) link
results in
<div id="one.md">
<h1 id="one.md__heading">Header</h1>
<p>
inline <div id="one.md__heading2"></div>
inline <a href="#one.md__heading"><strong>markdown</strong></a> link
inline <a href="#one.md__heading"><strong>markdown</strong></a> link
</p>
</div>
instead of the current
<div id="one.md">
<h1 id="one.md__heading">Header</h1>
<p>
inline <div id="heading2"></div>
inline <a href="#heading"><strong>markdown</strong></a> link
inline <a href="#one.md__heading"><strong>markdown</strong></a> link
</p>
</div>
(where the ID of the raw <div> and the target of the raw <a> aren't updated)
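To make the requested rewriting concrete, here is a deliberately naive sketch of prefixing `id` attributes and `#fragment` links in a raw HTML string. It uses regular expressions rather than a real HTML parser and only handles double-quoted attributes; `prefix_ids` is a hypothetical helper, and pandoc does not expose such a function.

```python
import re

def prefix_ids(raw_html, prefix):
    """Prefix id="..." values and href="#..." targets with `prefix` plus "__"."""
    # The lookbehind avoids touching attributes such as data-id="...".
    out = re.sub(r'(?<![\w-])id="([^"]*)"',
                 lambda m: 'id="%s__%s"' % (prefix, m.group(1)), raw_html)
    # Only fragment links (starting with "#") are rewritten; other hrefs are kept.
    return re.sub(r'(?<![\w-])href="#([^"]*)"',
                  lambda m: 'href="#%s__%s"' % (prefix, m.group(1)), out)
```

Applied to the raw tags in one.md with the prefix `one.md`, this yields the `one.md__heading2` ID and `#one.md__heading` target from the desired output.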
It would be really really great if there was an option to parse html inside markdown. I (and many other people) use jupyter notebook to write complex documents. These are written in markdown, rendered in html and converted to other formats using pandoc.
However, when converting tables or image tags or any other html included in the markdown to, say, latex, all the formatting is lost. That makes it really hard to create complex documents, as markdown is not a very powerful layout language (there is no standard way to specify image sizes, tables can't have colspan or rowspan, etc.).
It would be great to be able to leverage html for that, in particular as pandoc already has html support.
Thanks!