Handle raw html tags in markdown during conversion to latex

jakobgager commented 11 years ago

Pandoc strips raw html tags present in markdown cell during the conversion to latex, see http://johnmacfarlane.net/pandoc/README.html#raw-html .

This has been addressed here: ipython/nbconvert#183

minrk commented 11 years ago

nbconvert has landed, want to make the new PR to continue the conversation?

jakobgager commented 11 years ago

Yeah, I follow the ongoing work. Great team!

I'll try to find some time this evening to do the PR. But I'm not sure if this is to be included in 1.0 anyway. Let's see.

Update: Oh, I just saw it is already assigned to 2.0! Fine then!

jakobgager commented 11 years ago

Continuation from #3570:

Summary of chat with @minrk:

Wait for input of @jdfreder, to have an idea on which kind of html tags are to be handled
Get pandoc to handle math in html inputs correctly to enable a markdown2html2latex conversion
Better communicate the "new" marked markdown syntax (docs and example notebooks) - includes tables!

Optional workaround (not preferred):

Similar to how it is currently done with js: extract_math | md2html | html2latex | insert_math

Note: Many html attributes will not be handled correctly (whichever approach we finally use), including width and height, as these are not captured by pandoc.
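
The workaround chain above can be sketched in Python roughly as follows. This is only an illustration: the function names `extract_math` and `insert_math` mirror the stage names in the chain, and the `@@MATHn@@` placeholder format is an assumption, not something taken from the notebook's js code. The md2html and html2latex stages would be the pandoc calls and are omitted here.

```python
import re

# Display math is listed first so that "$$...$$" is not read as two inline spans.
# Assumption: escaped dollars (\$) do not occur inside the text being processed.
_MATH_RE = re.compile(r"\$\$.+?\$\$|\$.+?\$", re.DOTALL)


def extract_math(text):
    """Replace every math span with a placeholder and return (text, stash)."""
    stash = []

    def _stash(match):
        stash.append(match.group(0))
        return "@@MATH%d@@" % (len(stash) - 1)

    return _MATH_RE.sub(_stash, text), stash


def insert_math(text, stash):
    """Put the stashed math spans back in place of their placeholders."""
    return re.sub(r"@@MATH(\d+)@@", lambda m: stash[int(m.group(1))], text)
```

The placeholders contain no markup, so they should survive the conversions in between untouched; whether that holds for every pandoc version has not been checked here.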

jdfreder commented 11 years ago

Re-post from PR:

Sorry @jakobgager , this fell off my radar somehow. Off the top of my head I was having trouble with img and table html tags not being parsed.

jdfreder commented 11 years ago

When, in the future, a PR is opened to implement something like this, we will need to remember that #4090 will have to act prior to the markdown filter.

jakobgager commented 11 years ago

I recently discovered the filter option of pandoc (it will be a feature of 1.12), but a workaround is shown in "Scripting with pandoc". Repost from jgm/pandoc#954:

I slightly modified the JSON walking algorithm in jgm/pandoc-filters-python/pandoc.py, see https://gist.github.com/jakobgager/2fa8d416f70155287c00 As I don't know the JSON format supplied by pandoc, I'm not sure whether this will cause trouble! For my simple example (see below) it works with versions 1.9.4.1 and 1.10.1.

With this approach and the filter https://gist.github.com/jakobgager/449484d26c3149edcf6b the input

echo 'foo $\alpha$ <i>bar</i>' | pandoc -f markdown -t html -m | pandoc -f html -t json | python conv2Math.py | pandoc -f json -t latex

leads to

foo $\alpha$ \emph{bar}

(it might be noted that the \ in the first echo call has to be escaped in csh but not in bash) It also works with more complex examples like math and markdown in html tables.

@minrk do you think we could use something like this?

jakobgager commented 11 years ago

One issue with the above approach is that this way you lose raw latex commands. But as the notebook renders raw html nicely and raw latex not at all, I guess that this drawback is not that pronounced.

ellisonbg commented 11 years ago

Raw latex tags should pass through in markdown cells and not be removed.

jakobgager commented 11 years ago

Well currently, raw html tags are removed which IMHO is worse than removing raw latex! Unfortunately pandoc is a bit picky about such raw inclusions.

jakobgager commented 11 years ago

Ah, I see your concern is related to your citation PR #4090.

jakobgager commented 11 years ago

Repost from jgm/pandoc#954: OK, after digging around I found a 2-step solution which seems to work for my test cases and pandoc 1.10.1. This way raw latex and math are preserved and html tags are converted. It should work fine with citations now.

The basic call is as follows:

pandoc -f markdown -t json text.mkd | ./concealLatex.py | pandoc -f json -t html -m | pandoc -f html -t json | ./revealLatex.py | pandoc -f json -t latex

Quite complex, isn't it :smile:

The required filters are here: https://gist.github.com/jakobgager/6473761
They require my pandoc_2.py: https://gist.github.com/jakobgager/2fa8d416f70155287c00
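
If the chain were driven from Python instead of a shell, it might look like the sketch below. `run_pipeline` and `PANDOC_STAGES` are hypothetical names introduced for illustration; the stage arguments are copied from the call above, except that the markdown is assumed to arrive on stdin rather than from text.mkd.

```python
import subprocess


def run_pipeline(stages, text):
    """Feed `text` through each command in `stages`, like a shell pipe."""
    data = text
    for argv in stages:
        # check=True raises CalledProcessError if a stage exits non-zero,
        # so a failing filter does not silently pass empty output along.
        result = subprocess.run(
            argv, input=data, capture_output=True, text=True, check=True
        )
        data = result.stdout
    return data


# The six stages of the call above. They are not executed here; running
# them needs pandoc and the two gist filters in the working directory.
PANDOC_STAGES = [
    ["pandoc", "-f", "markdown", "-t", "json"],
    ["./concealLatex.py"],
    ["pandoc", "-f", "json", "-t", "html", "-m"],
    ["pandoc", "-f", "html", "-t", "json"],
    ["./revealLatex.py"],
    ["pandoc", "-f", "json", "-t", "latex"],
]
```

Something like `run_pipeline(PANDOC_STAGES, markdown_source)` would then return the latex, and a failure in any single stage would surface as an exception naming that stage.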

ellisonbg commented 11 years ago

Not sure how I feel about this level of complexity. It makes me think that it will be very fragile and that we don't have a good handle on exactly what and how we are parsing. Unfortunately I am mostly offline this week so can't look at it further. @jdfreder, what do you think?

Cheers,

Brian

jdfreder commented 11 years ago

@jakobgager , correct me if I'm wrong, right now this does this:

pandoc -f markdown -t json text.mkd: markdown to json
./concealLatex.py: convert all tex in the json to plain strings
pandoc -f json -t html -m: convert the json (markdown without tex) to html
pandoc -f html -t json: convert the results back to json. At this point all the original markdown and html is html.
./revealLatex.py: convert all the plain text that looks like it was latex originally back into tex
pandoc -f json -t latex: convert the json (html and latex) into latex

jakobgager commented 11 years ago

@ellisonbg I totally agree with your feelings! This is currently way too complicated; however, I haven't found a better solution yet. The best way, for sure, would be to learn Haskell and try to get pandoc to NOT strip raw html :smirk:.

takluyver commented 11 years ago

So you're saying: Learn You a Haskell for Great Good? ;-)

jakobgager commented 11 years ago

@jdfreder you are right, that's what pandoc is supposed to do. Some comments:

The first markdown2json basically reads all information (including raw tex, raw html, equations, ...), so no stripping occurs at this point.
If I omit the conceal, raw latex gets stripped when converting json2html (of course raw html is preserved). As you mentioned, at this point everything is html.
The third html2json creates a dumb json (no more raw and math tags) but with tagged latex ($$$) and tagged math ($$). Finally, the reveal script converts these tags back to respective json tags. The json can be converted to latex.
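
The conceal/reveal idea described above can be illustrated with a small sketch. This is not the code from the gist: it assumes the pandoc 1.10-style JSON shown elsewhere in this thread (`{'RawInline': [format, text]}` and `{'Str': text}`), handles only raw latex (not math), and the function names are made up for the example.

```python
TAG = '$$$'  # marker for concealed raw latex, as described above


def conceal(node):
    """Recursively replace RawInline latex nodes with tagged Str nodes."""
    if isinstance(node, list):
        return [conceal(child) for child in node]
    if isinstance(node, dict):
        raw = node.get('RawInline')
        if isinstance(raw, list) and len(raw) == 2 and raw[0] == 'latex':
            return {'Str': TAG + raw[1] + TAG}
        return {key: conceal(value) for key, value in node.items()}
    return node


def reveal(node):
    """Recursively turn tagged Str nodes back into RawInline latex nodes."""
    if isinstance(node, list):
        return [reveal(child) for child in node]
    if isinstance(node, dict):
        text = node.get('Str')
        if (isinstance(text, str) and len(text) > 2 * len(TAG)
                and text.startswith(TAG) and text.endswith(TAG)):
            return {'RawInline': ['latex', text[len(TAG):-len(TAG)]]}
        return {key: reveal(value) for key, value in node.items()}
    return node
```

In the real chain the tagged string passes through html between the two steps, so a latex command containing spaces would probably come back split over several Str nodes; this sketch does not deal with that case.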

jakobgager commented 11 years ago

So you're saying: Learn You a Haskell for Great Good? ;-)

To be honest, I don't understand the real meaning of this sentence but the book and website are really great! And Haskell is weird but somehow compelling!

jdfreder commented 11 years ago

@ellisonbg and @jakobgager ,

I looked into this in great detail (I have a large nb filled with different attempts to get this right).

The good

I've found that this can actually be accomplished in 3 calls to pandoc (instead of 4) and 1 python file (instead of 2):

markdown to html
html to json
python: fix json raw math tags
json to latex

The bad

Any way we approach it (from what I found), we are going to need to parse some of it ourselves using the pandoc JSON output or implement some sort of extension for pandoc in Haskell. I also tried many of the relevant markdown extensions listed on pandoc's website.

The ugly

Using this input

**foo** $\left( \sum_{k=1}^n a_k b_k \right)^2 \leq $ <b>b\$ar</b> $$test$$

And this Python code

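import json

# `run` is a small helper that executes a shell command and returns its
# stdout; it is defined in a later comment in this thread.
a = json.loads(run("pandoc -f markdown -t html test.md --gladtex | pandoc -f html -t json -R"))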

def traverse_json(action, json_object):
    action(json_object)
    if isinstance(json_object, dict):
        for key, value in json_object.items():
            traverse_json(action, value)
    elif isinstance(json_object, list):
        for value in json_object:
            traverse_json(action, value)

def fix_raw_math(items, html_math_type, latex_math_type):
    raw_math = -1
    raw_maths = []
    for index, item in enumerate(items):

        # Check if this is a RawInline item.
        if isinstance(item, dict) and 'RawInline' in item:
            raw_inline = item['RawInline']

            # Only read the RawInline if it is HTML.
            if raw_inline[0]=="html":
                html = raw_inline[1]
                if html == '<eq env="%s">' % html_math_type:
                    raw_math = index
                elif html == '</eq>':
                    if raw_math >= 0:
                        raw_maths.append((raw_math, index))
                    raw_math = -1

    if len(raw_maths) > 0:
        for raw_math in raw_maths[::-1]:

            # Read math
            latex = ''
            for index in range(raw_math[0] + 1, raw_math[1]):
                if isinstance(items[index], dict) and 'Str' in items[index]:
                    latex += items[index]['Str']
                elif items[index]=="Space":
                    latex += ' '

            # Remove math
            for index in range(raw_math[0], raw_math[1]+1)[::-1]:
                del items[index]

            # Re-add math
            items.insert(raw_math[0], {'Math': [latex_math_type, latex]})

def inline_math_action(value):
    if isinstance(value, list):
        fix_raw_math(value, 'math', 'InlineMath')
        fix_raw_math(value, 'displaymath', 'DisplayMath')

from IPython.nbconvert.utils.pandoc import pandoc

traverse_json(inline_math_action, a)
print(pandoc(json.dumps(a), 'json', 'latex'))

I was able to get the correct output

\textbf{foo} $\left( \sum_{k=1}^n a_k b_k \right)^2 \leq $
\textbf{b\$ar} \[test\]

jdfreder commented 11 years ago

A little more detail, running :

pandoc -f markdown -t html test.md --gladtex | pandoc -f html -t json -R

Produces this output:

[{'docAuthors': [], 'docDate': [], 'docTitle': []},
 [{'Para': [{'Strong': [{'Str': 'foo'}]},
    'Space',
    {'RawInline': ['html', '<eq env="math">']},
    {'Str': '\\left('},
    'Space',
    {'Str': '\\sum_{k=1}^n'},
    'Space',
    {'Str': 'a_k'},
    'Space',
    {'Str': 'b_k'},
    'Space',
    {'Str': '\\right)^2'},
    'Space',
    {'Str': '\\leq'},
    'Space',
    {'RawInline': ['html', '</eq>']},
    'Space',
    {'Strong': [{'Str': 'b$ar'}]},
    'Space',
    {'RawInline': ['html', '<eq env="displaymath">']},
    {'Str': 'test'},
    {'RawInline': ['html', '</eq>']}]}]]

Running the Python logic above just recognizes the RawInline HTML that pandoc adds and puts the code in the correct JSON blocks, producing this:

[{'docAuthors': [], 'docDate': [], 'docTitle': []},
 [{'Para': [{'Strong': [{'Str': 'foo'}]},
    'Space',
    {'Math': ['InlineMath',
      '\\left( \\sum_{k=1}^n a_k b_k \\right)^2 \\leq ']},
    'Space',
    {'Strong': [{'Str': 'b$ar'}]},
    'Space',
    {'Math': ['DisplayMath', 'test']}]}]]

Which pandoc can convert easily into the correct latex.

jakobgager commented 11 years ago

Great that you had a look at this too! Some comments:

Your approach highly resembles my first idea (see above) - the problem is, that the first pandoc -f markdown -t html test.md --gladtex call strips all raw latex (you haven't added any in your test case) - hence the citation PR won't work anymore. That's the reason I started with a markdown2json call.
I'm against using the -R parameter, since this introduces a high risk of getting some untranslatable html into the final latex file.

btw. what is the run method in your script?

jdfreder commented 11 years ago

I'm pretty sure the citation PR does not use raw latex, that was something we talked about at the last Google hangout meeting (or the meeting before). See citation example - https://github.com/ipython/nbconvert-examples/blob/master/citations/Tools%20for%20the%20lifecycle%20of%20computational%20research.ipynb#L47

Also, the -R only creates RawInline elements for things that can't be converted to JSON. During the stage of JSON to LaTeX, all RawInline items in the JSON get dropped religiously (since I don't use -R there). The only ones that get acknowledged are those who verbatim match what the Py script looks for.

jdfreder commented 11 years ago

btw. what is the run method in your script?

Ah, I was wondering if someone would ask about that :smile:

from IPython.utils.process import get_output_error_code
def run(command):
    stdout, stderr, retcode = get_output_error_code(command)
    return stdout

jdfreder commented 11 years ago

Your approach highly resembles my first idea (see above) - the problem is, that the first

Sorry for the confusion @jakobgager . I thought that the only latex we supported was latex embedded in $ $ and $$ $$ environments. @ellisonbg if we want to support raw latex outside of $ $ and $$ $$ environments, I think @jakobgager 's second idea is the best way to do it now. Otherwise, his first idea or the method I posted (almost the same) would be better.

jakobgager commented 11 years ago

all RawInline items in the JSON get dropped

You are right, I missed that. I think the only raw latex we currently use is \cite, hence, a possible simplification would be to change the citation PR in a way, that the cite command does not get embedded as latex but as some tagged string. This can pass the conversion md2html2json unchanged and in the python script it gets converted to latex. So we could use your approach.

jdfreder commented 11 years ago

I think the only raw latex we currently use is \cite

I think we decided against that in one of the Google hangouts, see https://github.com/ipython/nbconvert-examples/blob/master/citations/Tools%20for%20the%20lifecycle%20of%20computational%20research.ipynb#L47 (that's how the citation PR works right now, HTML tags)

jakobgager commented 11 years ago

Of course it is entered as html tags but #4090 converts these to \cite and (actually) you posted above that

When, in the future, a PR is opened to implement something like this, we will need to remember that #4090 will have to act prior to the markdown filter.

Hence, the current markdown filter will face a raw latex \cite, No?

jdfreder commented 11 years ago

Hence, the current markdown filter will face a raw latex \cite, No?

Oh, no I didn't realize it was still doing the conversion. :confused:

Shouldn't that conversion be handled by us here, during the python filtering stuff? It would be nice if pandoc had built-in support for converting cite tags to tex. It looks like http://johnmacfarlane.net/pandoc/README.html#citations is a bit much.

jakobgager commented 11 years ago

Yeah, it would be possible to do the html citation to latex citation conversion during the python filtering; however, it might be a bit more tricky as the json would consist of 3 cells instead of just 1:

{u'RawInline': [u'html', u'<cite data-cite="asdf">']},
{u'Str': u'qwer'},
{u'RawInline': [u'html', u'</cite>']}

Pandoc citation could be an option as well however I have no experience with that. If it's just to change the \cite{foo} to @foo it would be trivial. But I guess it will be more complicated. And to make pandoc support our html citation would require someone to learn Haskell - again :sunglasses:
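
For illustration, collapsing those three cells into a single raw latex cell during the python filtering could look something like the sketch below. It assumes the JSON layout shown above, uses a plain regular expression instead of a real HTML parser, and `collapse_citations` is a made-up name; note that it drops the displayed text ('qwer' in the example) and keeps only the key.

```python
import re

# Assumption: the opening tag carries nothing but a double-quoted data-cite.
_CITE_OPEN = re.compile(r'<cite\s+data-cite="([^"]+)"\s*>')


def _raw_html(item):
    """Return the html text of a RawInline html cell, else None."""
    if isinstance(item, dict) and 'RawInline' in item:
        fmt, text = item['RawInline']
        if fmt == 'html':
            return text
    return None


def collapse_citations(items):
    """Replace <cite data-cite="key"> ... </cite> runs with \\cite{key}."""
    out = []
    i = 0
    while i < len(items):
        html = _raw_html(items[i])
        match = _CITE_OPEN.fullmatch(html) if html else None
        if match:
            # Look for the matching closing tag further along the list.
            j = i + 1
            while j < len(items) and _raw_html(items[j]) != '</cite>':
                j += 1
            if j < len(items):
                key = match.group(1)
                out.append({'RawInline': ['latex', '\\cite{%s}' % key]})
                i = j + 1
                continue
        out.append(items[i])
        i += 1
    return out
```

An opening tag without a closing partner is left alone, so malformed input passes through unchanged rather than swallowing the rest of the paragraph.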

jdfreder commented 11 years ago

however it might be a bit more tricky as the json would consist of 3 cells instead of just 1:

I have a rather generic method to do this in my solution above. The only annoying part would be parsing the HTML tag; I guess we could use lxml like @ellisonbg uses in his PR - https://github.com/ipython/ipython/pull/4090/files#L2R36

jdfreder commented 11 years ago

have a rather generic method

well, it would need to be made a little bit more generic... :stuck_out_tongue_closed_eyes:

jakobgager commented 11 years ago

I tested your approach with pandoc 1.9.4.1 and 1.10.1 and both work fine. Actually, my approach is too simple in some cases (equations with spaces :sweat:), so we should stick with yours!

Do you have pandoc-master at hand to test this as well? @jgm posted in jgm/pandoc#954 that his filters are designed to work with 1.12 (master) and won't work properly with 1.9.4.1. So, there might be some changes which could break our approach in future versions of pandoc. Unfortunately, I don't have good experiences with installing pandoc using cabal, I actually haven't succeeded a single time :smile:.

jdfreder commented 11 years ago

Do you have pandoc-master at hand to test this as well?

I don't, but I'll go ahead and try to install it. :tired_face: I'll report back with my findings

jgm commented 11 years ago

+++ Jonathan Frederic [Sep 16 13 14:16 ]:

 Do you have pandoc-master at hand to test this as well?
I don't, but I'll go ahead and try to install it. :tired_face: I'll report back with my findings

Why not just install pandoc 1.12, released yesterday?

jdfreder commented 11 years ago

Why not just install pandoc 1.12, released yesterday?

Oh nice, I'll go ahead and try that. Thanks @jgm and thank you for making such a nice tool. I use it ALL the time and wouldn't be able to do much without it!

I think it's still a good idea for me to try to install pandoc from source. I've never tried it before and I think it's a healthy exercise.

jdfreder commented 11 years ago

Just reporting back... I only encountered one small hiccup installing Pandoc from source (my fault). I forgot to prepend sudo to the cabal install pandoc (a bunch of dependencies failed, with a colorful variety of error messages). Other than that, everything went very smoothly (following the docs).

The approach I suggested above doesn't seem to work for the latest, Pandoc 1.9.4.2. I'll take a look in greater detail and report back.

jdfreder commented 11 years ago

It looks like the latest Pandoc is just a little more strict with the spacing between $ symbols and math content (correct me if I'm wrong @jgm). Since we support that space in the notebook, there already exists code to correct it.

Using **foo** $\left( \sum_{k=1}^n a_k b_k \right)^2 \leq$ <b>b\$ar</b> $$test$$ instead of **foo** $\left( \sum_{k=1}^n a_k b_k \right)^2 \leq $ <b>b\$ar</b> $$test$$ produces the correct output :smile: woohoo!

@jakobgager the code that automatically fixes the spacing is defined as _strip_mathspace in IPython.nbconvert.filters.latex
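
To make the spacing fix concrete, here is a rough regex-based sketch of the idea. It is not the actual `_strip_mathspace` implementation (which may work quite differently); `strip_inline_math_space` is a hypothetical name, and the sketch only handles single-line `$ ... $` spans, leaving `$$ ... $$` and escaped `\$` alone.

```python
import re

# An inline delimiter is a "$" that is neither escaped (\$) nor part of "$$".
_INLINE_MATH = re.compile(
    r"(?<![\\$])\$(?!\$)"   # opening $
    r"\s*(.*?)\s*"          # content, without surrounding whitespace
    r"(?<![\\$])\$(?!\$)"   # closing $
)


def strip_inline_math_space(text):
    """Remove whitespace directly inside inline $...$ delimiters."""
    return _INLINE_MATH.sub(lambda m: "$" + m.group(1) + "$", text)
```

Applied to the first test string above, this should yield the second one, i.e. only the space before the closing $ of the inline equation is removed.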

jdfreder commented 11 years ago

Actually, my approach is too simple in some cases (equations with spaces ), so we should stick with yours!

@jakobgager , how about a mix of both of our approaches? That way the raw mathjax latex can still be parsed (since it looks like that is important).

jakobgager commented 11 years ago

As mentioned in #4251 we should try with some align or eqnarray environment (the & is actually important).

With respect to #4234 there should be a decision if that much latex is to be allowed or not. If yes there is clearly a need to mix our approaches.

jdfreder commented 11 years ago

I added this to the Hackpad for tomorrow's meeting- https://hackpad.com/IPython-dev-meetings-6wTSjJt7TZK

jakobgager commented 11 years ago

Great idea, however the last point is not correct. Markdown -> JSON -> Hide Latex Python Filter -> HTML +math -> JSON -> Reveal Latex Python Filter -> Latex does not drop citations (they are hidden and thus preserved), and the ampersand stuff is a pure citation2latex issue that appears with all approaches that do not drop inline latex!

jakobgager commented 11 years ago

You can also ask someone from Berkeley to talk to John MacFarlane (the genius behind pandoc) in person about including html treatment when converting markdown to latex.

jdfreder commented 11 years ago

(they are hidden and thus preserved)

Sorry, you are right, I should have clarified. I'm talking about a conversion without the citation filter (I updated the hackpad).

If I do HTML -> JSON -R (or without the -R) with the following HTML

<strong data-cite="smith">Bob</strong>

I get

{u'contents': [{u'contents': u'Bob', u'tag': u'Str'}],
                  u'tag': u'Strong'}],

Which doesn't have the contents of the data-cite attribute ("smith" is nowhere to be found).
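
Since the attribute is gone by the time pandoc emits JSON, one option would be to read it from the HTML ourselves before handing the text to pandoc. A minimal sketch using only the standard library (`CiteCollector` and `collect_cite_keys` are names invented for this example, and how the keys would be re-attached afterwards is left open):

```python
from html.parser import HTMLParser


class CiteCollector(HTMLParser):
    """Collect (tag, key) pairs for every element with a data-cite attribute."""

    def __init__(self):
        super().__init__()
        self.keys = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        for name, value in attrs:
            if name == 'data-cite' and value:
                self.keys.append((tag, value))


def collect_cite_keys(html):
    """Return the data-cite keys found in `html`, in document order."""
    parser = CiteCollector()
    parser.feed(html)
    parser.close()
    return parser.keys
```

For the `<strong data-cite="smith">Bob</strong>` input above this gives one pair, the tag name together with "smith".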

jdfreder commented 11 years ago

See https://gist.github.com/jdfreder/6734825 , must be opened in the notebook.

It renders the Pandoc and notebook Markdown outputs side by side for comparison. Some of the Pandoc output gets a bit garbled when converting from LaTeX->HTML, but not enough to ruin the results.

jakobgager commented 11 years ago

I will look at the notebook and if necessary add missing stuff.

jdfreder commented 11 years ago

add missing stuff.

Yes, please :grinning: Thanks

jakobgager commented 11 years ago

Well, before getting to the bottom of the notebook, I think there is one column missing. I would make one column for notebook rendering, one for html rendering (nbviewer-like) and one with the resulting tex. This way it's easier to get the information.

jakobgager commented 11 years ago

I really like the style!! :smile:

I would add piped tables which work with both:

compare_render(r"""
|Left |Center |Right|
|:----|:-----:|----:|
|Text1|Text2  |Text3|
""")

jdfreder commented 11 years ago

Well, before getting to the bottom of the notebook, I think there is one column missing. I would make one column for notebook rendering, one for html rendering (nbviewer-like) and one with the resulting tex. This way it's easier to get the information.

The column exists, you just need to run all of the cells :smile:

jdfreder commented 11 years ago

I know it's kind of weird that you have to run the notebook to actually see the other column. It's an unfortunate side effect of using the notebook's javascript to actually render the notebook markdown. If you look at the ugly notebook_render() code, you can see why this is

jdfreder commented 11 years ago

I would add piped tables which work with both

Ah thank you, I didn't know this worked for both let alone that it was different than the ascii tables. I'll go ahead and add it, unless you wanted to PR against my gist? - I don't know if that's possible, I don't use gists often :stuck_out_tongue:
