Closed jakobgager closed 10 years ago
nbconvert has landed, want to make the new PR to continue the conversation?
Yeah, I follow the ongoing work. Great team!
I'll try to find some time this evening to do the PR. But I'm not sure if this is to be included in 1.0 anyway. Let's see.
Update: Oh, I just saw it is already assigned to 2.0! Fine then!
Continuation from #3570:
Summary of chat with @minrk:
Optional workaround (not preferred):
Note: Many html attributes will not be handled correctly (whatever approach we will finally use) including width, height as these are not captured by pandoc.
Re-post from PR:
Sorry @jakobgager , this fell off my radar somehow. Off the top of my head I was having trouble with img and table html tags not being parsed.
When, in the future, a PR is opened to implement something like this, we will need to remember that #4090 will have to act prior to the markdown filter.
I recently discovered the filter option of pandoc (it will be a feature of 1.12), but a workaround is shown in "Scripting with pandoc". Repost from jgm/pandoc#954:
I slightly modified the json walking algorithm in jgm/pandoc-filters-python/pandoc.py (see https://gist.github.com/jakobgager/2fa8d416f70155287c00 ). As I don't know the json format supplied by pandoc, I'm not sure whether this will cause trouble! For my simple example (see below) it works with versions 1.9.4.1 and 1.10.1.
With this approach and the filter https://gist.github.com/jakobgager/449484d26c3149edcf6b the input
echo 'foo $\alpha$ <i>bar</i>' | pandoc -f markdown -t html -m | pandoc -f html -t json | python conv2Math.py | pandoc -f json -t latex
leads to
foo $\alpha$ \emph{bar}
(Note that the \ in the first echo call has to be escaped in csh but not in bash.) It also works with more complex examples like math and markdown in html tables.
@minrk do you think we could use something like this?
One issue with the above approach is that this way you lose raw latex commands. But as the notebook renders raw html nicely and raw latex not at all, I guess that this drawback is not that pronounced.
Raw latex tags should pass through in markdown cells and not be removed.
On Tue, Sep 3, 2013 at 2:19 PM, Jakob Gager notifications@github.com wrote:
One issue with the above approach is that this way you lose raw latex commands. But as the notebook renders raw html nicely and raw latex not at all, I guess that this drawback is not that pronounced.
— Reply to this email directly or view it on GitHubhttps://github.com/ipython/ipython/issues/3503#issuecomment-23747807 .
Brian E. Granger Cal Poly State University, San Luis Obispo bgranger@calpoly.edu and ellisonbg@gmail.com
Well currently, raw html tags are removed which IMHO is worse than removing raw latex! Unfortunately pandoc is a bit picky about such raw inclusions.
Ah, I see your concern is related to your citation PR #4090.
Repost from jgm/pandoc#954: Ok, after digging around I found a 2-step solution which seems to work for my test cases and pandoc 1.10.1. This way raw latex and math is preserved and html tags are converted. Should work fine with citations now.
The basic call is as follows:
pandoc -f markdown -t json text.mkd | ./concealLatex.py | pandoc -f json -t html -m | pandoc -f html -t json | ./revealLatex.py | pandoc -f json -t latex
Quite complex, isn't it :smile:
The required filters are here: https://gist.github.com/jakobgager/6473761
Requires my pandoc_2.py https://gist.github.com/jakobgager/2fa8d416f70155287c00
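The six-stage call above can also be driven from Python instead of a shell pipe. This is only a minimal sketch, assuming `pandoc` and the two gist scripts are available in the working directory; `build_stages` and `run_stages` are hypothetical helper names, not part of nbconvert:

```python
import subprocess

def build_stages(markdown_path):
    """Return the six commands of the shell pipeline as argv lists."""
    return [
        ["pandoc", "-f", "markdown", "-t", "json", markdown_path],
        ["./concealLatex.py"],
        ["pandoc", "-f", "json", "-t", "html", "-m"],
        ["pandoc", "-f", "html", "-t", "json"],
        ["./revealLatex.py"],
        ["pandoc", "-f", "json", "-t", "latex"],
    ]

def run_stages(stages, data=b""):
    """Feed each stage's stdout to the next stage's stdin, like a shell pipe."""
    for argv in stages:
        result = subprocess.run(argv, input=data, stdout=subprocess.PIPE, check=True)
        data = result.stdout
    return data

# Usage (needs pandoc and the gist filters, so it is not executed here):
# latex = run_stages(build_stages("text.mkd")).decode("utf-8")
```

With `check=True` a failing stage raises immediately, which makes it easier to see which of the six steps broke.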
Not sure how I feel about this level of complexity. It makes me think that it will be very fragile and that we don't have a good handle on exactly what/how we are parsing. Unfortunately I am mostly offline this week so can't look at it further. @jdfreder, what do you think?
Cheers,
Brian
On Sat, Sep 7, 2013 at 1:15 AM, Jakob Gager notifications@github.com wrote:
Repost from jgm/pandoc#954 https://github.com/jgm/pandoc/issues/954: Ok, after digging around I found a 2-step solution which seems to work for my test cases and 1.10.1. This way raw latex and math is preserved and html tags are converted. Should work fine with citations now.
The basic call is as follows:
pandoc -f markdown -t json text.mkd | ./concealLatex.py | pandoc -f json -t html -m | pandoc -f html -t json | ./revealLatex.py | pandoc -f json -t latex
Quite complex, isn't it :smile:
The required filters are here: https://gist.github.com/jakobgager/6473761
Requires my pandoc_2.py https://gist.github.com/jakobgager/2fa8d416f70155287c00
@jakobgager , correct me if I'm wrong, right now this does this:
pandoc -f markdown -t json text.mkd
markdown to json
./concealLatex.py
convert all tex in the json to plain strings
pandoc -f json -t html -m
convert the json (markdown without tex) to html
pandoc -f html -t json
convert the results back to json. At this point all the original markdown and html is html.
./revealLatex.py
convert all the plain text that looks like it was latex originally back into tex
pandoc -f json -t latex
convert the json (html and latex) into latex
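The conceal/reveal idea in the steps above can be illustrated without pandoc at all. The sketch below is not the code from the gist: it works on a plain string instead of pandoc's JSON, and the placeholder format `TEXnXET` and both function names are made up for illustration.

```python
import re

# Matches display math ($$...$$) first, then inline math ($...$), skipping escaped \$.
MATH = re.compile(r'(?<!\\)\$\$.+?\$\$|(?<!\\)\$.+?(?<!\\)\$', re.DOTALL)

def conceal_latex(text):
    """Swap every math span for an inert token and remember the original."""
    hidden = []
    def stash(match):
        hidden.append(match.group(0))
        return "TEX%dXET" % (len(hidden) - 1)
    return MATH.sub(stash, text), hidden

def reveal_latex(text, hidden):
    """Put the remembered math spans back in place of their tokens."""
    return re.sub(r'TEX(\d+)XET', lambda m: hidden[int(m.group(1))], text)

source = r'foo $\alpha$ <i>bar</i> $$x^2$$'
concealed, hidden = conceal_latex(source)
# Stand-in for the html round trip: any conversion that leaves plain words alone.
converted = concealed.replace('<i>', r'\emph{').replace('</i>', '}')
print(reveal_latex(converted, hidden))  # foo $\alpha$ \emph{bar} $$x^2$$
```

The point of the detour is that the middle conversion never sees a `$` or a backslash, so it has nothing to strip.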
@ellisonbg I totally agree with your feelings! This is currently way too complicated; however, I haven't found a better solution yet. The best way, for sure, would be to learn Haskell and try to get pandoc to NOT strip raw html :smirk:.
So you're saying: Learn You a Haskell for Great Good? ;-)
@jdfreder you are right, that's what pandoc is supposed to do. Some comments:
So you're saying: Learn You a Haskell for Great Good? ;-)
To be honest, I don't understand the real meaning of this sentence but the book and website are really great! And Haskell is weird but somehow compelling!
@ellisonbg and @jakobgager ,
I looked into this in great detail (I have a large nb filled with different attempts to get this right).
I've found that this can actually be accomplished in 3 calls to pandoc (instead of 4) and 1 python file (instead of 2):
Any way we approach it (from what I found), we are going to need to parse some of it ourselves using the pandoc JSON output or implement some sort of extension for pandoc in Haskell. I also tried many of the relevant markdown extensions listed on pandoc's website.
Using this input
**foo** $\left( \sum_{k=1}^n a_k b_k \right)^2 \leq $ <b>b\$ar</b> $$test$$
And this Python code
import json

# run() is a small helper that returns a command's stdout (its definition is posted further down).
a = json.loads(run("pandoc -f markdown -t html test.md --gladtex | pandoc -f html -t json -R"))

def traverse_json(action, json_object):
    # Apply action to every node of the JSON tree, depth first.
    action(json_object)
    if isinstance(json_object, dict):
        for key, value in json_object.items():
            traverse_json(action, value)
    elif isinstance(json_object, list):
        for value in json_object:
            traverse_json(action, value)

def fix_raw_math(items, html_math_type, latex_math_type):
    raw_math = -1
    raw_maths = []
    for index, item in enumerate(items):
        # Check if this is a RawInline item.
        if isinstance(item, dict) and 'RawInline' in item:
            raw_inline = item['RawInline']
            # Only read the RawInline if it is HTML.
            if raw_inline[0] == "html":
                html = raw_inline[1]
                if html == '<eq env="%s">' % html_math_type:
                    raw_math = index
                elif html == '</eq>':
                    if raw_math >= 0:
                        raw_maths.append((raw_math, index))
                        raw_math = -1
    if len(raw_maths) > 0:
        # Work backwards so earlier indices stay valid while items are deleted.
        for raw_math in raw_maths[::-1]:
            # Read math
            latex = ''
            for index in range(raw_math[0] + 1, raw_math[1]):
                if isinstance(items[index], dict) and 'Str' in items[index]:
                    latex += items[index]['Str']
                elif items[index] == "Space":
                    latex += ' '
            # Remove math
            for index in range(raw_math[0], raw_math[1] + 1)[::-1]:
                del items[index]
            # Re-add math
            items.insert(raw_math[0], {'Math': [latex_math_type, latex]})

def inline_math_action(value):
    if isinstance(value, list):
        fix_raw_math(value, 'math', 'InlineMath')
        fix_raw_math(value, 'displaymath', 'DisplayMath')

from IPython.nbconvert.utils.pandoc import pandoc
traverse_json(inline_math_action, a)
print(pandoc(json.dumps(a), 'json', 'latex'))
I was able to get the correct output
\textbf{foo} $\left( \sum_{k=1}^n a_k b_k \right)^2 \leq $
\textbf{b\$ar} \[test\]
A little more detail, running :
pandoc -f markdown -t html test.md --gladtex | pandoc -f html -t json -R
Produces this output:
[{'docAuthors': [], 'docDate': [], 'docTitle': []},
 [{'Para': [{'Strong': [{'Str': 'foo'}]},
            'Space',
            {'RawInline': ['html', '<eq env="math">']},
            {'Str': '\\left('},
            'Space',
            {'Str': '\\sum_{k=1}^n'},
            'Space',
            {'Str': 'a_k'},
            'Space',
            {'Str': 'b_k'},
            'Space',
            {'Str': '\\right)^2'},
            'Space',
            {'Str': '\\leq'},
            'Space',
            {'RawInline': ['html', '</eq>']},
            'Space',
            {'Strong': [{'Str': 'b$ar'}]},
            'Space',
            {'RawInline': ['html', '<eq env="displaymath">']},
            {'Str': 'test'},
            {'RawInline': ['html', '</eq>']}]}]]
Running the Python logic above just recognizes the RawInline HTML that pandoc adds and puts the code in the correct JSON blocks, producing this:
[{'docAuthors': [], 'docDate': [], 'docTitle': []},
 [{'Para': [{'Strong': [{'Str': 'foo'}]},
            'Space',
            {'Math': ['InlineMath',
                      '\\left( \\sum_{k=1}^n a_k b_k \\right)^2 \\leq ']},
            'Space',
            {'Strong': [{'Str': 'b$ar'}]},
            'Space',
            {'Math': ['DisplayMath', 'test']}]}]]
Which pandoc can convert easily into the correct latex.
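To try the merging step without pandoc installed, the paragraph content from the dump above can be fed in by hand. This is a condensed sketch of the same idea, not a replacement for the posted code; `merge_eq_spans` is a hypothetical name, it assumes the pandoc 1.x JSON layout shown above, and unlike the original it returns a new list instead of editing in place.

```python
def merge_eq_spans(items, env, math_type):
    """Collapse <eq env=...> ... </eq> runs in an inline list into one Math node."""
    out, buffer = [], None
    for item in items:
        raw = item.get('RawInline') if isinstance(item, dict) else None
        if raw == ['html', '<eq env="%s">' % env]:
            buffer = []                      # start collecting math text
        elif raw == ['html', '</eq>'] and buffer is not None:
            out.append({'Math': [math_type, ''.join(buffer)]})
            buffer = None
        elif buffer is not None:
            # Inside a span: keep Str text, turn 'Space' into a blank.
            if item == 'Space':
                buffer.append(' ')
            elif isinstance(item, dict) and 'Str' in item:
                buffer.append(item['Str'])
        else:
            out.append(item)
    return out

para = [{'Strong': [{'Str': 'foo'}]}, 'Space',
        {'RawInline': ['html', '<eq env="math">']},
        {'Str': '\\left('}, 'Space', {'Str': '\\sum_{k=1}^n'}, 'Space',
        {'Str': 'a_k'}, 'Space', {'Str': 'b_k'}, 'Space',
        {'Str': '\\right)^2'}, 'Space', {'Str': '\\leq'}, 'Space',
        {'RawInline': ['html', '</eq>']}, 'Space',
        {'Strong': [{'Str': 'b$ar'}]}, 'Space',
        {'RawInline': ['html', '<eq env="displaymath">']},
        {'Str': 'test'},
        {'RawInline': ['html', '</eq>']}]
para = merge_eq_spans(para, 'math', 'InlineMath')
para = merge_eq_spans(para, 'displaymath', 'DisplayMath')
print(para)
```

Running the two passes in this order leaves the `displaymath` tags untouched during the first pass, so they are still there for the second one.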
Great that you had a look at this too! Some comments:
The first pandoc -f markdown -t html test.md --gladtex call strips all raw latex (you haven't added any in your test case), hence the citation PR won't work anymore. That's the reason I started with a markdown2json call.
I am wary of the -R parameter, since this introduces a high risk of getting some untranslatable html into the final latex file.
btw. what is the run method in your script?
I'm pretty sure the citation PR does not use raw latex, that was something we talked about at the last Google hangout meeting (or the meeting before). See citation example - https://github.com/ipython/nbconvert-examples/blob/master/citations/Tools%20for%20the%20lifecycle%20of%20computational%20research.ipynb#L47
Also, the -R only creates RawInline elements for things that can't be converted to JSON. During the JSON to LaTeX stage, all RawInline items in the JSON get dropped religiously (since I don't use -R there). The only ones that get acknowledged are those that verbatim match what the Py script looks for.
btw. what is the run method in your script?
Ah, I was wondering if someone would ask about that :smile:
from IPython.utils.process import get_output_error_code
def run(command):
    stdout, stderr, retcode = get_output_error_code(command)
    return stdout
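For anyone without IPython at hand, roughly the same helper can be written with the standard library. This is a sketch only, assuming a shell is available since the command line contains a pipe; `run_stdlib` is a hypothetical name:

```python
import subprocess

def run_stdlib(command):
    """Run a shell command line and return its stdout as text."""
    result = subprocess.run(command, shell=True, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE, universal_newlines=True)
    return result.stdout
```

Like the helper above, it ignores stderr and the return code, so a failing pandoc call just yields an empty string.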
Your approach highly resembles my first idea (see above) - the problem is, that the first
Sorry for the confusion @jakobgager . I thought that the only latex we supported was that which was embedded in $ $ and $$ $$ environments. @ellisonbg if we want to support raw latex outside of $ $ and $$ $$ environments, I think @jakobgager 's second idea is the best way to do it now. Otherwise, his first idea or the method I posted (almost the same) would be better.
all RawInline items in the JSON get dropped
You are right, I missed that.
I think the only raw latex we currently use is \cite; hence, a possible simplification would be to change the citation PR so that the cite command does not get embedded as latex but as some tagged string. This can pass the md2html2json conversion unchanged, and in the python script it gets converted to latex.
So we could use your approach.
I think the only raw latex we currently use is \cite
I think we decided against that in one of the Google hangouts, see https://github.com/ipython/nbconvert-examples/blob/master/citations/Tools%20for%20the%20lifecycle%20of%20computational%20research.ipynb#L47 (that's how the citation PR works right now, HTML tags)
Of course it is entered as html tags but #4090 converts these to \cite and (actually) you posted above that
When, in the future, a PR is opened to implement something like this, we will need to remember that #4090 will have to act prior to the markdown filter.
Hence, the current markdown filter will face a raw latex \cite, No?
Hence, the current markdown filter will face a raw latex \cite, No?
Oh, no I didn't realize it was still doing the conversion. :confused:
Shouldn't that conversion be handled by us here, during the python filtering stuff? It would be nice if pandoc had built in support for converting cite tags to tex. It looks like http://johnmacfarlane.net/pandoc/README.html#citations is a bit much.
Yeah, it would be possible to do the html citation to latex citation conversion during the python filtering; however, it might be a bit more tricky as the json would consist of 3 cells instead of just 1:
{u'RawInline': [u'html', u'<cite data-cite="asdf">']},
{u'Str': u'qwer'},
{u'RawInline': [u'html', u'</cite>']}
Pandoc citations could be an option as well; however, I have no experience with that. If it's just to change the \cite{foo} to @foo it would be trivial. But I guess it will be more complicated.
And to make pandoc support our html citation would require someone to learn Haskell - again :sunglasses:
however it might be a bit more tricky as the json would consist of 3 cells instead of just 1:
I have a rather generic method to do this in my solution above. The only annoying part would be parsing the HTML tag, I guess we could use lxml like @ellisonbg uses in his PR - https://github.com/ipython/ipython/pull/4090/files#L2R36
have a rather generic method
well, it would need to be made a little bit more generic... :stuck_out_tongue_closed_eyes:
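As a sketch of what "a little bit more generic" could look like for citations, the three inline cells can be folded into a single raw LaTeX cell. This assumes the pandoc 1.x JSON layout shown above and that `\cite{key}` is the desired output; `fold_citations` is a hypothetical name, and a regular expression stands in for lxml here.

```python
import re

CITE_OPEN = re.compile(r'<cite\s+data-cite="([^"]+)"\s*>')

def fold_citations(items):
    """Replace <cite data-cite="key"> ... </cite> runs with one raw LaTeX cite command."""
    out, key = [], None
    for item in items:
        raw = item.get('RawInline') if isinstance(item, dict) else None
        is_html = isinstance(raw, list) and raw[0] == 'html'
        opening = CITE_OPEN.match(raw[1]) if is_html else None
        if opening:
            key = opening.group(1)          # remember the key, drop the tag
        elif is_html and raw[1] == '</cite>' and key is not None:
            out.append({'RawInline': ['latex', '\\cite{%s}' % key]})
            key = None
        elif key is None:
            out.append(item)                # outside a citation: keep as is
        # Otherwise we are inside a citation and the link text is dropped.
    return out

cells = [{u'RawInline': [u'html', u'<cite data-cite="asdf">']},
         {u'Str': u'qwer'},
         {u'RawInline': [u'html', u'</cite>']}]
print(fold_citations(cells))
```

Dropping the link text ('qwer' above) is a choice made for this sketch; whether it should be kept somewhere is left open.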
I tested your approach with pandoc 1.9.4.1 and 1.10.1 and both work fine. Actually, my approach is too simple in some cases (equations with spaces :sweat:), so we should stick with yours!
Do you have pandoc-master at hand to test this as well? @jgm posted in jgm/pandoc#954 that his filters are designed to work with 1.12 (master) and won't work properly with 1.9.4.1. So, there might be some changes which could break our approach in future versions of pandoc. Unfortunately, I don't have good experiences with installing pandoc using cabal, I actually haven't succeeded a single time :smile:.
Do you have pandoc-master at hand to test this as well?
I don't, but I'll go ahead and try to install it. :tired_face: I'll report back with my findings
+++ Jonathan Frederic [Sep 16 13 14:16 ]:
Do you have pandoc-master at hand to test this as well?
I don't, but I'll go ahead and try to install it. :tired_face: I'll report back with my findings
Why not just install pandoc 1.12, released yesterday?
Why not just install pandoc 1.12, released yesterday?
Oh nice, I'll go ahead and try that. Thanks @jgm and thank you for making such a nice tool. I use it ALL the time and wouldn't be able to do much without it!
I think it's still a good idea for me to try to install pandoc from source. I've never tried it before and I think it's a healthy exercise.
Just reporting back... I only encountered one small hiccup installing Pandoc from source (my fault). I forgot to prepend sudo to the cabal install pandoc (a bunch of dependencies failed, with a colorful variety of error messages). Other than that, everything went very smoothly (following the docs).
The approach I suggested above doesn't seem to work for the latest Pandoc, 1.12. I'll take a look in greater detail and report back.
It looks like the latest Pandoc is just a little more strict with the spacing between $ symbols and math content (correct me if I'm wrong @jgm). Since we support that space in the notebook, there already exists code to correct it.
Using
**foo** $\left( \sum_{k=1}^n a_k b_k \right)^2 \leq$ <b>b\$ar</b> $$test$$
instead of
**foo** $\left( \sum_{k=1}^n a_k b_k \right)^2 \leq $ <b>b\$ar</b> $$test$$
produces the correct output :smile: woohoo!
@jakobgager the code that automatically fixes the spacing is defined as _strip_mathspace in IPython.nbconvert.filters.latex
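To make the spacing fix concrete, here is a rough regular-expression sketch of the idea. It is not the nbconvert implementation mentioned above; `strip_math_space_sketch` is a hypothetical name, and it only handles inline math delimited by single, unescaped $ signs.

```python
import re

# Inline math only: a single, unescaped $ on each side.
INLINE_MATH = re.compile(r'(?<![\\$])\$(?!\$)\s*(.*?)\s*(?<!\\)\$(?!\$)', re.DOTALL)

def strip_math_space_sketch(text):
    """Remove whitespace just inside the $ delimiters of inline math."""
    return INLINE_MATH.sub(lambda m: '$' + m.group(1) + '$', text)

before = r'**foo** $\left( \sum_{k=1}^n a_k b_k \right)^2 \leq $ <b>b\$ar</b> $$test$$'
print(strip_math_space_sketch(before))
# **foo** $\left( \sum_{k=1}^n a_k b_k \right)^2 \leq$ <b>b\$ar</b> $$test$$
```

Text that contains two unrelated dollar amounts would be mistaken for math by this pattern, so it is only meant to show the transformation.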
Actually, my approach is too simple in some cases (equations with spaces ), so we should stick with yours!
@jakobgager , how about a mix of both of our approaches? That way the raw mathjax latex can still be parsed (since it looks like that is important).
As mentioned in #4251 we should try with some align or eqnarray environment (the & is actually important).
With respect to #4234 there should be a decision if that much latex is to be allowed or not. If yes there is clearly a need to mix our approaches.
I added this to the Hackpad for tomorrow's meeting- https://hackpad.com/IPython-dev-meetings-6wTSjJt7TZK
Great idea, however the last point is not correct.
Markdown -> JSON -> Hide Latex Python Filter -> HTML +math -> JSON -> Reveal Latex Python Filter -> Latex
does not drop citations (they are hidden and thus preserved) and the ampersand stuff is a pure citation2latex issue and appears with all approaches that do not drop inline latex!
You can also ask someone from Berkeley to talk to John MacFarlane (the genius behind pandoc) in person about including html treatment when converting markdown to latex.
(they are hidden and thus preserved)
Sorry, you are right, I should have clarified. I'm talking about a conversion without the citation filter (I updated the hackpad).
If I do HTML -> JSON -R (or without the -R) with the following HTML
<strong data-cite="smith">Bob</strong>
I get
{u'contents': [{u'contents': u'Bob', u'tag': u'Str'}],
u'tag': u'Strong'}],
Which doesn't have the contents of the data-cite attribute ("smith" is nowhere to be found).
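Since the attribute is lost in pandoc's JSON, one option is to pull the keys out of the HTML before pandoc sees it. A sketch using only the standard library; the class name is made up, and how the keys would be re-attached afterwards is left open:

```python
from html.parser import HTMLParser

class DataCiteCollector(HTMLParser):
    """Record (tag, key) for every element carrying a data-cite attribute."""
    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == 'data-cite':
                self.found.append((tag, value))

collector = DataCiteCollector()
collector.feed('<strong data-cite="smith">Bob</strong>')
print(collector.found)  # [('strong', 'smith')]
```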
See https://gist.github.com/jdfreder/6734825 , must be opened in the notebook.
It renders the Pandoc and notebook Markdown outputs side by side for comparison. Some of the Pandoc output gets a bit garbled when converting from LaTeX->HTML, but not enough to ruin the results.
I will look at the notebook and if necessary add missing stuff.
add missing stuff.
Yes, please :grinning: Thanks
Well, before getting to the bottom of the notebook I think there is one column missing. I would make one column for notebook rendering, one for html rendering (nbviewer like) and one with the resulting tex. This way it's easier to get the information.
I really like the style!! :smile:
I would add piped tables which work with both:
compare_render(r"""
|Left |Center |Right|
|:----|:-----:|----:|
|Text1|Text2 |Text3|
""")
Well before getting to the bottom of the notebook I think there is one column missing. I would make one column for notebook rendering, one for html rendering (nbviewer like) and one with the resulting tex. This way its easier to get the information.
The column exists, you just need to run all of the cells :smile:
I know it's kind of weird that you have to run the notebook to actually see the other column. It's an unfortunate side effect of using the notebook's javascript to actually render the notebook markdown. If you look at the ugly notebook_render() code, you can see why this is.
I would add piped tables which work with both
Ah thank you, I didn't know this worked for both let alone that it was different than the ascii tables. I'll go ahead and add it, unless you wanted to PR against my gist? - I don't know if that's possible, I don't use gists often :stuck_out_tongue:
Pandoc strips raw html tags present in markdown cells during the conversion to latex, see http://johnmacfarlane.net/pandoc/README.html#raw-html .
This has been addressed here: ipython/nbconvert#183