jgm / pandoc

Universal markup converter
https://pandoc.org

RST reader: problem parsing directive with uneven indentation #5753

Closed: raffaem closed this issue 5 years ago

raffaem commented 5 years ago
pandoc -s -o 4_word_embeddings_tutorial.pdf 4_word_embeddings_tutorial.rst
[WARNING] Reference not found for 'q' at chunk line 1 column 7
[WARNING] Reference not found for 'q' at chunk line 1 column 32
Error producing PDF.
! Missing \right. inserted.
<inserted text>
                \right .
l.188 ... \left[ \overbrace{2.3}^\text{can run},\]

I've attached the .rst file, renamed to .txt since GitHub does not support .rst uploads:

4_word_embeddings_tutorial.rst.txt

mb21 commented 5 years ago

You seem to have some interesting math in there, like:

.. math::

    q_\text{mathematician} = \left[ \overbrace{2.3}^\text{can run},
   \overbrace{9.4}^\text{likes coffee}, \overbrace{-5.5}^\text{majored in Physics}, \dots \right]

Where is the q command supposed to come from?

jgm commented 5 years ago

Running pandoc -t latex -f rst on the bit @mb21 quotes, we get:

\[q_\text{mathematician} = \left[ \overbrace{2.3}^\text{can run},\]

\begin{quote}
overbrace\{9.4\}\^{}text\{likes coffee\},
overbrace\{-5.5\}\^{}text\{majored in Physics\}, dots right{]}
\end{quote}

So the second line is being interpreted as a block quote, not as part of the math directive. Why? Because it isn't indented to the same level as the first line of the math (the one beginning with q_), but one space less.
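The one-space difference is easy to miss by eye. A minimal Python sketch that measures it, using the two body lines quoted by @mb21 verbatim (the helper name indent_width is made up for illustration and is not part of pandoc or docutils):

```python
# The two body lines of the math directive, copied from the quoted source.
body = [
    r"    q_\text{mathematician} = \left[ \overbrace{2.3}^\text{can run},",
    r"   \overbrace{9.4}^\text{likes coffee}, \overbrace{-5.5}^\text{majored in Physics}, \dots \right]",
]


def indent_width(line: str) -> int:
    """Number of leading spaces on a line (hypothetical helper)."""
    return len(line) - len(line.lstrip(" "))


widths = [indent_width(line) for line in body]
print(widths)  # the first line has 4 leading spaces, the second only 3
```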

It may be that pandoc isn't correctly interpreting RST here, in which case this is an RST reader bug with parsing of directives.

jgm commented 5 years ago

Here's what the docutils documentation says:

An explicit markup block is a text block:

  • whose first line begins with ".." followed by whitespace (the "explicit markup start"),
  • whose second and subsequent lines (if any) are indented relative to the first, and which ends before an unindented line.

Explicit markup blocks are analogous to bullet list items, with ".." as the bullet. The text on the lines immediately after the explicit markup start determines the indentation of the block body. The maximum common indentation is always removed from the second and subsequent lines of the block body. Therefore if the first construct fits in one line, and the indentation of the first and second constructs should differ, the first construct should not begin on the same line as the explicit markup start.

jgm commented 5 years ago

So I guess the .. math block isn't supposed to end until we find a line that starts at or before the column occupied by the .. (here the first column).

I was thrown off by the remark "The text on the lines immediately after the explicit markup start determines the indentation of the block body." But it appears this only affects the stripping of indentation from subsequent lines, and doesn't help define the indentation level needed to end the block.
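That reading of the rule can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions (spaces only, no directive arguments or options, no content on the ".." line itself); directive_body and indent_width are hypothetical helpers, not code from docutils or from pandoc's RST reader:

```python
def indent_width(line: str) -> int:
    """Number of leading spaces on a line."""
    return len(line) - len(line.lstrip(" "))


def directive_body(lines, start):
    """Return the body of the explicit markup block whose '..' line is lines[start].

    Hypothetical helper: a simplified reading of the docutils rule,
    not the real parser of either docutils or pandoc.
    """
    marker_col = indent_width(lines[start])
    body = []
    for line in lines[start + 1:]:
        # The block ends at the first non-blank line that is not indented
        # relative to the '..' marker, whatever the first body line's indent is.
        if line.strip() and indent_width(line) <= marker_col:
            break
        body.append(line)
    # Drop blank lines at either end of the body.
    while body and not body[0].strip():
        body.pop(0)
    while body and not body[-1].strip():
        body.pop()
    # The first body line does not set the block's extent; only the
    # indentation common to all non-blank lines is stripped.
    common = min((indent_width(l) for l in body if l.strip()), default=0)
    return [l[common:] if l.strip() else "" for l in body]


source = [
    ".. math::",
    "",
    r"    q = \left[ a,",
    r"   b \right]",
    "",
    "Next paragraph.",
]
body = directive_body(source, 0)
print(body)
```

Under this reading both unevenly indented lines stay inside the math block, and the following unindented paragraph is what ends it.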

This shouldn't be hard to fix in the RST reader, but in the meantime, evening up the indentation in your source will allow pandoc to deal with it.
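As a rough sketch of that workaround, the snippet below re-indents every non-blank line of a directive body to the same width. It assumes the body is whitespace-insensitive, as math is; it would destroy meaningful relative indentation in something like a code block. The function name even_up and the default width of 4 are arbitrary choices for illustration:

```python
def even_up(body_lines, width=4):
    """Give every non-blank line the same indentation (hypothetical helper).

    Assumption: relative indentation inside the body carries no meaning,
    which holds for math but not for literal or code blocks.
    """
    return [(" " * width + line.strip()) if line.strip() else "" for line in body_lines]


uneven = [
    r"    q_\text{mathematician} = \left[ \overbrace{2.3}^\text{can run},",
    r"   \overbrace{9.4}^\text{likes coffee}, \overbrace{-5.5}^\text{majored in Physics}, \dots \right]",
]
fixed = even_up(uneven)
for line in fixed:
    print(line)
```

After this pass both lines start in the same column, which is the form the current RST reader handles.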

jgm commented 5 years ago

The warnings about q are no doubt related to this (parsing of math is being cut off early).

raffaem commented 5 years ago

I'm so sorry for not mentioning this earlier: the file is taken from https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html#sphx-glr-beginner-nlp-word-embeddings-tutorial-py