Literal block continues too far after bullet

drmikehenry commented 4 years ago

Literal block highlighting extends too far when the block is contained within a bullet, e.g.:

- A bullet with literal following::

    Correctly highlighted literal

  Incorrectly highlighted as literal

Correctly unhighlighted.

As I understand it from https://docutils.sourceforge.io/docs/ref/rst/restructuredtext.html#bullet-lists, the indentation level for the body of the bullet is determined by the first non-whitespace character after the bullet character (starting at A bullet with above). The subsequent literal block must be indented relative to that, and that literal block should end when the additional indentation is removed (at the line Incorrectly highlighted as literal).

Using the latest vim-restructuredtext, the line Incorrectly highlighted as literal is highlighted as a literal block.

Using the rst2html command from docutils <https://docutils.sourceforge.io/>, the above reStructuredText renders correctly as:

<ul>
<li><p class="first">A bullet with literal following:</p>
<pre class="literal-block">
Correctly highlighted literal
</pre>
<p>Incorrectly highlighted as literal</p>
</li>
</ul>
<p>Correctly unhighlighted.</p>
</div>

Literal block highlighting worked correctly for such cases before https://github.com/marshallward/vim-restructuredtext/commit/796536cdc4089d9276fb9c5cf4573ff302c1e0a4.

marshallward commented 4 years ago

Thanks, this is a longstanding problem with many of the blocks, due to difficulties with counting whitespace in VimL.

This is now resolved for literal blocks, e.g.:

A block

  An indented block::

    A literal block

  Outside the block

but the technique does not get correctly passed through all such structures, including lists, and needs to be ported to many of the other blocks.

Unfortunately this will take some structural rewrites and hasn't happened yet. But thanks for raising this and hopefully some time will be found to return to the problem.

marshallward commented 4 years ago

It looks like the issue is that there is no special handling of lists, which are treated as regular text. So the line is handled as if it is a block beginning at column 0, and the two indents are part of the same block. So it is at least "working" in that sense.

(Edit: to clarify, the following would be correctly highlighted if the list element were removed.)

Not a bullet with literal following::

    Correctly highlighted literal

  Also correctly highlighted as literal

Correctly unhighlighted.

(I am not sure why it was working prior to 796536c but I recall we used to start comparing whitespace at the first indented line, which was incorrect, rather that the line prior to indentation)

If one could start the offset calculation from the first non-blank character after the list token then this should probably work. I don't know how aware one needs to be that they are inside a list element, however. I will keep looking.

drmikehenry commented 4 years ago

Thanks; I appreciate your taking a look at this.

marshallward commented 4 years ago

I made an attempt to fix this, but cannot yet resolve it.

Currently we begin the block by looking for leading spaces, specifically this:

^\z(\s*\)

which lets us save the whitespace into \z1 and later search for this to decide whether the block has ended.

This fails because the - is not being included in the group. While I can add it to the group, e.g.:

^\z(\s*-\?\s*\)

I'm now forced to include this dash into the criterion for terminating the block, which is not what I want.

The closest I got was to use two groups before and after the slash, e.g.:

^\z(\s*\)-\?\z(\s*\)

and then replace the - with a \s when doing the match, e.g.:

\z1\s\z2

which works great for lists, but now requires a redundant space for non-list blocks.

What I really want is some ability to replace, say, \z1 with spaces of the same length. Something naive like \s{len(\z1)} isn't working for me, but this is pushing up to my limit of VimL.

Any suggestions would be appreciated here.

drmikehenry commented 4 years ago

I don't have much experience writing syntax files. I have the start of an idea, but I don't know how it might interact with the rest of the syntax logic. I adjusted the start expression on the rstLiteralBlock as shown below to allow for an additional hyphen as a bullet character (though a complete solution would need to include other bullet characters). Bullet characters must be followed by at least one space. We want to construct an end pattern that exchanges the bullet character for a space, but only when a bullet was present. The idea is to capture the first space after the bullet as \z2 and any additional spaces in \z3. If no bullet is found, both \z2 and \z3 will be empty; otherwise, \z2 will be a single space. Using \z2 twice in a row within end allows us to require two spaces (when a bullet is present) or no spaces (when no bullet is present). Any additional spaces are then matched using \z3:

syn region  rstLiteralBlock         matchgroup=rstDelimiter
      \ start='\(^\z(\s*\)\(-\z( \)\z( *\)\)\?.*\)\@<=::\n\s*\n' skip='^\s*$' end='^\(\z1\z2\z2\z3\s\+\)\@!'
      \ contains=@NoSpell

I've not done much testing with this, since I don't know if it's the right idea.

marshallward commented 4 years ago

Sorry for the very long delay, I finally got around to testing this and it appears to work well. I think the idea of using the subsequent space to replace the token is a very clever idea. It should not be difficult to replace - with the set of list tokens.

If you'd like to submit this then I'm happy to merge it. But otherwise I can make the modification.

marshallward commented 4 years ago

Also I am unsure how to adapt this trick to enumerated lists, since this relies on the token being exactly one space, but this is better than what is currently being used.

marshallward commented 4 years ago

~Ah, one more observation:~

- Unhighlighted bullet with empty literal block::

- Incorrectly highlighted bullet

- Correctly highlighted bullet

~This is a bit of an odd case, because it's an error to produce an empty literal block, but the second bullet is not necessarily wrong, and would be rendered normally. But it would be good to somehow catch this case.~

marshallward commented 4 years ago

Actually, disregard that last issue. It seems that this was already happening with the current syntax file and is unrelated to the problem of literal blocks in list tokens.

drmikehenry commented 4 years ago

I'm happy to leave the actual commit in your hands, once you are happy with the results.

I think I can extend the idea to include enumerated lists. A bullet list item is a single bullet character followed by at least one space. Enumerated lists have a lot more cases, but they all boil down to one or more "digits" (or letters or # or Roman numerals (!) or ...) and one (or more) punctuation-like characters (a period, a right-parenthesis, surrounding parentheses), and crucially at least one space afterward.

The bullet idea works because we know the bullet consumes exactly one character. If we could know a priori how many characters are in an enumeration list item before the single space, we could handle it the same way.

Consider the case of a single-digit number followed by a period:

1. bullet

This has two characters before the single space. We can match on either the bullet case or this one-digit enumeration case as follows:

\(-\z( \)\|\d\.\z( \)\)

Since both branches of the | can't match at the same time, only one of the \z groups will be non-empty. As before for the bullet case, we'd want "bullet+space" to become "two spaces", so we'd double that first \z group. For the enumeration case, we'd want "digit+period+space" to become "three spaces", so we'd triple that \z group. We can treat the two-digit enumeration as yet another separate case. An enumeration could conceptually take arbitrarily many digits, but in practical terms it's typically one or two digits, and beyond four digits seems pretty unlikely.

In essence, this divides the cases into equivalence classes based on the number of non-space characters are found before the single space. In the example below, I'm handling only the hyphen as a bullet character and the case of "digits+period" for enumerations, but if this idea appeals to you, the other cases could be added. For a given number of non-space characters, only one \z group is needed to capture the following space. This is good news, since there are only nine \z groups available. So in the table below, we can group together all things with three non-space characters and consume only a single \z group.

Examples of n characters before the single space:

1-char: -
2-char: 1., 1), -#
3-char: 10., 10), (1)

Here is an example that works for the hyphen-bullet and anything up to four digits followed by a period. It is quite possibly the most awful-looking Vimscript I've ever written. It deserves a page of comments providing explanation and/or apology :-)

syn region  rstLiteralBlock         matchgroup=rstDelimiter
      \ start='\(^\z(\s*\)\(-\z( \)\|\d\.\z( \)\|\d\d\.\z( \)\|\d\d\d\.\z( \)\|\d\d\d\d\.\z( \)\)\?\z( *\).*\)\@<=::\n\s*\n'
      \ skip='^\s*$' 
      \ end='^\(\z1\z2\z2\z3\z3\z3\z4\z4\z4\z4\z5\z5\z5\z5\z5\z6\z6\z6\z6\z6\z6\z7\s\+\)\@!'
      \ contains=@NoSpell

By an interesting coincidence, the \z group numbers happen to equal the number of times that group is repeated in the end pattern (for \z2 through \z6). As before, \z1 matches the overall leading whitespace, and now \z7 matches any excess whitespace following the required single space captured in \z2 through \z6 for each of cases.

marshallward commented 4 years ago

This is incredibly impressive, and (I think?) I can see how it would work, but it is pretty complex and might be too much work for what it is trying to achieve. If there is some simple way of identifying and grabbing the length of the enumerator, then it ought to be enough to manage the use case.

drmikehenry commented 4 years ago

Yes, in its current form it can make you cross-eyed to look at it. The source could be tidied up a good bit with some helper variables or functions to construct the regex, and as mentioned it would need some commenting. I think it's likely to work reasonably well (meaning I hope it's not overly slow to execute, and it fixes the erroneous highlighting for the practical cases). Unfortunately I have no idea how to build an end pattern based on the length of a match in the start pattern, other than this present technique. Maybe you could ask on the Vim mailing list for a better suggestion; I'm currently stumped for anything better.

marshallward / vim-restructuredtext

Literal block continues too far after bullet #56