ryneeverett / python-markdown-comments

A Python-Markdown extension to ignore html comments opened by three dashes.
10 stars 6 forks source link

Comments in code lines/blocks #2

Closed snoozbuster closed 9 years ago

snoozbuster commented 9 years ago

It would be nice to be able to talk about the comments this tool provides inside of code blocks. I'm pretty sure that with a simple modification to the regex you should be able to support this.

Change the regex in _uncommented from .*<!---.*---> to [^( )]<!---.--->, and similarly for the multiline start comment (note that there is supposed to be exactly 4 spaces inside the parens). The idea is that^is "not", and then ``( ) is "tilde and at least four spaces" (again, supposed to be exactly 4 spaces in the parens), so the two common ways of starting code blocks. Making these changes myself, the input

`<!--- test --->`
` <!---- test -->` testing quote-single-space

some text for testing `<!--- test -->` more text

inline testing <!--- test --> and more testing

I thought this would fail but it doesn't      <!--- test --> more text

 <!--- testing single space --> this works too, hurray

    <!--- test --> testing code blocks
        <!--- test --> testing 8 spaces
     <!--- test --> testing 5 spaces

produces the output:

<!--- test -->

<!--- test --> testing quote-single-space

some text for testing <!--- test --> more text

inline testing and more testing

I thought this would fail but it doesn't more text

this works too, hurray

<!--- test --> testing code blocks
    <!--- test --> testing 8 spaces
 <!--- test --> testing 5 spaces

So it seems to work even in cases where I thought it wouldn't; I'm not really sure why that is to be honest. I didn't try running the test cases with this modification, but I think if you tried they would work fine. I don't particularly feel like cloning and making the pull request, or testing the multiline comments, but I think this would be a good addition to the plugin.

ryneeverett commented 9 years ago

Thanks for the suggestion. I haven't tested it out yet, but here's some initial thoughts. (All snippets are untested/pseudocode.)


A

^ is "not"

I'm pretty sure caret matches the start of the line. https://docs.python.org/3.4/library/re.html#regular-expression-syntax


B

`( ) is "tilde and at least four spaces" (again, supposed to be exactly 4 spaces in the parens), so the two common ways of starting code blocks.

According to markdown spec:

To produce a code block in Markdown, simply indent every line of the block by at least 4 spaces or 1 tab.

So we should match a tab or spaces (something like ^( {4}|\t)).


C

I'm not sure what your brackets are doing. I thought those were only for ranges (i.e. [a-z]). I think we'll need something more like (?<!...) (negative lookbehind).


D

In order to make sure we're not in an inline code block, we'll need to make sure there aren't an odd number of `'s prior to the match. Something like

(.*`.*`).*`.*

E

Putting this together, I would expect a regex something more like

(?<!((.*`.*`).*`|^( {4}|\t))).*<!---.*--->`
snoozbuster commented 9 years ago

Regarding ^ as start-of-line or not, as well as regarding [] for ranges, see the section on brackets in the python regex docs you linked; in particular:

Used to indicate a set of characters. In a set: ... Characters that are not within a range can be matched by complementing the set. If the first character of the set is '^', all the characters that are not in the set will be matched. For example, [^5] will match any character except '5', and [^^] will match any character except '^'. ^ has no special meaning if it’s not the first character in the set.

The brackets are just a way to specify a union of characters without making a capture group like () does. The fact that ranges work in them is also a property of sets.

Regarding odd/even backticks: The Markdown spec doesn't really specify what you should do with multiple backticks right next to each other, like ``. (I'm interested in seeing how Github parses it. EDIT: it leaves them as is, you don't have to escape two backticks next to each other, they just show up) It does specify that you can use "multiple" backticks and optionally space-pad the inner content with exactly one space on either side:

To include a literal backtick character within a code span, you can use multiple backticks as the opening and closing delimiters:

``There is a literal backtick (`) here.``
`` This produces the same thing with the backtick (`) ``

So if we were interested in being thorough, we would have to check for that case. That's why I have the second test case with the backtick-space and then the comment; it seems like the original regex handles it fine, although it's unclear to me why that is.

I had written a lot more stuff but then I realized that inside sets, special characters generally lose their meaning other than certain uses of - and ^. I didn't know this, which makes me wonder why my regex works at all (it did and does work in the cases I tested, although I'm certain it errs on the side of leaving comments there when it should remove them). Negative lookbehind sounds good, but according to the python docs it cannot (should not?) contain strings that aren't fixed-length:

The contained pattern must only match strings of some fixed length, meaning that abc or a|b are allowed, but a* and a{3,4} are not.

So something like

(?<!`|\t| {4})

might work? It wouldn't work in all cases, but it would in the most usual... Perhaps you could combine it with a start-of-line plus some negative lookahead to handle that better? The trick would be that it would probably match the wrong portion of the line, ie, the part you wanted to keep.

Regular expressions are tricky!

It looks like my original says "match any number of anything before this except parenthesis, spaces, and backticks". So you could change it to [^ \t]*` and it would be a little bit more accurate. I think the reason the original one works is because of how the regex decides to do matching (greedy or non-greedy)? It's subtle and pretty coincidental for sure, and not as trivial as I hoped, at least to handle all cases.

ryneeverett commented 9 years ago

Sounds right all around. Your regex-foo is certainly sharper than mine.

Regarding ^ as start-of-line or not, as well as regarding [] for ranges...

Cool. I get it now.

Regarding odd/even backticks...

Yeah, that's tricky and the markdown spec is rather vague. Of course, the only relevant thing to test would be python-markdown itself, which generally relies upon php-markdown as the reference implementation.

But your implementation is probably fine for 99% of use cases. (There are so many use cases for talking about python-markdown-comments, haha).

Negative lookbehind sounds good, but according to the python docs it cannot (should not?) contain strings that aren't fixed-length

You're right. I've been enjoying grep --perl-regexp too much. Perl doesn't have that limitation.

I'm not sure if there's any way to do the capture in python with a single regex, but we shouldn't even need a look-behind for the match. We could do something like:

[^^\t(^ {4})((.*`.*`)*.*`)].*<!---.*-->

Of course, that still doesn't do anything to address the corner cases of backtick literals/escaping, but it's a proof of concept.

Another consideration is the performance implications of lengthening the regex's. If timeit shows a significant slowdown, then perhaps this should be a feature enabled by a meta=True setting. Alongside the overhead of python-markdown itself though, there's a real possibility it will be undetectable.

snoozbuster commented 9 years ago

The trouble with [^^\t(^ {4})((.*.`).`)].` is that inside the brackets, all of those characters just mean the literal characters.

ryneeverett commented 9 years ago

Ah.

It also occurred to me that .* is will eat tilde's, so I'm thinking those would have to be [^]*`.

So maybe:

((?=[^^\t])(?=[^^ {4}])(?=([^`]*`[^`]*`)*[^`]*`).*<!---.*-->

So far, we've been checking whether the line meets the comment conditions:

Another option is to check whether it meets the not-a-comment conditions:

I'd think we could do this with a negative lookahead. So we'd expect our negative lookahead to return True if our line is:

^(?!([^\t]|[^ {4}]|([^`]*`[^`]*`)*[^`]*`)|.*<!---.*-->)
ryneeverett commented 9 years ago

Oh wait,

([^`]*`[^`]*`)*[^`]*`)

is still matching positive. I'm not sure how that could be negative matched.

Maybe the only way to do this is to check the opposite with a positive lookahead (is a code block/line) and negative lookahead (not a comment match).

((?=(^([\t]|[ {4}])|(?=([^`]*`[^`]*`)*[^`]*`).*(?!<!---.*-->)
ryneeverett commented 9 years ago

Every time I look at this it smells different.

((?!=(^([\t]|[ {4}])|([^`]*`[^`]*`)*[^`]*`).*<!---.*-->

The first part uses non-consuming negative look-ahead to return False if any of the conditions of being a code block/line are met. The second should return True only if it is a markdown comment.

snoozbuster commented 9 years ago

Test it out in one of those online regex matchers. Or, alternatively, I'll give it a shot when I go in to work in about an hour.

ryneeverett commented 9 years ago

It occurred to me that it might not be desirable to allow meta-comments in code blocks. It seems more plausible that one would want an actual comment in a code block than a meta-comment and I'm not sure how that ambiguity could be resolved. It might make sense to only support inline meta-comments.

snoozbuster commented 9 years ago

I suppose, yeah. The only use case I can think of for wanting comments to appear in code blocks is if you're writing about Markdown code in general. My reasoning was that I wanted to be consistent; it seemed reasonable that if you put a meta-comment anywhere that would generate code, it would be interpreted as code as opposed to a comment.

snoozbuster commented 9 years ago

Your regex is missing a couple parenthesis, but I don't think it works. If there's any ```s in front of a comment, all the ones prior don't match I think we need negative look-behind instead, maybe something like... Oh, you know what?

Why don't we just say that if you put a \ in front of it, it doesn't get converted? Then the regex is just

(?<!\\)<!---.*-->

And we don't have to guess what the user wants.

I think the code would have to be changed slightly to replace the \<!--- with <!---, that would be the only change.

I think if we wanted to do it the other way we would need multiple regex's. I don't think the conditions we're searching for form a regular language, I think they're context-free.

snoozbuster commented 9 years ago

This should work:

if re.match(r'(?<!\\)<!---.*--->', line):
    return ...
elif re.match(r'(?<!\\)<!--', line):
     return ...
elif re.match(r'\\<!---', line):
    return [re.sub(r'\\<!---', '<!---', line), False]
else:
    return ...

I can't seem to test it because I think it's still using the old version... I'm not exactly clear on where I'm supposed to install this plugin. I eventually gave up and dropped it into the markdown.extensions package, but now there's no .pyc for it now so I have no idea where it's compiling to.

ryneeverett commented 9 years ago

Sorry, I haven't actually tried any of those they were just proofs of concept.

Regarding your suggested escape syntax, aren't you just creating a meta-meta-comment? How would one write about meta-meta-comments? At some point we have to arbitrarily choose among the ambiguities. I don't think there's any way to avoid that.

Regarding installation, the most reasonable method I'm aware of is to just copy the file into your project. I would not try to patch it into python-markdown.

ryneeverett commented 9 years ago

I just added an __init__.py file, so another option would be to include the repo as a git subtree, assuming you're using git.

snoozbuster commented 9 years ago

Well, I'm using Markdoc, which doesn't indicate at all where you should put extensions for it to use. I momentarily forgot this is an extension for python-markdown in general; never mind me.

Yeah, I suppose I am just creating meta-meta comments in that case. Reminds me of this XKCD...

ryneeverett commented 9 years ago

So my plan was to tell you that the way to do meta comments is simply to make those code lines/blocks html, which isn't supposed to be parsed by markdown. So I tested that out and it turns out it doesn't work -- those comments still get removed. That's a bug.

ryneeverett commented 9 years ago

The problem is that mkdcomments is a pre-processor. Ideally it would probably include both inline- and block-parsers. However it might be easier to simply make it a post-processor and remove any <p> elements created by mutli-line comments.

https://pythonhosted.org/Markdown/extensions/api.html

snoozbuster commented 9 years ago

If you just made them HTML, wouldn't it just turn into an HTML comment?

snoozbuster commented 9 years ago

Oops.

ryneeverett commented 9 years ago

I'm talking about inline html.

Inline

<!--- test ---> <!---- test --> testing quote-single-space

some text for testing more text

Expected:

<\p><\/p> <\p> testing quote-single-space<\/p>

<\p>some text for testing more text</\p>

Block

<pre>
<!--- test --> testing code blocks
<!--- test --> testing 8 spaces
<!--- test --> testing 5 spaces
</pre>

Should return the same string.

snoozbuster commented 9 years ago

Oh, right, okay. Wouldn't the < and > have to be escaped inside the <code> and <pre> blocks, though?

ryneeverett commented 9 years ago

Python-markdown replaces html blocks with placeholders early in the process and puts them back late in the process. So all it would take to fix this would be having the comment regex run while code blocks are not present. (Instead of before the process, which is what currently happens.)

ryneeverett commented 9 years ago

So I forgot that the html is replaced with placeholders by a preprocessor, so making markdown-comments run immediately afterwards would be as simple as:

md.preprocessors.add("comments", CommentsProcessor(md), ">html_block")

Unfortunately, since markdown-comments are valid html comments, they would all be replaced with placeholders by markdown and our regex's would find nothing.


The only solution that comes to mind requires three parsers:

1) Preprocessor before html_block that searches for markdown-comments and modifies them so they're not valid html. 2) Preprocessor after html_block that removes any markdown-comments found. (The ones in html will have been replaced with placeholders by markdown.) 3) Postprocessor after raw_html (which replaces placeholders with raw html) that modifies the markdown comments in raw html back to their original state before they were munged by the first preprocessor.

snoozbuster commented 9 years ago

This is turning out to be a much larger project than I initially anticipated.

ryneeverett commented 9 years ago

I think that strategy worked. My description of inline html in markdown above was incorrect. From the spec:

Unlike block-level HTML tags, Markdown syntax is processed within span-level tags.

snoozbuster commented 9 years ago

Sweet.

snoozbuster commented 9 years ago
$ python
>>> import markdown, mkdcomments
>>> md = markdown.Markdown(extensions=["mkdcomments:CommentsExtension"])
>>> md.convert("some text `<!--- inline comment -->` more text example `<pre>` tag")
u'<p>some text <code>` more text example</code><pre>` tag</p>'

To be fair, this isn't what the comments were designed to handle, and I figured out what Markdown is seeing, but that's still not at all what I'd expect to see from that.

Additionally, after some in-browser testing, although your test you added does pass, it doesn't do what you'd expect. HTML is still processed inside a <pre> block, so the comment is inserted into the HTML and promptly stripped by your browser. I think the RawCommentReplacer needs to escape the opening (and probably closing) angle bracket. The test should probably be changed to

def test_comments_in_html(self): 
     """ See issue #2. """ 
     md_input = """\ 
         <pre> 
             <!--- test --> testing code blocks 
                 <!--- test --> testing 8 spaces 
              <!--- test --> testing 5 spaces 
             <!--- testing multiline comments
                    more
                    more
               -->
         </pre>""" 
    md_output = """\ 
         <pre> 
             &lt;!--- test --&gt; testing code blocks 
                 &lt;!--- test --&gt; testing 8 spaces 
              &lt;!--- test --&gt testing 5 spaces 
             &lt;!--- testing multiline comments
                  more
                  more
              --&gt;
         </pre>"""
     self.assertExpectedMarkdown(md_input, md_output) 

as well (but making sure the number of spaces on the multiline stuff is right).

ryneeverett commented 9 years ago

that's still not at all what I'd expect to see from that.

I'm not sure what the output you'd expect to see is, but I think this extension is doing the right thing. From the spec:

Unlike block-level HTML tags, Markdown syntax is processed within span-level tags.

Therefore it makes sense to me that:

>>> assert (
...     markdowner.convert("some text `<!--- inline comment -->` more text example `<pre>` tag")
...     ==
...     markdowner.convert("some text `` more text example `<pre>` tag")
... )
ryneeverett commented 9 years ago

Additionally, after some in-browser testing, although your test you added does pass, it doesn't do what you'd expect. HTML is still processed inside a <pre> block, so the comment is inserted into the HTML and promptly stripped by your browser. I think the RawCommentReplacer needs to escape the opening (and probably closing) angle bracket.

Correct, the comments will not render because they are valid html comments. And yes, the brackets would need to be escaped in order for them to be rendered.

The question is whether that's the responsibility of this extension or the user. I would be inclined to defer to what python-markdown does in this situation. That is, if you wanted to display regular html comments in a <pre> block you'd have to escape them as well, which markdown does not. Eg.:

<pre>
        <!-- test --> testing code blocks
                <!-- test --> testing 8 spaces
         <!-- test --> testing 5 spaces
</pre>

vs.

<pre>
        &lt;!-- test --&gt; testing code blocks
                &lt;!-- test --&gt; testing 8 spaces
         &lt;!-- test --&gt; testing 5 spaces
</pre>
snoozbuster commented 9 years ago

I suppose that's fair enough. I had written a page and my comment wasn't showing up in my <pre> block and it was very confusing, and then I remembered that it was turning into an HTML comment. It just seems like that defeats the point of not removing them in the source, because if I wanted to talk about the comments I could have written

<pre>
&lt;!--- A comment --&gt;
</pre>

without any of this work.

ryneeverett commented 9 years ago

Correct, the changes don't really affect any real-world use cases. But the extension no longer violates the contractual promise of the markdown spec not to meddle with the contents of html blocks and I consider that a worthy improvement.