snoyberg / markdown

Convert Markdown to HTML, with XSS protection
BSD 3-Clause "New" or "Revised" License
69 stars 401 forks source link

Pre blocks not treated properly #23

Closed ndmitchell closed 10 years ago

ndmitchell commented 10 years ago
>>> renderHtml $ markdown def "<pre>\na\n\nb\n</pre>"
"<pre>\na<p>b\n<pre></pre></p></pre>"

I would have expected <pre>\na\n\nb\n</pre>.

snoyberg commented 10 years ago

I think you need to set msXssProtect to False to get that to work.

ndmitchell commented 10 years ago

With msXssProtect turned off I get <pre>\na<p>b\n</pre></p> - so it removes a redundant <pre></pre>, but is still wrong. Why is msXssProtect inserting redundant and incorrect <pre>'s?

snoyberg commented 10 years ago

Hmm, not sure, I'll need to look into this tomorrow. Thanks for the report.

snoyberg commented 10 years ago

I'm actually not sure if the spec is clear on this one way or another. The spec says:

The only restrictions are that block-level HTML elements — e.g.

, ,
, 

, etc. — must be separated from surrounding content by blank lines, and the start and end tags of the block should not be indented with tabs or spaces.

What you have here is a blank line inside a literal HTML block. The issue I'm having is that there is ambiguity about how to deal with nesting of HTML tags, and more importantly lack of well-balanced tags. For example, what should be the output of:

hi

<pre>
hello
<pre>inner pre</pre>

I closed off that `pre` tag with: </pre>.

It's unclear what to do about the nested pre tags inside. Any change I make on this could potentially break existing documents, so unless there's a clear spec on the right way to handle these kinds of situations, I don't think I can make that change.

snoyberg commented 10 years ago

Interesting case-in-point in the previous comment. My example was meant to look like:

hi

<pre>
hello
<pre>inner pre</pre>

I closed off that `pre` tag with: </pre>.
ndmitchell commented 10 years ago

The current HTML doesn't nest properly, which seems like a bug unless the input has non-matching tags (my input all nests). Perhaps you should have a test that for any random string the HTML nests properly?

I checked the rendering with Github, StackOverflow and MarkdownPad 2 (my markdown editor) - all of them handle the pre the way I was expecting. Importantly, if you don't treat pre the way I was expecting, I can't see any obvious way of getting the effect I want, which is:

line 1
line 2 with italics
snoyberg commented 10 years ago

The lack of balancing might be a bug in xss-sanitize. That certainly seems to be where that specific bug is coming from.

So it seems like the behavior you're seeing and expecting is:

  • If a line starts with a block-level HTML tag, begin "HTML mode".
  • Continue in HTML, check of nestings of that original tag, and stop as soon as a close tag is found for that tag.
  • Completely ignore any issues of trailing whitespace for the HTML block.

Is that accurate?

snoyberg commented 10 years ago

I pushed a commit to https://github.com/yesodweb/haskell-xss-sanitize which should fix the unbalanced tags issue, can you give that a try?

ndmitchell commented 10 years ago

With msXssProtect=True I get:

<pre>
a<p>b
<pre></pre></p></pre>

With msXssProtect=False I get:

<pre>
a<p>b
</pre></p>

So with protect on the tags now nest, which is good. But the original output from markdown doesn't nest, which seems wrong - so it seems markdown feeds bad data to sanitise. Your description of the parse rules seems reasonable - but I'm no expert - matching GitHub/StackOverflow as far as possible seems a reasonable approach.

As a side note, for my project, having XssProtect is not necessary - I'm happy to have it either on or off. I'm writing a website generator - see https://github.com/ndmitchell/shake/blob/master/website/Main.hs if you are curious.

snoyberg commented 10 years ago

According to the current rules used by markdown (blank line terminates an HTML block), the input is non-nested HTML, which is where the problem comes from. Changing to the modified parse rules would fix this by making the original input valid, nested HTML, since the blank line in between would be treated as raw HTML as well.

I'll try to look into this next week.

snoyberg commented 10 years ago

I'm guessing that we should actually follow the new spec on markdown on this. However, I think this is the opposite of what you want @ndmitchell. We should probably discuss this while we're still in the same building. :)

ndmitchell commented 10 years ago

Follow the standard and I will adapt my code. As long as you match GitHub that seems reasonable. However, this I ssomewhere GitHub disagrees with the standard - worth pointing that out to the standards guys?

snoyberg commented 10 years ago

OK, I've pushed some changes to make HTML block treatment in line with the spec. I would assume the spec authors are aware that the spec contradicts what Github is doing right now, but it's worth pointing out.

Regarding your desired use case, the spec says:

Moreover, blank lines are usually not necessary and can be deleted. The exception is inside

 tags; here, one can replace the blank lines with 
 entities.

Which may not be what you were hoping for, but does seem to imply they considered this case when designing it, so I'd guess they're unlikely to want to change their decision here.

If this change works for you, I'll release it to Hackage.

ndmitchell commented 10 years ago

I have nothing public that uses it, so push away. It will probably be a few weeks before I get back to the project I was working on that uses it - assuming you've added some test cases for my examples above (so there behaviour is fixed and won't break in future) I suggest we close this bug.

snoyberg commented 10 years ago

I think there are enough test cases: I copied 11 examples out of the spec and turned them into examples, which all now passed. New version is now on Hackage.

ndmitchell commented 10 years ago

Many thanks!