Closed ndmitchell closed 10 years ago
I think you need to set msXssProtect
to False
to get that to work.
With msXssProtect
turned off I get <pre>\na<p>b\n</pre></p>
- so it removes a redundant <pre></pre>
, but is still wrong. Why is msXssProtect
inserting redundant and incorrect <pre>
's?
Hmm, not sure, I'll need to look into this tomorrow. Thanks for the report.
I'm actually not sure if the spec is clear on this one way or another. The spec says:
The only restrictions are that block-level HTML elements — e.g.
,,
,, etc. — must be separated from surrounding content by blank lines, and the start and end tags of the block should not be indented with tabs or spaces.
What you have here is a blank line inside a literal HTML block. The issue I'm having is that there is ambiguity about how to deal with nesting of HTML tags, and more importantly lack of well-balanced tags. For example, what should be the output of:
hi <pre> hello <pre>inner pre</pre> I closed off that `pre` tag with: </pre>.
It's unclear what to do about the nested
pre
tags inside. Any change I make on this could potentially break existing documents, so unless there's a clear spec on the right way to handle these kinds of situations, I don't think I can make that change.snoyberg commented 10 years agoInteresting case-in-point in the previous comment. My example was meant to look like:
hi <pre> hello <pre>inner pre</pre> I closed off that `pre` tag with: </pre>.
ndmitchell commented 10 years agoThe current HTML doesn't nest properly, which seems like a bug unless the input has non-matching tags (my input all nests). Perhaps you should have a test that for any random string the HTML nests properly?
I checked the rendering with Github, StackOverflow and MarkdownPad 2 (my markdown editor) - all of them handle the
pre
the way I was expecting. Importantly, if you don't treatpre
the way I was expecting, I can't see any obvious way of getting the effect I want, which is:line 1 line 2 with italicssnoyberg commented 10 years agoThe lack of balancing might be a bug in xss-sanitize. That certainly seems to be where that specific bug is coming from.
So it seems like the behavior you're seeing and expecting is:
- If a line starts with a block-level HTML tag, begin "HTML mode".
- Continue in HTML, check of nestings of that original tag, and stop as soon as a close tag is found for that tag.
- Completely ignore any issues of trailing whitespace for the HTML block.
Is that accurate?
snoyberg commented 10 years agoI pushed a commit to https://github.com/yesodweb/haskell-xss-sanitize which should fix the unbalanced tags issue, can you give that a try?
ndmitchell commented 10 years agoWith msXssProtect=True I get:
<pre> a<p>b <pre></pre></p></pre>
With msXssProtect=False I get:
<pre> a<p>b </pre></p>
So with protect on the tags now nest, which is good. But the original output from markdown doesn't nest, which seems wrong - so it seems markdown feeds bad data to sanitise. Your description of the parse rules seems reasonable - but I'm no expert - matching GitHub/StackOverflow as far as possible seems a reasonable approach.
As a side note, for my project, having XssProtect is not necessary - I'm happy to have it either on or off. I'm writing a website generator - see https://github.com/ndmitchell/shake/blob/master/website/Main.hs if you are curious.
snoyberg commented 10 years agoAccording to the current rules used by markdown (blank line terminates an HTML block), the input is non-nested HTML, which is where the problem comes from. Changing to the modified parse rules would fix this by making the original input valid, nested HTML, since the blank line in between would be treated as raw HTML as well.
I'll try to look into this next week.
snoyberg commented 10 years agoI'm guessing that we should actually follow the new spec on markdown on this. However, I think this is the opposite of what you want @ndmitchell. We should probably discuss this while we're still in the same building. :)
ndmitchell commented 10 years agoFollow the standard and I will adapt my code. As long as you match GitHub that seems reasonable. However, this I ssomewhere GitHub disagrees with the standard - worth pointing that out to the standards guys?
snoyberg commented 10 years agoOK, I've pushed some changes to make HTML block treatment in line with the spec. I would assume the spec authors are aware that the spec contradicts what Github is doing right now, but it's worth pointing out.
Regarding your desired use case, the spec says:
Moreover, blank lines are usually not necessary and can be deleted. The exception is inside
tags; here, one can replace the blank lines with entities.Which may not be what you were hoping for, but does seem to imply they considered this case when designing it, so I'd guess they're unlikely to want to change their decision here.
If this change works for you, I'll release it to Hackage.
ndmitchell commented 10 years agoI have nothing public that uses it, so push away. It will probably be a few weeks before I get back to the project I was working on that uses it - assuming you've added some test cases for my examples above (so there behaviour is fixed and won't break in future) I suggest we close this bug.
snoyberg commented 10 years agoI think there are enough test cases: I copied 11 examples out of the spec and turned them into examples, which all now passed. New version is now on Hackage.
ndmitchell commented 10 years agoMany thanks!
I would have expected
<pre>\na\n\nb\n</pre>
.