mity / md4c

C Markdown parser. Fast. SAX-like interface. Compliant to CommonMark specification.
MIT License
776 stars 146 forks source link

Feature request: safe mode #19

Open DemiMarie opened 7 years ago

DemiMarie commented 7 years ago

This is a feature request for a “safe mode”: potentially-malicious content (such as certain URL schemes) is disallowed to prevent XSS attacks.

The output from this should be safely insertable into a webpage, without any further escaping or sanitization.

mity commented 7 years ago

Ack. This is needed.

The open question is what exactly it should do.

My current blurry idea includes these points:

Opinions? Other ideas?

DemiMarie commented 7 years ago

It would be nice to be able to whitelist some HTML tags, like Snudown (Reddit’s non-conformant parser) can. But that would be hard. On the other hand, MD4C already must be able to parse out a tag for inline HTML, and in safe mode we can simply disallow anything that isn’t strictly valid. (Checking for basic syntax correctness isn’t that hard. The hard part in parsing HTML is handling invalid markup. But we can simply reject invalid markup.)

mity commented 7 years ago

Whitelisting inline (correctly formed) tags is fine.

But the raw HTML block with its start and end conditions is quite different story.

So to be sure, you propose to treat the raw HTML blocks in the safe mode differently and recognize them only if they are a sequence of complete and valid (whitelisted) tags. I.e. treat them almost the same way as paragraph composed of nothing but raw HTML inlines and the only difference would be the block is not enclosed in <p>. Right?

I'll think about it.

DemiMarie commented 7 years ago

@mity I was thinking of ignoring HTML blocks entirely, which MD4C can already do. Much simpler and less prone to security vulnerabilities, both memory safety violations and whitelist validation failures. And HTML blocks are not needed for most applications that want a safe mode.

mity commented 7 years ago

Ugh. That means that this:

<div>

Paragraph 1.

Paragraph 2.

</div>

would be translated to this:

<p><div></p>

<p>Paragraph 1.</p>

<p>Paragraph 2.</p>

<p></div></p>
DemiMarie commented 7 years ago

Yeah, that's not good. Fortunately we can white list that kind of HTML block.

On Jul 14, 2017 7:46 PM, "Martin Mitáš" notifications@github.com wrote:

Ugh. That means that this:

Paragraph 1. Paragraph 2.

would be translated to this:

Paragraph 1.

Paragraph 2.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/mity/md4c/issues/19#issuecomment-315492259, or mute the thread https://github.com/notifications/unsubscribe-auth/AGGWB4hUkB52AMRzWDxDKuGm8oK8SYeuks5sN_3BgaJpZM4OYdD9 .

mity commented 7 years ago

I tried to implement something as discussed above. It is very incomplete but I already can see this approach has some big problems. So I consider it to be just a proof of concept which shows its severe limitations and that real solution needs more worked then I thought. (As always ;-)

The open problems:

DemiMarie commented 7 years ago

I disagree. There will certainly be safe cases that are rejected, but detecting an HTML tag in "canonical form" (that that serializers emit) is not any harder than recognizing an XML tag. In fact, one can do both(!) with regular expressions. The difficulties appear when one needs to handle HTML that is invalid, or that isn't in canonical form.

That said, a proper HTML parser is necessary if one wants to ensure that markup is valid, unless one wants to implement most of an XML parser. For many use-cases, this isn't a problem.

On Jul 22, 2017 9:54 PM, "Craig Barnes" notifications@github.com wrote:

It seems to me like there are only 2 truly "safe" ways to handle inline HTML.

  1. Disallow HTML tags entirely and escape angle brackets as < and >.
  2. Allow HTML as usual and post-process the output through a proper HTML sanitizer.

The number of ways to trip up anything short of a full HTML5 parser are too many.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/mity/md4c/issues/19#issuecomment-317222812, or mute the thread https://github.com/notifications/unsubscribe-auth/AGGWB5tBhB_foMijg643h5Q_8I5M2yltks5sQqftgaJpZM4OYdD9 .

DemiMarie commented 7 years ago

We definitely don't want to allow comments – IE executes scripts in conditional comments. I think it is best to strip them out.

Processing instructions and CDATA sections should just be unrecognized. I don't understand why CommonMark recognizes them; they are parse errors in HTML and thus rather useless. I think that the only HTML that should be allowed in safe mode should be complete, whitelisted, and well-formed tags. Anything else should not be allowed.

On Jul 18, 2017 5:34 PM, "Martin Mitáš" notifications@github.com wrote:

I tried to implement something as discussed above. It is very incomplete but I already can see this approach has some big problems. So I consider it to be just a proof of concept which shows its severe limitations and that real solution needs more worked then I thought. (As always ;-)

The open problems:

-

Raw HTML blocks: Currently not handled at all.

filter_url_scheme() callback: Is it the right way? Maybe the app might want to see whole link destination and not just the scheme. (Consider e.g. relative links without any scheme. App might still want to forbid them if they start e.g. with .. or '/'). That way, the interface could be unified with the attribute callback (see below.) On the flip side, app then must parse the scheme on its own if that's what it wants to filter.

Tag attributes: Should not be there also filter_raw_html_tag_attribute( )? Consider e.g. all those onclick, onload and similar HTML attributes.

Markdown construct versus semantically equivalent raw HTML: Consider e.g. innocent text versus <a href="dangerous_url">innocent text. Should not this be somehow unified in the filtering interface? Maybe, for the filtering purposes, the Markdown construct should be translated to raw html tag/attribute firthe purposes of the filtering, so app does not have a chance to screw it up by forbidding one and allowing the 2nd or vice versa?

HTML comments or CDATA;? Do we want to support them as well in the safe mode? I tend to say no and to keep it relatively simple.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/mity/md4c/issues/19#issuecomment-316204868, or mute the thread https://github.com/notifications/unsubscribe-auth/AGGWB-n2Eel6U_-QMepwm_Ann1iT2HRyks5sPST2gaJpZM4OYdD9 .

craigbarnes commented 7 years ago

Whitelisting specific elements isn't sufficient to stop scripting attacks. Attributes also need to be filtered:

<div style="background: red; width: 100%; height: 100%" onmouseover="alert('pwned')">
mity commented 7 years ago

Whitelisting specific elements isn't sufficient to stop scripting attacks. Attributes also need to be filtered:

<div style="background: red; width: 100%; height: 100%" onmouseover="alert('pwned')">

Yeah. I have realized that too (see the comment with open comments). I am in process of thinking how the interface of the filtering should look like.

mity commented 7 years ago

Right now, I tend to discard the attempt above and start again.

As said, I am mostly wondering how the interface should look like. I am especially interested in a feedback for the following questions:

craigbarnes commented 7 years ago

Is it enough to call a tag callback and then (n-times) attribute callback (for each attribute)? Or are there cases where the filter needs to know all tags attributes (and their values) at the same time?

The only cases where you need to know the element name, attribute name and attribute value at the same time are a[href], area[href] and img[src].

The large majority of attributes suitable for whitelisting are global attributes (applicable/safe for all elements), where the value is implicitly safe (doesn't need to be known by the filter, doesn't depend on the presence of other attributes).

I've implemented a minimal HTML5 DOM sanitizer in Lua here. The safeness of the code is dependant on the correctness of the underlying HTML5 parser, but the set of whitelisting rules should be similar to what you might want to use here.

If a single attribute is refused, should just be that single attribute be skipped, or should the whole tag be downgraded from the "html tag" status and be treated as a normal text?

Just that attribute should be omitted. I can't think of any case where removing a non-whitelisted attribute from a whitelisted element would make it unsafe.