w3c / mathml-core

MathML Core draft
https://w3c.github.io/mathml-core

MathML support in the HTML Sanitizer API #227

Open fred-wang opened 8 months ago

fred-wang commented 8 months ago

See https://wicg.github.io/sanitizer-api/

Some work has been done to handle the MathML/SVG namespaces, but the spec should likely specify a default safelist; see https://github.com/WICG/sanitizer-api/issues/103#issuecomment-2009143357 (IIRC, the API allows web developers to accept more elements/attributes that are not in the safelist, though).

So this issue is about discussing what we want to suggest as a default safelist for MathML.

In another issue, I had commented to try and follow MathML Core as much as possible as that's what browsers are expected to implement: https://github.com/WICG/sanitizer-api/issues/167#issuecomment-1415147702

Some more comments:

Firefox has a safelist already, but I guess it is not very strict; for example, it still allows XLink href and content MathML markup. The bug is https://bugzilla.mozilla.org/show_bug.cgi?id=1787594

For Chromium, I don't remember without checking further, but it probably does not include more than what is in MathML Core, since we never implemented more.

I'm not sure if the Sanitizer API is actually being implemented in WebKit.

dginev commented 8 months ago

I see the current sanitization algorithms have a configurable on/off toggle for data- attributes:

  1. If "data-" is a code unit prefix of local name and if namespace is null and if config["dataAttributes"] exists and is false:

    1. Remove attr from child.

Similarly to config["dataAttributes"] it would be nice to have config["annotationElements"] (or related name) to direct whether the MathML Core elements annotation and annotation-xml should be kept or removed, whichever non-standard material they may happen to contain.
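A minimal sketch of how such a toggle could sit next to the existing one, written in Python with `xml.etree.ElementTree` standing in for the DOM. `annotationElements` is only the name floated above and is not part of the Sanitizer API; the `sanitize` helper and its "exists and is false" handling of the new key are assumptions modelled on the quoted data- step.

```python
import xml.etree.ElementTree as ET

MATHML_NS = "http://www.w3.org/1998/Math/MathML"
ANNOTATION_TAGS = {
    f"{{{MATHML_NS}}}annotation",
    f"{{{MATHML_NS}}}annotation-xml",
}


def sanitize(root, config):
    """Apply the two toggles in place and return root (hypothetical helper)."""
    # "exists and is false": a missing key leaves the content untouched.
    drop_data = config.get("dataAttributes") is False
    drop_annotations = config.get("annotationElements") is False
    for element in list(root.iter()):
        if drop_data:
            for name in list(element.attrib):
                # ElementTree writes namespaced attributes as "{ns}local",
                # so a plain "data-" prefix implies a null namespace.
                if name.startswith("data-"):
                    del element.attrib[name]
        if drop_annotations:
            for child in list(element):
                if child.tag in ANNOTATION_TAGS:
                    element.remove(child)
    return root
```

With an empty config both data- attributes and annotation elements are kept, which mirrors how the quoted step only fires when the key is present and false.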

fred-wang commented 8 months ago

Similarly to config["dataAttributes"] it would be nice to have config["annotationElements"] (or related name) to direct whether the MathML Core elements annotation and annotation-xml should be kept or removed, whichever non-standard material they may happen to contain.

Right, I believe we should have both 1) a strict default subset implemented natively in browsers and 2) a way to relax it for web developers using the API. I haven't read the spec for a while, but that's how I had understood the situation for HTML. It would be good if MathML folks could spend some time to ensure this is the case for MathML too.

polx commented 8 months ago

I think that MathML in general is safe. Any element, except maybe those that contain script-oriented ones, should be in the accept-list.

dginev commented 8 months ago

As requested in the Math WG meeting on March 28, here is a link to the MathML-related CVEs on record, currently 8:

NIST national vulnerabilities database, keyword MathML

I remembered noticing back in the day that cases such as CVE-2021-38193 and CVE-2020-26870 appeared to be examples where switching between parsing contexts hosted exploits. This was the reason I flagged annotation-xml as a potential exploit vector in my previous comment, since depending on how its non-standard contents are processed, an implementation may hit other DOM-related edge cases.

annotation should just be character data, so likely a step safer (and more directly comparable to data-* attributes).

polx commented 7 months ago

Hello @fred-wang and all,

we discussed the subject in the MathML-core meeting yesterday and I think that the following seems to have met everyone's agreement:

We converged on the fact that the skepticism about the security of data-* attributes is similar to that about the annotation and annotation-xml elements: the problem is not with MathML itself, but it may be important to shut the tap proactively, as it may become a problem for every user. The simplest examples include LaTeX source code fragments. Thus we propose, as @dginev hinted, that:

1) the Sanitizer API should have a switch that controls whether the non-MathML contents of the annotation and annotation-xml elements are wiped out.

Other than that, we see that sanitization needs to wipe out:

2) any namespace declaration or external entity reference that is not entirely produced synthetically by the browser (in particular, the MathML, SVG and HTML namespaces should be considered safe, but others probably not)

3) the actiontype attribute of maction if it is not the standard value statusline or toggle

4) the src attribute of annotation and annotation-xml (as some parsers may consider this similar to an img element): sanitizers may inline that content instead.
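Points 3 and 4 are the most mechanical of the four, so here is a minimal sketch of them in Python, with `xml.etree.ElementTree` standing in for the DOM. The helper names are hypothetical, and the sketch only drops the `src` attribute; the inlining alternative mentioned in point 4 is not attempted.

```python
import xml.etree.ElementTree as ET

MATHML_NS = "http://www.w3.org/1998/Math/MathML"
STANDARD_ACTIONTYPES = {"statusline", "toggle"}


def mathml(local_name):
    """Return the ElementTree tag for a MathML element name."""
    return f"{{{MATHML_NS}}}{local_name}"


def strip_risky_attributes(root):
    """Apply points 3 and 4 in place and return root (hypothetical helper)."""
    for element in root.iter():
        if element.tag == mathml("maction"):
            actiontype = element.get("actiontype")
            # Point 3: drop non-standard actiontype values.
            if actiontype is not None and actiontype not in STANDARD_ACTIONTYPES:
                del element.attrib["actiontype"]
        if element.tag in (mathml("annotation"), mathml("annotation-xml")):
            # Point 4: never keep a reference to external annotation content.
            element.attrib.pop("src", None)
    return root
```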

We also considered it important that this issue carry a few examples of the potential that the sanitizer's inclusion of the MathML elements may bring.

Finally, we have highlighted the potentials of TrustedTypes as an application that may be relevant for the sanitizers. But so far, I see this as a potential only. I would suggest that we request that the MathWG or Math CG be "called back" when TrustedTypes may intersect the sanitizer APIs beyond its current scope (which I understand to be a baseline converter to transform web-content in something that can be exchanged in a way considered safer further than the browser's current page).

Do you agree with the approach proposed in points 1 to 4? If so, I suggest we go to the Sanitizer API issues and propose it as a safelist.

thanks in advance.

Paul

fred-wang commented 3 months ago

Hi,

Sorry for the late reply, I overlooked this was directed to me.

In general I don't have a strong opinion on this; the Sanitizer API is implemented in a part of the browser code that is relatively independent from MathML rendering. It should be fine to go ahead and talk to the people working on the Sanitizer API spec and find a consensus there. I didn't check the latest status regarding non-HTML namespaces.

Probably the main thing to pay attention to is that MathML Core is targeted at browsers, while MathML Full is used in other applications. So we would need to decide whether we only accept MathML Core markup or allow MathML Full markup (with maybe more sanitization for security/privacy sensitive markup that will need to be figured out). 1-4 seems to be about things that are not in MathML Core.
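To make the Core-only option concrete, here is a minimal Python sketch of an element allowlist, with `xml.etree.ElementTree` standing in for the DOM. `CORE_SUBSET` is an abbreviated, illustrative set and not the normative MathML Core element list, and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

MATHML_NS = "http://www.w3.org/1998/Math/MathML"
# Illustrative subset only; the MathML Core spec is the authoritative list.
CORE_SUBSET = {
    "math", "mrow", "mi", "mn", "mo", "mtext", "mfrac", "msqrt", "mroot",
    "msub", "msup", "msubsup", "munder", "mover", "munderover",
    "mtable", "mtr", "mtd", "semantics", "annotation", "annotation-xml",
}


def split_tag(tag):
    """Split an ElementTree tag "{ns}local" into (ns, local)."""
    if isinstance(tag, str) and tag.startswith("{"):
        ns, _, local = tag[1:].partition("}")
        return ns, local
    return None, tag


def keep_only_allowed(element, allowed=CORE_SUBSET):
    """Remove descendants outside the allowlist, in place; the root is not checked."""
    for child in list(element):
        ns, local = split_tag(child.tag)
        if ns != MATHML_NS or local not in allowed:
            element.remove(child)  # drops the whole subtree
        else:
            keep_only_allowed(child, allowed)
    return element
```

Under this sketch, content MathML such as `apply`, `ci` and `cn` inside an `annotation-xml` is removed while the `annotation-xml` wrapper itself survives; accepting MathML Full would amount to passing a larger `allowed` set.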

Note that Firefox's sanitization currently accepts content markup but at the cost of adding many atomic strings for each content MathML tag: https://bugzilla.mozilla.org/show_bug.cgi?id=1787594#c8

Regarding security/safety in browsers, the issues I'm aware of are described in https://w3c.github.io/mathml-core/#security-considerations and https://w3c.github.io/mathml-core/#privacy-considerations ; in particular, href is the one that can cause problems (unfortunately, the discussions regarding its inclusion in MathML Core are on hold). Note also the case of maction statusline (whose support was removed from browsers).

dginev commented 3 months ago

@fred-wang

So we would need to decide whether we only accept MathML Core markup or allow MathML Full markup (with maybe more sanitization for security/privacy sensitive markup that will need to be figured out). 1-4 seems to be about things that are not in MathML Core.

Your comment made me wonder why/how the elements annotation, annotation-xml and their container semantics made it into MathML Core.

If they are to be useful, their contents should be able to survive sanitization, at least in some cases. If not, maybe they are better thought of as MathML Full elements?

As a cross-spec thought: SVG has a construct similar to <annotation-xml> called foreignObject, but unlike the annotation elements, foreign objects are also used for rendering. So the HTML parser has defined (and limited) behavior over them, accepting HTML+SVG+MathML content. I wonder if the sanitization considerations for <annotation-xml> should be similar to that case?

Content MathML is indeed the classic use of <annotation-xml> to consider. And a common use of <annotation> is to carry a source format representation (usually TeX). Both kinds of annotations may be useful data structures to expose for JavaScript apps. But that ship may have also sailed at this point; I myself don't have a strong opinion on the direction here. This is part of the conversation I was hoping the group would have during the next charter of the Math WG.

polx commented 1 week ago

Maybe the right thing to do here is to be minimal first so that something comes through. The annotation* family of elements allows arbitrary content, whose security needs to be warned about (even TeX can have its security issues, I think).

While I agree we should strive for something useful, and href attributes and annotation* can be useful, they do not appear controllable to me (except maybe having well known media-types?).

bkardell commented 1 week ago

So we would need to decide whether we only accept MathML Core markup or allow MathML Full markup (with maybe more sanitization for security/privacy sensitive markup that will need to be figured out). 1-4 seems to be about things that are not in MathML Core.

Per the working group meeting today, we resolved that we feel it is OK to begin with MathML Core, and we'd like to move that forward.

polx commented 5 days ago

Something near this was discussed in, with, and about the fediverse tools:

polx commented 5 days ago

Here is my proposal:

MathML Core considers all elements and attributes other than the following to be safe to exchange without sanitization, e.g. when they are converted from a display to an input environment. We recommend that the Sanitizer API (ref) sanitize MathML by keeping all elements and attributes except:

  • the on* attributes (aimed at event handlers),
  • the maction elements whose actiontype attribute has the value statusline (the element can be replaced by its first child),
  • any annotation or annotation-xml element whose encoding attribute is either absent or names a media type that is not among the trusted types, or that contains an href attribute.
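A minimal sketch of these three rules in Python, with `xml.etree.ElementTree` standing in for the DOM. Several things here are assumptions rather than part of the proposal: `TRUSTED_ENCODINGS` is a made-up placeholder (the list of trusted media types is not defined in this thread), "contains an href attribute" is read as an un-namespaced `href` on the annotation element or any of its descendants, and all helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

MATHML_NS = "http://www.w3.org/1998/Math/MathML"
# Placeholder values; the thread does not say which media types are trusted.
TRUSTED_ENCODINGS = {"application/x-tex", "text/plain"}


def mathml(local_name):
    return f"{{{MATHML_NS}}}{local_name}"


def is_statusline_maction(el):
    return el.tag == mathml("maction") and el.get("actiontype") == "statusline"


def annotation_is_untrusted(el):
    if el.tag not in (mathml("annotation"), mathml("annotation-xml")):
        return False
    # A missing encoding gives None, which is never in the trusted set.
    untrusted_encoding = el.get("encoding") not in TRUSTED_ENCODINGS
    has_href = any("href" in d.attrib for d in el.iter())
    return untrusted_encoding or has_href


def sanitize(element):
    """Sanitize the subtree below element in place and return element."""
    for name in list(element.attrib):
        if name.lower().startswith("on"):  # event handler attributes
            del element.attrib[name]
    kept = []
    for child in list(element):
        # Replace a statusline maction by its first child (repeat if nested).
        while child is not None and is_statusline_maction(child):
            first = child[0] if len(child) else None
            if first is not None:
                first.tail = child.tail
            child = first
        if child is None or annotation_is_untrusted(child):
            continue
        kept.append(sanitize(child))
    element[:] = kept
    return element
```

The sketch removes an untrusted annotation entirely rather than emptying it, which is one possible reading of the last bullet.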

I was unsure about two aspects:

Thanks for your feedback.