w3c / mathml-core

MathML Core draft
https://w3c.github.io/mathml-core

MathML support in the HTML Sanitizer API #227

Open fred-wang opened 3 months ago

fred-wang commented 3 months ago

See https://wicg.github.io/sanitizer-api/

Some work has been done to handle the MathML/SVG namespaces, but the spec should likely specify a default safelist; see https://github.com/WICG/sanitizer-api/issues/103#issuecomment-2009143357 (IIRC, the API allows web developers to accept more elements/attributes that are not in the safelist, though).

So this issue is about discussing what we want to suggest as a default safelist for MathML.

In another issue, I had commented to try and follow MathML Core as much as possible as that's what browsers are expected to implement: https://github.com/WICG/sanitizer-api/issues/167#issuecomment-1415147702

Some more comments:

Firefox already has a safelist, but I guess it is not very strict; for example, it still allows XLink href and Content MathML markup. The bug is https://bugzilla.mozilla.org/show_bug.cgi?id=1787594

For Chromium, I don't remember without checking. It probably does not include more than what is in MathML Core, since we never implemented more.

I'm not sure whether the Sanitizer API is actually being implemented in WebKit.

dginev commented 3 months ago

I see the current sanitization algorithms have a configurable on/off toggle for data- attributes:

  1. If "data-" is a code unit prefix of local name and if namespace is null and if config["dataAttributes"] exists and is false:

    1. Remove attr from child.

Similarly to config["dataAttributes"], it would be nice to have config["annotationElements"] (or a related name) to control whether the MathML Core elements annotation and annotation-xml should be kept or removed, whatever non-standard material they may happen to contain.
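
A minimal sketch of how the quoted data- step and the proposed toggle could sit side by side. It runs on plain objects rather than real DOM nodes, and it assumes every element in the tree is already in the MathML namespace. The types AttrModel, ElementModel and SketchConfig, the function sanitizeSketch, and the annotationElements option itself are hypothetical names used for illustration; only dataAttributes comes from the quoted algorithm.

```typescript
// Plain-object stand-ins for DOM nodes so the sketch runs without a browser.
interface AttrModel { namespace: string | null; localName: string; }
interface ElementModel { localName: string; attrs: AttrModel[]; children: ElementModel[]; }

// "dataAttributes" mirrors the quoted step. "annotationElements" is the
// hypothetical toggle suggested in this comment, not part of any spec.
interface SketchConfig { dataAttributes?: boolean; annotationElements?: boolean; }

const ANNOTATION_ELEMENTS = new Set(["annotation", "annotation-xml"]);

function sanitizeSketch(node: ElementModel, config: SketchConfig): ElementModel {
  // Quoted step: remove a null-namespace data-* attribute only when the
  // option exists and is false.
  const attrs = node.attrs.filter(
    (a) =>
      !(a.localName.startsWith("data-") &&
        a.namespace === null &&
        config.dataAttributes === false)
  );
  // Hypothetical analogue: remove annotation and annotation-xml children
  // only when the toggle exists and is false.
  const children = node.children
    .filter(
      (c) =>
        !(ANNOTATION_ELEMENTS.has(c.localName) &&
          config.annotationElements === false)
    )
    .map((c) => sanitizeSketch(c, config));
  return { localName: node.localName, attrs, children };
}
```

With an empty config both checks are skipped and the tree is copied unchanged, which matches the "exists and is false" wording of the quoted step.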

fred-wang commented 3 months ago

> Similarly to config["dataAttributes"], it would be nice to have config["annotationElements"] (or a related name) to control whether the MathML Core elements annotation and annotation-xml should be kept or removed, whatever non-standard material they may happen to contain.

Right, I believe we should have both 1) a strict default subset implemented natively in browsers and 2) a way to relax it for web developers using the API. I haven't read the spec for a while, but that's how I had understood the situation for HTML. It would be good if MathML folks could spend some time to ensure this is the case for MathML too.
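
To make the two layers concrete, here is a rough sketch, assuming a default list that is only an illustrative subset of MathML Core element names and a hypothetical helper effectiveAllowList. It does not reflect the actual Sanitizer API configuration format.

```typescript
// Illustrative subset of MathML Core element names; NOT a proposed safelist.
const DEFAULT_MATHML_ELEMENTS: readonly string[] = [
  "math", "mrow", "mi", "mn", "mo", "mfrac", "msqrt", "msup", "msub", "mtext",
];

// Hypothetical helper: the strict default plus whatever the web developer
// explicitly opts into.
function effectiveAllowList(extraElements: readonly string[] = []): Set<string> {
  return new Set([...DEFAULT_MATHML_ELEMENTS, ...extraElements]);
}

// 1) Strict default, as a browser might ship it.
const strictList = effectiveAllowList();
// 2) Relaxed by a web developer who wants to keep annotations.
const relaxedList = effectiveAllowList(["semantics", "annotation", "annotation-xml"]);
```

The point of the split is that anything outside the default has to be requested explicitly by the page author.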

polx commented 3 months ago

I think that MathML in general is safe. Any element, except perhaps those that can contain script-oriented elements, should be in the accept-list.

dginev commented 3 months ago

As requested in the Math WG meeting on March 28, here is a link to the MathML-related CVEs on record, currently 8:

NIST National Vulnerability Database, keyword MathML

I remember noticing back in the day that cases such as CVE-2021-38193 and CVE-2020-26870 appeared to be examples where switching between parsing contexts enabled exploits. This was the reason I flagged annotation-xml as a potential exploit vector in my previous comment: depending on how its non-standard contents are processed, an implementation may hit other DOM-related edge cases.

annotation should just be character data, so likely a step safer (and more directly comparable to data-* attributes).

polx commented 2 months ago

Hello @fred-wang and all,

we discussed the subject in the MathML-core meeting yesterday and I think that the following seems to have met everyone's agreement:

We converged on the view that the skepticism about the security of data-* attributes is similar to that about the annotation and annotation-xml elements: the problem does not lie in MathML itself, but it may be important to shut the tap proactively, since it could become a problem for every user. The simplest examples include LaTeX source code fragments. Thus we propose, as @dginev hinted, that:

1) the Sanitizer API should have a switch that controls whether the non-MathML content of annotation and annotation-xml elements is wiped out.

Other than that, we see that sanitization needs to wipe out:

2) any namespace declaration or external entity reference that is not entirely produced synthetically by the browser (in particular, the MathML, SVG and HTML namespaces should be considered safe, but others probably not)

3) the actiontype attribute of maction if it is not one of the standard values statusline or toggle

4) the src attribute of annotation and annotation-xml (as some parsers may consider this similar to an img element): sanitizers may inline that content instead.
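
For illustration, points 1 to 4 could be sketched roughly as follows. This works on a plain-object tree rather than the real DOM, models neither text nodes nor entity references, approximates point 2 by dropping elements outside the three namespaces, and does not attempt the inlining mentioned in point 4. MathNode, AttrPair, applyProposedRules and the keepForeignAnnotationContent flag are hypothetical names, not part of the Sanitizer API.

```typescript
const MATHML_NS = "http://www.w3.org/1998/Math/MathML";
const SVG_NS = "http://www.w3.org/2000/svg";
const HTML_NS = "http://www.w3.org/1999/xhtml";

const SAFE_NAMESPACES = new Set([MATHML_NS, SVG_NS, HTML_NS]);
const ANNOTATIONS = new Set(["annotation", "annotation-xml"]);
const SAFE_ACTIONTYPES = new Set(["statusline", "toggle"]);

interface AttrPair { name: string; value: string; }
interface MathNode {
  namespace: string;
  localName: string;
  attrs: AttrPair[];
  children: MathNode[];
}

// Returns a cleaned copy of the node, or null if the node itself is dropped.
function applyProposedRules(
  node: MathNode,
  keepForeignAnnotationContent: boolean // the switch from point 1
): MathNode | null {
  // Point 2 (approximation): drop elements outside MathML, SVG and HTML.
  if (!SAFE_NAMESPACES.has(node.namespace)) return null;

  const isMathML = node.namespace === MATHML_NS;
  const isAnnotation = isMathML && ANNOTATIONS.has(node.localName);
  const isMaction = isMathML && node.localName === "maction";

  const attrs = node.attrs.filter((a) => {
    // Point 4: drop src on annotation and annotation-xml.
    if (isAnnotation && a.name === "src") return false;
    // Point 3: drop non-standard actiontype values on maction.
    if (isMaction && a.name === "actiontype" && !SAFE_ACTIONTYPES.has(a.value)) {
      return false;
    }
    return true;
  });

  const children: MathNode[] = [];
  for (const child of node.children) {
    // Point 1: optionally wipe non-MathML children of annotation elements.
    if (isAnnotation && !keepForeignAnnotationContent && child.namespace !== MATHML_NS) {
      continue;
    }
    const cleanedChild = applyProposedRules(child, keepForeignAnnotationContent);
    if (cleanedChild !== null) children.push(cleanedChild);
  }
  return { namespace: node.namespace, localName: node.localName, attrs, children };
}
```

Calling applyProposedRules(tree, false) gives the strict behaviour; passing true keeps HTML or SVG content inside annotation-xml while still applying points 2 to 4.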

We also considered it important that this issue carry a few examples of the potential that including the MathML elements in the sanitizer may bring.

Finally, we highlighted the potential of TrustedTypes as an application that may be relevant for sanitizers. So far, though, I see this only as a potential. I would suggest that we ask for the MathWG or Math CG to be "called back" when TrustedTypes may intersect the Sanitizer API beyond its current scope (which I understand to be a baseline converter that transforms web content into something that can be exchanged, in a way considered safer, beyond the browser's current page).

Do you agree with the approach proposed in points 1 to 4? If so, I suggest we go to the Sanitizer API issues and propose it as a safelist.

Thanks in advance.

Paul