mganss / HtmlSanitizer

Cleans HTML to avoid XSS attacks
MIT License
1.57k stars 199 forks source link

Style attribute triggers a RemovingAttribute Event with an empty value #504

Open florianculie opened 12 months ago

florianculie commented 12 months ago

Hi,

We are using the HtmlSanitizer 8.0.746 release to filter user input. To be precise, we do not sanitize it, we forbid the user to write the data if HtmlSanitizer triggers a RemovingN event.

We are having problems regarding the following HTML string :

<p style="text-align: start;">this is the content of the p tag</p>

We have a custom RemovingAttributeEvent that we are adding like this :

private void InitSanitizerEvents()
{
    // We adapt all the following events to set the IsValid variable to false if one of them trigger.
    // We also log the error message according to what was detected.
    _sanitizer.RemovingTag += RemovingTagEvent;
    _sanitizer.RemovingAtRule += RemovingAtRuleEvent;
    _sanitizer.RemovingStyle += RemovingStyleEvent;
    _sanitizer.RemovingAttribute += RemovingAttributeEvent;
    _sanitizer.RemovingComment += RemovingCommentEvent;
    _sanitizer.RemovingCssClass += RemovingCssClassEvent;
}

The code of our custom event is the following :

private void RemovingAttributeEvent(object sender, RemovingAttributeEventArgs args)
{
    string attName = args.Attribute.Name;
    string attValue = args.Attribute.Value;

    // Check if this is an event with allowed values, and if the value is allowed.
    if (_allowedEvents.TryGetValue(attName, out ISet<string> allowedValues))
    {
        if (allowedValues.Contains(attValue))
            args.Cancel = true;
    }
    else
    {
        // Handle specific cases in a switch
        switch (attName)
        {
            // We need this case as the 'class' attribute trigger this event, but we don't want to treat it here
            case "class":
                args.Cancel = true;
                break;
            case "src":
                if (attValue.StartsWith("data:image/", StringComparison.Ordinal))
                    args.Cancel = true;
                else
                    OnUnauthorizedLinkError(attValue);

                break;
            case "href":
                if (attValue.StartsWith("mailto:", StringComparison.Ordinal))
                {
                    try
                    {
                        var providedEmail = attValue.Split(':')[1];
                        var email = new System.Net.Mail.MailAddress(providedEmail);

                        args.Cancel = email.Address.Equals(providedEmail, StringComparison.Ordinal);
                    }
                    catch (Exception ex) when (ex is ArgumentException || ex is FormatException)
                    {/* error in parsing value means that email is invalid, which leads to security error. */}
                }
                else
                    OnUnauthorizedLinkError(attValue);

                break;
        }
    }

    if (!args.Cancel)
    {
        _isRichTextValid = false;
        _errors.Add(new CssAttributeError(args.Reason.ToString(), args.Tag.TagName, attName));
    }
}

To put it simply, base on the logic I described at the begining, we cancel the RemovingAttribute Event if the reason for the trigger is considered safe for us, otherwise we confirm the trigger, and we add the reason to the _error object to log all reasons.

The problem we are facing is the fact that the event is being triggered with an "incomplete" html, as you can see below : image Which gives us an args.Attribute.Name = "style" but an empty value for the actual style.

During the investigation of the issue, I did notice we were using the 6.0 version of your package, and by updating to 8.0, I still reproduce it. On a side note, we have been using your package for 6 year now, and we do not reproduce this issue with the 4.0.183 version, the version being used in an old version of our software.

Do you happen to have any idea if this is a normal behaviour or a bug ?

mganss commented 12 months ago

I can't reproduce the issue. Can you try and add a code snippet here that shows the issue?

This is what I tried:

var html = @"<p style=""text-align: start;"">this is the content of the p tag</p>";
var sanitizer = new HtmlSanitizer();
sanitizer.AllowedCssProperties.Remove("text-align");
sanitizer.AllowedAttributes.Remove("style");
sanitizer.RemovingAttribute += (s, e) =>
{
    Assert.Equal("text-align: start;", e.Tag.Attributes[0].Value);
};
var output = sanitizer.Sanitize(html);
florianculie commented 12 months ago

Here's the Program.cs content from a new .NET 6 Console App :

using Ganss.Xss;

namespace ConsoleApp1
{
    internal class Program
    {
        static void Main(string[] args)
        {
            Console.WriteLine("Hello, World!");

            var html = @"<p style=""text-align: start;"">this is the content of the p tag</p>";
            var sanitizer = new HtmlSanitizer();

//If you comment the two following lines, you will reproduce
            sanitizer.AllowedCssProperties.Remove("text-align");
            sanitizer.AllowedAttributes.Remove("style");
//
            sanitizer.RemovingAttribute += (s, e) =>
            {
                string test = e.Tag.Attributes[0].Value;
            };
            var output = sanitizer.Sanitize(html);

        }
    }
}

If I use your code, with the two .Remove calls, I have no issue. However, if I comment those two lines (since I don't want to disallow them), you can see that the

e.Tag.Attributes[0].Value;

is an empty string, and the e.Tag.OuterHtml does not have the style value : image

mganss commented 12 months ago

OK thanks, I can repro now. This occurs because the parsed style is empty in AngleSharp's CSSOM. Strangely, it does not occur if the style's value is text-align: left for example. On MDN it says for the initial value of text-align

start, or a nameless value that acts as left if direction is ltr, right if direction is rtl if start is not supported by the browser.

Perhaps start is not supported by AngleSharp and that's why it's not reflected in the OM? I'll open an issue in the AngleSharp.Css repo to find out what's going on.

mganss commented 12 months ago

I have reported as https://github.com/AngleSharp/AngleSharp.Css/issues/151

mganss commented 12 months ago

The start value for text-align is not yet supported by AngleSharp.Css.

tiesont commented 12 months ago

For what it's worth, version 4.0.1830 uses AngleSharp 0.9.9.1. AngleSharp split the CSS processing to a separate library in 0.10.0, so I assume any version of HtmlSanitizer from v5 onwards (which uses AngleSharp 0.13.0) probably behaves differently with respect to CSS parsing.

florianculie commented 12 months ago

Sorry for the delay in response. Thanks for your analysis. As it seems it's an invalid value, I think we will remove the style attribute from the problematic data for the project.

If I am correct with my understanding, we need to wait for Anglesharp to fix the issue your declared earlier, then migrate to the latest pre-release of HtmlSanitizer that uses this AngleSharp release. Is that it ? If so, do you happen to have a "know issues" with the newest version of AngleSharp ?

Thanks for your time on this subject.

mganss commented 12 months ago

@tiesont You're right. This scenario used to work before AngleSharp 0.10, see https://github.com/AngleSharp/AngleSharp.Css/issues/151#issuecomment-1823222961

@florianculie My hope is that the next release of AngleSharp.Css will be 1.0 which would mean you would no longer need a prerelease version of HtmlSanitizer. But in general, yes, you would need to wait for a new release of AngleSharp.Css.

I'm not aware of any issues with the newest versions of AngleSharp and AngleSharp.Css. This issue is the only regression I'm aware of.