jamietre / CsQuery

CsQuery is a complete CSS selector engine, HTML parser, and jQuery port for C# and .NET 4.
Other
1.16k stars 249 forks source link

Certain characters in attribute values are not being escaped resulting in invalid HTML #198

Open jasonkres opened 9 years ago

jasonkres commented 9 years ago

Characters that should be escaped are not escaped when they occur in attribute values. That is, a valid HTML5 document (e.g., per https://html5.validator.nu/) can be passed to CsQuery and come out as an invalid HTML5 document.

I have tested this 1.3.3249 (NuGet) and CsQuery master. The problem reproduces in both.

Not sure, but suspect that because this affects escaping, this may be a security problem where undesired HTML can be injected.

Program:

using System;
using CsQuery;

namespace CsQueryTestApp
{
    class Program
    {
        static void Main(string[] args)
        {
            string h = "<!DOCTYPE html><html><head><meta charset=\"UTF-8\"><title>test</title></head><body><a href=\"http://www.example.com?a=1&amp;b=2\">Hello &amp; World</a></body></html>";
            Console.WriteLine("Valid per https://html5.validator.nu/:");
            Console.WriteLine(h);
            Console.WriteLine("Invalid per https://html5.validator.nu/:");
            Console.WriteLine(CQ.CreateDocument(h).Render());
        }
    }
}

Output:

Valid per https://html5.validator.nu/:
<!DOCTYPE html><html><head><meta charset="UTF-8"><title>test</title></head><body><a href="http://www.example.com?a=1&amp;b=2">Hello &amp; World</a></body></html>
Invalid per https://html5.validator.nu/:
<!DOCTYPE html><html><head><meta charset="UTF-8"><title>test</title></head><body><a href="http://www.example.com?a=1&b=2">Hello &amp; World</a></body></html>

Message from validator:

Error: & did not start a character reference. (& probably should have been escaped as &amp;.)
At line 1, column 117
example.com?a=1&b=2">Hello &am
schmalls commented 9 years ago

I have also encountered this issue with quotes and angle brackets.

Before:

<span class="email" data-href="mailto:&quot;John Doe&quot; &lt;john.doe@example.com&gt;">

After:

<span class="email" data-href='mailto:"John Doe" <john.doe@example.com>'>

As you can see, the entity references disappeared and the quote style switched. The angle brackets in the attribute cause it to no longer be valid.

jamietre commented 9 years ago

The angle brackets are quoted, they don't need to transformed. That validates fine. The & is a problem.

schmalls commented 9 years ago

I am using Prince to convert from XHTML to PDF and it gives me an error saying the Unescaped '<' not allowed in attribute values. So, it may be they are more strict than the validator. My question is, why are the entities being converted in the first place? If they weren't, it would fix both issues.

jamietre commented 9 years ago

CsQuery is a parser.. Though this is really HtmlParser at work here for the input When it loads a document it stores the actual value and not whatever representation created it from the source. In a pre tag for example an & is really an &. If we just stored the ASCII that came from the source then it would be impossible to treat text data as something that is usable in any context.

But HtmlParser is not broken. The problem is just in how I render it. & needs to be escaped in attribute values always.

I am using Prince http://www.princexml.com to convert from XHTML to PDF and it gives me an error saying the Unescaped '<' not allowed in attribute values. So, it may be they are more strict than the validator. My question is, why are the entities being converted in the first place? If they weren't, it would fix both issues.

— Reply to this email directly or view it on GitHub https://github.com/jamietre/CsQuery/issues/198#issuecomment-132738639.

jasonkres commented 9 years ago

So, it may be they are more strict than the validator.

Ideally, should be pluggable how these are handled when they occur in attributes. ?? Similar to how you handle other encoding issues with the HtmlEncoders ?? For this to be most useful, the encoder routine needs some context (especially: whether ' or " is the currently expected attribute delimiter).

jamietre commented 9 years ago

The code is here

public static string AttributeEncode(string text, bool alwaysQuote, out string quoteChar)...

It handles the quote context, but not the ampersand. The ampersand is the only character that requires encoding, and technically it only needs to be encoded if it is ambiguous -- so I'm actually surprised that validator.nu fails the example above.

I agree, there should be some control over how the attributes are generated. It might be desirable to use escaping for even legitimate characters for the same reason as elsewhere in the document (e.g. just for readability of unprintable characters in the source code). I can't think of a reason why the same translation used for the rest of the HTML wouldn't work in the attribute context.

There handling chooses apostrophe or double-quote in order to minimize the need for escaping quotes, but seems like this should be choosable (e.g. single quote, double quote, or "best fit") anyway.

jasonkres commented 9 years ago

and technically it only needs to be encoded if it is ambiguous

Of course, if the document is XHTML, then there is no flexibility on ampersands. In the case of working with the XML serialization of HTML5, you can't even tell from the doctype whether it is supposed to be XML, so I guess if you have <!DOCTYPE html> you have to either play safe: look for the http://www.w3.org/1999/xhtml namespace, or provide options for the user to tell you want they want to do. So flexibility on this is good...

schmalls commented 9 years ago

There are three characters that have to be escaped for attributes in XHTML it looks like from the XML specification. They are <, &, and ".

jasonkres commented 9 years ago

Yes, escaping < is actually a must in XML spec; & and whichever of " or ' is not the delimiter, (of course.) The relevant production from the spec is (where ^ indicates negation):

AttValue ::= '"' ([^<&"] | Reference)* '"'  
             | "'" ([^<&'] | Reference)* "'"