Whitespace Html results in malformed result

picasso566 commented 2 years ago

First of all, thank you for the simple, straightforward tool!

I have an editor in a page and I'm validating the html on an api endpoint. I copy some markup from a page and paste it into the editor and HtmlSanitizer does a wonderful job at closing voids and removing unwanted attribs, tags, elements...

I ran into an issue where the HTML I am copying contains whitespace characters and the formatter outputs some other character in their place (I don't know what it is and I haven't had time to track this down in the source). Example below.

I probably need to create my own formatter, using your default as a starting point: https://github.com/mganss/HtmlSanitizer/blob/61008c6d0e492e641510726da881ee0c9577c305/src/HtmlSanitizer/HtmlFormatter.cs How can I make it ignore these characters? I need to preserve them.

TIA

Example:

From this page: https://marketplace.visualstudio.com/items?itemName=MadsKristensen.WebCompiler I copy the text <p>A NuGet package will be installed into the <code>packages</code> folder without adding any files to the project itself. The NuGet package contains an MSBuild task that will run the exact same compilers on the <code>compilerconfig.json</code> file in the root of the project.</p>

After parsing with the sanitizer: <p>A NuGet package will be installed into the�<code>packages</code>�folder without adding any files to the project itself. The NuGet package contains an MSBuild task that will run the exact same compilers on the�<code>compilerconfig.json</code>�file in the root of the project.</p>
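To find out which character is actually involved, it can help to print the code points of everything outside plain ASCII. The following is a minimal sketch, not part of HtmlSanitizer: `listNonAscii` is a hypothetical helper, and the sample strings assume the pasted text contains a non-breaking space (U+00A0), which is a guess at this point in the thread.

```javascript
// Hypothetical helper (not part of HtmlSanitizer): list every non-ASCII
// character in a string with its UTF-16 offset and Unicode code point.
function listNonAscii(text) {
  const found = [];
  let index = 0;
  // for...of iterates by code point, so surrogate pairs stay together.
  for (const ch of text) {
    const codePoint = ch.codePointAt(0);
    if (codePoint > 0x7f) {
      found.push({
        index,
        codePoint: 'U+' + codePoint.toString(16).toUpperCase().padStart(4, '0'),
      });
    }
    index += ch.length;
  }
  return found;
}

// Assumed sample input: a non-breaking space before and after the <code> element.
const pasted = 'into the\u00A0<code>packages</code>\u00A0folder';
// The same fragment with the replacement character in those positions.
const garbled = 'into the\uFFFD<code>packages</code>\uFFFDfolder';

console.log(listNonAscii(pasted));
console.log(listNonAscii(garbled));
```

Running this on the text at each stage (in the editor, in the request body, before and after sanitizing) shows where the character first changes.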

picasso566 commented 2 years ago

I did not look deep enough. It looks like the issue is in AngleSharp. Because at its core it's an XML parser, there are characters I want to allow that are illegal in XML. All over AngleSharp are character replacements for illegal characters, such as this constant in Symbols.cs:

        public const Char Replacement = (Char)0xfffd;

On the one hand, I agree that in a pure sense we may want to sanitize these characters, but because the input is being parsed as XML it's difficult to add an option to ignore them as well. Also, why replace a character found outside a node with the "replacement character" symbol? Why not have an option to strip all of them instead?

There are also other things I will try such as wrapping the root node in a wrapper div and then removing it after, or wrapping spaces in spans (or something?). I will leave this open until I find something that works and then close it.

If anyone else has a better idea (or an alternate parser that does not use an XML parser), please let me know.

Thanks again for the utility!

picasso566 commented 2 years ago

It seems that once again, the meaning of my life is to serve as a warning to others. I don't know whether to delete this thread or leave it; I don't know if it's actually helpful. The issue had nothing to do with the library. Somewhere between escaping the HTML fragment and .NET parsing it into an object, the whitespace character had already been turned into the Unicode "replacement character."

To handle this for now, I am just stripping all of the whitespace characters (except a normal space) until I figure out how to handle it properly.

input = input.replace(/[\u00A0\u1680\u180e\u2000-\u2009\u200a\u200b\u202f\u205f\u3000]/g, '');
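Stripping these characters outright joins the neighbouring words (for example `the<code>packages</code>folder`). A gentler variant, sketched below, swaps each one for a regular space and drops only the zero-width space. `normalizeWhitespace` is a hypothetical name, and the character list is simply the one from the line above, not a complete list of Unicode whitespace.

```javascript
// Same characters as the stripping regex above, minus U+200B,
// which has no width and is removed instead of replaced.
const SPACE_LIKE = /[\u00A0\u1680\u180e\u2000-\u2009\u200a\u202f\u205f\u3000]/g;
const ZERO_WIDTH = /\u200b/g;

// Hypothetical helper: keep word separation by mapping space-like
// characters to a normal space rather than deleting them.
function normalizeWhitespace(input) {
  return input.replace(ZERO_WIDTH, '').replace(SPACE_LIKE, ' ');
}

// Assumed sample: non-breaking spaces around the <code> element.
const sample = 'into the\u00A0<code>packages</code>\u00A0folder';
console.log(normalizeWhitespace(sample));
```

This still loses the distinction between a normal and a non-breaking space, so it is a stopgap in the same spirit as the original line, not a fix for the underlying encoding problem.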

tiesont commented 2 years ago

@picasso566 Since AngleSharp is (unfortunately) a fairly hard dependency of HtmlSanitizer, it is worth noting when something odd comes up, even if it winds up being (like you found out) an AngleSharp issue.

picasso566 commented 2 years ago

In this particular case, it wasn't even AngleSharp.

The .NET MVC object parser was converting this: <div><p>Something something this that <code>Some core sample</code></p></div> to this: <div><p>Something something this that�<code>Some core sample</code></p></div>

This was before calling the HtmlSanitizer

Or I should say, the escaped version of the first line was parsed as the second. I didn't have time to figure it out so I hacked a solution for now. Thanks for your help.
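The thread does not say how the fragment was escaped, so the following is only one plausible way to get this result, offered as an assumption. JavaScript's legacy `escape()` encodes a non-breaking space as the single byte `%A0`, while `encodeURIComponent()` encodes it as the UTF-8 sequence `%C2%A0`. A lone `0xA0` byte is not valid UTF-8, so a UTF-8 decoder substitutes U+FFFD. `decodePercentBytesAsUtf8` is a hypothetical helper that stands in for the server-side decoding step.

```javascript
const nbsp = '\u00A0';

const legacyEscaped = escape(nbsp);           // '%A0' (Latin-1 style, deprecated)
const utf8Escaped = encodeURIComponent(nbsp); // '%C2%A0' (UTF-8 bytes)

// Hypothetical stand-in for a server that decodes percent-encoded bytes as UTF-8.
// Only handles input made up entirely of %XX sequences.
function decodePercentBytesAsUtf8(encoded) {
  const bytes = encoded
    .split('%')
    .filter(Boolean)
    .map((hex) => parseInt(hex, 16));
  // Non-fatal TextDecoder replaces invalid byte sequences with U+FFFD.
  return new TextDecoder('utf-8').decode(new Uint8Array(bytes));
}

console.log(decodePercentBytesAsUtf8(legacyEscaped) === '\uFFFD'); // true
console.log(decodePercentBytesAsUtf8(utf8Escaped) === '\u00A0');   // true
```

If the client code turns out to use `escape()`, switching to `encodeURIComponent()` would be the first thing to try; if it does not, comparing the raw request bytes with what the server receives should narrow it down.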

mganss / HtmlSanitizer

Whitespace Html results in malformed result #393