milkshakesoftware / PreMailer.Net

C# library that moves your stylesheets to inline style attributes, for maximum compatibility with E-mail clients.
http://milkshakesoftware.github.com/PreMailer.Net/
MIT License

MoveCssInline encodes non-ASCII characters even when they should be valid HTML #193

Open CaptainStack opened 4 years ago

CaptainStack commented 4 years ago

I am seeing multiple variants of this issue and it is often treated as a closed or non-issue, but it is currently completely blocking my work and I need a fix or a workaround. I am using PreMailer.Net version 2.0.1.0 on Windows 10 in a C#/ASP.NET project.

As in this issue, MoveCssInline is changing characters like '&' in my URLs. For example:

<a href="http://www.website.com/page?param1=a&param2=b"></a>

Is changed to:

<a href="http://www.website.com/page?param1=a&amp;param2=b"></a>

Most of the URLs I work with contain ampersands because we use form codes and several other query parameters. I need to inline CSS into the HTML, but I do not have control over the URLs in the document and I am not allowed to change them.

One of the responses on the issue I linked earlier points out that properly encoding attribute values is part of the HTML specification, and that the output is therefore correct.
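To make that response concrete, here is a minimal sketch of the decoding step. It assumes that `WebUtility.HtmlDecode` from the .NET base library is a fair stand-in for the entity decoding an HTML parser applies to attribute values; it does not call PreMailer.Net or AngleSharp, and the `EntityDemo` class and `DecodeAttribute` helper are hypothetical names used only for illustration.

```csharp
using System;
using System.Net;

public static class EntityDemo
{
    // Hypothetical helper: decodes HTML character references in an attribute value,
    // roughly as a parser would when reading the href.
    public static string DecodeAttribute(string attributeValue)
    {
        return WebUtility.HtmlDecode(attributeValue);
    }

    public static void Main()
    {
        // href as written in the input document.
        string original = "http://www.website.com/page?param1=a&param2=b";
        // href as it appears in the MoveCssInline output shown above.
        string encoded = "http://www.website.com/page?param1=a&amp;param2=b";

        string decoded = DecodeAttribute(encoded);

        // Compares the decoded output value with the original URL.
        Console.WriteLine(decoded == original); // prints "True"
    }
}
```

Under that assumption, a consumer that parses the output as HTML sees the same link target. A consumer that reads the href as raw text, without decoding entities, sees the `&amp;` form instead.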

But PreMailer.Net is not an HTML validation or sanitization utility - it is a CSS inliner and should not have other side effects on the document if possible.

Additionally, I have tested further and found that this encoding is not limited to attributes like href. It also encodes text/InnerHTML values, which are valid HTML without encoding. Example:

<p>&</p>

This is valid HTML and should not be encoded, but PreMailer.Net will change this to:

<p>&amp;</p>

I am desperate for a fix or workaround, please help. I have also looked at the following issues for help:

Update

After a bit more digging, I found this issue, which suggests the behavior comes from a PreMailer.Net dependency called AngleSharp, which parses the HTML document. When it re-outputs the HTML, it runs a function called EscapeText that escapes these characters. According to this issue, this is by design, as it is in line with the HTML spec.

However, I think this is still an issue for PreMailer.Net and even AngleSharp, which should not be making these changes to input HTML unless requested/specified by the caller.

Update 2

I have been working with the AngleSharp folks and believe I will soon be able to send a PR that adds an option to suppress this encoding behavior by passing a custom formatter.

DarthSonic commented 1 month ago

This issue still exists. I have the same problem. My URLs become invalid because of it!