jamietre / CsQuery

CsQuery is a complete CSS selector engine, HTML parser, and jQuery port for C# and .NET 4.

Utf8 double encoding problem - ü rendered as Ã¼ #178

Open firepol opened 9 years ago

firepol commented 9 years ago

Hi, I've got a headache after spending several hours investigating this issue.

I'm trying to render this URL: http://airolo.ch/impianti/details.php?lang=ita&season=winter

Like this:

var initialHtml = CQ.CreateFromUrl("http://airolo.ch/impianti/details.php?lang=ita&season=winter");
var cssTarget = initialHtml[".container"];
string cssResult = cssTarget.FirstOrDefault().Render();

In the cssResult string I expected to get "Pesciüm" encoded like this:

Pesciüm

What I get, instead, is:

PesciÃ¼m

I tried also:

string cssResult = cssTarget.FirstOrDefault().Render(OutputFormatters.HtmlEncodingNone);

In the cssResult string I expected to get "Pesciüm" (as in the original file), but what I get instead is "PesciÃ¼m".

I think that CsQuery is double-encoding UTF-8. The same problem is described in this blog post: http://www.bardecode.com/en1/double-encoded-utf-8-strings-in-c/
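To make the suspected failure mode concrete, here is a minimal sketch of how this kind of garbling arises in general: UTF-8 bytes get decoded with a single-byte encoding. It assumes ISO-8859-1 as the wrong encoding, and the class and method names are hypothetical. It says nothing about where in CsQuery (or in the server's response) the mix-up actually happens.

```csharp
using System;
using System.Text;

static class DoubleEncodingDemo
{
    // Reproduces the garbling by decoding the UTF-8 bytes of `text`
    // as ISO-8859-1 (assumed here as the wrongly chosen encoding).
    public static string Misdecode(string text)
    {
        // "ü" (U+00FC) becomes the two bytes 0xC3 0xBC in UTF-8.
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);

        // ISO-8859-1 maps every byte to one character:
        // 0xC3 -> 'Ã' (U+00C3), 0xBC -> '¼' (U+00BC).
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");
        return latin1.GetString(utf8Bytes);
    }
}
```

Calling `DoubleEncodingDemo.Misdecode("Pesci\u00FCm")` yields the string "Pesci\u00C3\u00BCm", which is the "PesciÃ¼m" form reported above.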

I also tried a URL from a German website that is full of words with umlauts, but there I don't see the same problem. So maybe in this case the encoding is not detected properly? Is there a proper way to deal with such cases automatically (without knowing the encoding of the original website)?
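One possible way to take encoding detection out of the equation is to download the raw bytes yourself, decode them explicitly, and hand the resulting string to the parser. The sketch below is only an assumption about a workable approach, not a confirmed fix for this issue: `HtmlDecoding.Decode` is a hypothetical helper that uses the charset from the Content-Type header when one is present and otherwise falls back to UTF-8 (the fallback is an arbitrary choice).

```csharp
using System;
using System.Net.Mime;
using System.Text;

static class HtmlDecoding
{
    // Decodes a response body using the charset named in the Content-Type
    // header value; falls back to UTF-8 (an assumption) when the header is
    // missing, unparseable, or names an unknown encoding.
    public static string Decode(byte[] body, string contentTypeHeader)
    {
        Encoding encoding = Encoding.UTF8;
        if (!string.IsNullOrWhiteSpace(contentTypeHeader))
        {
            try
            {
                string charset = new ContentType(contentTypeHeader).CharSet;
                if (!string.IsNullOrEmpty(charset))
                {
                    encoding = Encoding.GetEncoding(charset);
                }
            }
            catch (FormatException) { /* malformed header: keep fallback */ }
            catch (ArgumentException) { /* unknown charset: keep fallback */ }
        }
        return encoding.GetString(body);
    }
}
```

The bytes could come from, for example, `WebClient.DownloadData(url)`, and the decoded string could then be passed to `CQ.Create(html)` instead of calling `CQ.CreateFromUrl`. Note that this does not inspect a `<meta charset>` declaration inside the page, so a page whose header and content disagree would still need manual handling.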

I tried the method suggested in the blog post, but it causes other problems (non-breaking spaces are converted to strange characters).
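The side effect with non-breaking spaces is what one would expect from an unconditional "re-encode as single-byte, decode as UTF-8" repair: a genuine U+00A0 becomes the lone byte 0xA0, which is not valid UTF-8 on its own. A guarded variant can refuse to touch strings that do not round-trip cleanly. The sketch below assumes ISO-8859-1 was the wrongly applied encoding; `MojibakeRepair.TryRepair` is a hypothetical helper, not the blog's code or part of CsQuery.

```csharp
using System;
using System.Text;

static class MojibakeRepair
{
    // Tries to undo double encoding. Returns the input unchanged when it
    // cannot be interpreted as UTF-8 bytes that were misread as ISO-8859-1.
    public static string TryRepair(string text)
    {
        // Both encodings are configured to throw rather than silently
        // substitute '?' or U+FFFD for data they cannot handle.
        Encoding latin1 = Encoding.GetEncoding(
            "ISO-8859-1",
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);
        Encoding strictUtf8 = new UTF8Encoding(false, true);

        try
        {
            byte[] bytes = latin1.GetBytes(text);
            return strictUtf8.GetString(bytes);
        }
        catch (EncoderFallbackException)
        {
            return text; // contains characters outside ISO-8859-1
        }
        catch (DecoderFallbackException)
        {
            return text; // bytes are not valid UTF-8, so not double-encoded
        }
    }
}
```

The check is all-or-nothing: a string that mixes double-encoded characters with a real non-breaking space is left as it is, so applying it to smaller pieces (for example, per text node) may give better results than applying it to a whole page.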

I've seen that CsQuery allows a custom OutputFormatter implementation, but maybe you already have a solution for this?

I'm not sure if this is a CsQuery bug or another problem... I'd really appreciate it if you could help, thank you.

rufanov commented 9 years ago

Can't reproduce your problem. Is it fixed already? For me, the output of your code contains Pesciüm as expected, not PesciÃ¼m.