jamietre / CsQuery

CsQuery is a complete CSS selector engine, HTML parser, and jQuery port for C# and .NET 4.

.Render() produces the output in numeric character references #152

Closed: vorou closed this issue 10 years ago

vorou commented 10 years ago

Here's the code:

var page = CQ.CreateFromUrl("http://ru.wikipedia.org/wiki/%D0%9A%D0%BE%D0%B4%D1%8B_%D1%81%D1%83%D0%B1%D1%8A%D0%B5%D0%BA%D1%82%D0%BE%D0%B2_%D0%A0%D0%BE%D1%81%D1%81%D0%B8%D0%B9%D1%81%D0%BA%D0%BE%D0%B9_%D0%A4%D0%B5%D0%B4%D0%B5%D1%80%D0%B0%D1%86%D0%B8%D0%B8");
Console.Out.WriteLine(page.Render());

The page is in UTF-8, and CreateFromUrl detects that correctly. However, here's a fragment of the output:

...
<title>&#1050;&#1086;&#1076;&#1099; &#1089;&#1091;&#1073;&#1098;&#1077;&#1082;&#1090;&#1086;&#1074; &#1056;&#1086;&#1089;&#1089;&#1080;&#1081;&#1089;&#1082;&#1086;&#1081; &#1060;&#1077;&#1076;&#1077;&#1088;&#1072;&#1094;&#1080;&#1080; &#8212; &#1042;&#1080;&#1082;&#1080;&#1087;&#1077;&#1076;&#1080;&#1103;</title>
<meta http-equiv="X-UA-Compatible" content="IE=EDGE">
<meta name="generator" content="MediaWiki 1.23wmf17">
<link rel="alternate" type="application/x-wiki" title="Править" ...

Notice that the Russian text looks fine in the title attribute (last line), but the text of the title element is rendered as numeric character references.

Am I doing something wrong?

jamietre commented 10 years ago

By default, Render encodes non-ASCII characters, but you can easily change this, e.g.:

Console.Out.WriteLine(page.Render(OutputFormatters.HtmlEncodingMinimum));

A detailed explanation of the OutputFormatter object is here:

https://github.com/jamietre/CsQuery/blob/master/documentation/render.md
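
For readers wondering whether the default output loses anything: numeric character references are just another spelling of the same characters. The sketch below is not part of CsQuery. It uses the .NET standard library call System.Net.WebUtility.HtmlDecode to decode the first word of the title from the fragment above, and the class and method names (NumericReferenceDemo, Decode) are made up for illustration.

```csharp
using System;
using System.Net;
using System.Text;

// Hypothetical demo class; not part of CsQuery.
static class NumericReferenceDemo
{
    // Turns numeric character references such as "&#1050;" back into characters.
    public static string Decode(string html)
    {
        return WebUtility.HtmlDecode(html);
    }

    static void Main()
    {
        // Ask the console to emit UTF-8 so Cyrillic text can be displayed.
        Console.OutputEncoding = Encoding.UTF8;

        // First word of the <title> text from the rendered fragment.
        string encoded = "&#1050;&#1086;&#1076;&#1099;";
        string decoded = Decode(encoded);

        Console.WriteLine(decoded);

        // The decoded string equals the original Cyrillic word (U+041A U+043E U+0434 U+044B).
        Console.WriteLine(decoded == "\u041A\u043E\u0434\u044B");
    }
}
```

A browser displays both forms identically, so the choice of output formatter mainly affects how readable the raw markup is and how large it gets.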

vorou commented 10 years ago

Oh, thank you, I really should have looked for it harder.