jamietre / CsQuery

CsQuery is a complete CSS selector engine, HTML parser, and jQuery port for C# and .NET 4.

XML document DocType not set correctly when using CreateFromUrl #155

Closed greg84 closed 10 years ago

greg84 commented 10 years ago

I'm trying to load an XML document for parsing with CsQuery. Part of it has the following markup:

<link>
  /here-is/a-link
</link>

The <link> tag is a void element in HTML5 (it can't have content), so I want this document to be parsed as XML, where I assume the following should return the text between the tags:

cq.Select("link").Text()

...but this returns an empty string.

I load the document using this method (this is the actual URL I'm loading so you can see the XML document I'm working with):

var cq = CQ.CreateFromUrl("https://www.barnsley.gov.uk/news-and-events/news/rss");

After the document has been loaded, the value of cq.Document.DocType is HTML5. Shouldn't it be something else, because it's an XML response? Is it an issue with CsQuery or with the web site? I've read the page about character encoding but can't see why this isn't working.

greg84 commented 10 years ago

I'm starting to wonder whether I should be using CsQuery for parsing XML! Don't think it was designed to do that. Would be nice if I could get it to work though.

I've implemented a "dirty hack", which replaces tags that don't allow content in HTML with tags that do before I create the CQ instance. So for example, <link> becomes <div data-xml="link"> - it works but it's a bit filthy.
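For anyone wanting to try the same workaround, a minimal sketch of that tag rewrite might look like the following. The names XmlTagHack and RewriteTag are made up for illustration and are not part of CsQuery, and the regular expressions are an assumption about how the replacement could be done: they only handle plain opening and closing tags such as <link> and </link>, not self-closing forms like <link/>.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, not part of CsQuery: rewrites a tag that HTML treats
// as void (for example <link>) into a <div> that is allowed to hold content.
public static class XmlTagHack
{
    public static string RewriteTag(string markup, string tag)
    {
        string name = Regex.Escape(tag);

        // "<link" followed by whitespace or ">" becomes "<div data-xml="link"".
        // The lookahead keeps longer names such as <linked> untouched, and
        // prefixed names such as <atom:link> never match because "<" must
        // come immediately before the tag name.
        string opened = Regex.Replace(
            markup,
            "<" + name + @"(?=[\s>])",
            "<div data-xml=\"" + tag + "\"");

        // "</link>" becomes "</div>".
        return Regex.Replace(opened, "</" + name + @"\s*>", "</div>");
    }
}
```

The rewritten string could then be handed to CQ.Create instead of calling CQ.CreateFromUrl, and the original element queried with something like cq.Select("div[data-xml='link']").Text(). That means downloading the feed as a string first, by whatever means is convenient.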

jamietre commented 10 years ago

True - it is not really designed for parsing XML. The outcome here is driven by HtmlParserSharp (a C# port of the validator.nu HTML parser), which is a true HTML parser and follows the spec rules. Under those rules <link> is a void element, so it can never contain text.

There are a couple of other threads in the issues here about HTML parsing. Some things could be done to make it work right, or at least better, but I haven't had time to look into them.