jamietre / CsQuery

CsQuery is a complete CSS selector engine, HTML parser, and jQuery port for C# and .NET 4.

System.InvalidOperationException: The character set encoding changed twice, something seems to be wrong. #87

Closed sjdirect closed 11 years ago

sjdirect commented 11 years ago

I'm crawling millions of sites and CsQuery is blowing up on a lot of them. I understand why, based on the exception message, but it happens so often that I was hoping you would consider just using the last encoding instead of throwing.

The code:

CQ csQueryObject;
try
{
    csQueryObject = CQ.Create(RawContent);
}
catch (Exception e)
{
    csQueryObject = CQ.Create("");

    _logger.ErrorFormat("Error occurred while loading CsQuery object for Url [{0}]", Uri);
    _logger.Error(e);
}

Example Error 1:

[2013-02-19 17:44:22,764] [3898] [ERROR] - Error occurred while loading CsQuery object for Url [http://1000carats.net/Changement couleur.htm] - [Abot.Poco.CrawledPage]
[2013-02-19 17:44:22,795] [3898] [ERROR] - System.InvalidOperationException: The character set encoding changed twice, something seems to be wrong.
   at CsQuery.HtmlParser.ElementFactory.Parse(Stream html, Encoding encoding)
   at CsQuery.CQ..ctor(String html, HtmlParsingMode parsingMode, HtmlParsingOptions parsingOptions, DocType docType)
   at Abot.Poco.CrawledPage.InitializeCsQueryDocument() - [Abot.Poco.CrawledPage]

Example Error 2:

[2013-02-19 17:41:28,260] [491] [ERROR] - Error occurred while loading CsQuery object for Url [http://abalancedbodymassageinc.com/] - [Abot.Poco.CrawledPage]
[2013-02-19 17:41:28,260] [491] [ERROR] - System.InvalidOperationException: The character set encoding changed twice, something seems to be wrong.
   at CsQuery.HtmlParser.ElementFactory.Parse(Stream html, Encoding encoding)
   at CsQuery.CQ..ctor(String html, HtmlParsingMode parsingMode, HtmlParsingOptions parsingOptions, DocType docType)
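
To make the "just use the last encoding" request concrete, here is a minimal sketch of that policy in isolation. It is not CsQuery's code and does not touch CsQuery's parser; `EncodingFallback` and `PickLastDeclared` are hypothetical names, and the list of charset names stands in for whatever a parser might collect from HTTP headers and `<meta>` tags. It only relies on `System.Text.Encoding.GetEncoding`, which throws `ArgumentException` for names it does not recognize.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class EncodingFallback
{
    // Hypothetical helper (not part of CsQuery): walks the declared charset
    // names in document order and keeps the last one that .NET recognizes.
    // If none is recognized, the caller-supplied fallback is returned.
    public static Encoding PickLastDeclared(IEnumerable<string> declaredCharsets, Encoding fallback)
    {
        Encoding chosen = fallback;
        foreach (string name in declaredCharsets)
        {
            if (string.IsNullOrWhiteSpace(name))
            {
                continue; // empty declaration: ignore it
            }
            try
            {
                chosen = Encoding.GetEncoding(name.Trim());
            }
            catch (ArgumentException)
            {
                // Unknown or unsupported charset name: keep the previous
                // choice instead of throwing.
            }
        }
        return chosen;
    }

    public static void Main()
    {
        // A page that declares its charset more than once, including a bad value.
        var declared = new[] { "iso-8859-1", "bogus-charset", "utf-8" };
        Encoding result = PickLastDeclared(declared, Encoding.UTF8);
        Console.WriteLine(result.WebName);
    }
}
```

The point of the sketch is only that a second or third charset declaration overrides the earlier one rather than aborting the parse, so a crawler would get a best-effort document instead of an exception.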

jamietre commented 11 years ago

I agree, and this was supposed to have been fixed in 1.3.3 (if I remember correctly). Are you using the most recent release?

sjdirect commented 11 years ago

I'm using 1.3.3.5. I initially got the binary from NuGet, I believe.

jamietre commented 11 years ago

Yep, update. That exception was a bad decision, based on an overly literal interpretation of the spec, from when I was addressing problems with character set handling. It has since been undone (it must have been in 1.3.4).

jamietre commented 11 years ago

I take that back. That godforsaken exception is still in the source... grr..

jamietre commented 11 years ago

OK

NuGet version 1.3.3 = build 1.3.3.5 (what you are using)
NuGet version 1.3.4 = build 1.3.3.249

So you are one version behind. I tested the page that was failing for you, and it works for me with 1.3.4, so I think this has already been resolved.

The exception is still in the source but I think it's an unreachable path so it's just an artifact. But I'm going to make sure it gets cleaned up before 1.3.5 anyway. It definitely shouldn't be there.

sjdirect commented 11 years ago

That did it, thanks!