jamietre / CsQuery

CsQuery is a complete CSS selector engine, HTML parser, and jQuery port for C# and .NET 4.

CQ chokes when xml declaration is missing encoding attribute #165

Open asinning opened 10 years ago

asinning commented 10 years ago

When CsQuery tries to parse this xml using

CQ dom = xml;

<?xml version="1.0"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
    <rootfiles>
        <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
    </rootfiles>
</container>

I get the following error:

System.NullReferenceException: Object reference not set to an instance of an object.
Result StackTrace:
    at CsQuery.HtmlParser.ElementFactory.Parse(Stream inputStream, Encoding encoding)
    at CsQuery.HtmlParser.ElementFactory.Create(Stream html, Encoding streamEncoding, HtmlParsingMode parsingMode, HtmlParsingOptions parsingOptions, DocType docType)
    at CsQuery.CQ.CreateNew(CQ target, Stream html, Encoding encoding, HtmlParsingMode parsingMode, HtmlParsingOptions parsingOptions, DocType docType)
    at CsQuery.CQ..ctor(String html, HtmlParsingMode parsingMode, HtmlParsingOptions parsingOptions, DocType docType)
    at CsQuery.CQ.op_Implicit(String html)

I can eliminate the error by changing the xml declaration to include an encoding attribute:

<?xml version="1.0" encoding="UTF-8"?>

Thanks!

jamietre commented 10 years ago

In all honesty, I haven't spent a lot of time trying to make CsQuery work as a general-purpose XML parser. It may work for some XML, such as XHTML, but because XHTML is only a subset of XML, generic XML may not be handled properly in all cases.

asinning commented 10 years ago

I've written the following wrapper to fix the problem. It could stand to be made more robust.

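    // Requires: using System; using CsQuery;
    private CQ GetCQ(string xml)
    {
        // Trim leading whitespace so the declaration check below is reliable.
        xml = xml.TrimStart();
        if (xml.StartsWith("<?xml", StringComparison.Ordinal))
        {
            // Find the end of the declaration; skip patching if it is never closed.
            int end = xml.IndexOf("?>", StringComparison.Ordinal);
            if (end > 0)
            {
                var declaration = xml.Substring(0, end);
                // Add a default encoding only when the declaration has none.
                if (declaration.IndexOf("encoding", StringComparison.Ordinal) == -1)
                {
                    xml = declaration + " encoding=\"UTF-8\"" + xml.Substring(end);
                }
            }
        }
        return new CQ(xml);
    }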
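
For anyone who wants something a little more defensive, here is a rough sketch of the same workaround as a pure string helper. The class name `XmlDeclarationFixer` and the method `EnsureEncoding` are made up for illustration and are not part of CsQuery. It assumes UTF-8 is an acceptable default, and it deliberately leaves the input alone when the declaration cannot be matched (for example, an attribute value containing `?`).

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of CsQuery.
public static class XmlDeclarationFixer
{
    // Matches a leading XML declaration such as <?xml version="1.0"?> and
    // captures its attribute text. Requiring whitespace after "<?xml" keeps
    // other processing instructions like <?xml-stylesheet ...?> from matching.
    private static readonly Regex Declaration =
        new Regex(@"^<\?xml(?<attrs>\s[^?]*)\?>", RegexOptions.CultureInvariant);

    // Returns the input with leading whitespace removed and, if the XML
    // declaration has no encoding attribute, with encoding="UTF-8" added.
    public static string EnsureEncoding(string xml)
    {
        if (xml == null) throw new ArgumentNullException("xml");

        string trimmed = xml.TrimStart();
        Match match = Declaration.Match(trimmed);
        if (!match.Success) return trimmed; // no declaration: nothing to patch

        string attrs = match.Groups["attrs"].Value;
        if (Regex.IsMatch(attrs, @"\bencoding\s*=")) return trimmed; // already set

        // Rebuild the declaration with the assumed default encoding appended.
        string patched = "<?xml" + attrs.TrimEnd() + " encoding=\"UTF-8\"?>";
        return patched + trimmed.Substring(match.Length);
    }
}
```

With a helper like this, the call from the original report would become `CQ dom = XmlDeclarationFixer.EnsureEncoding(xml);`, which still goes through the implicit string conversion shown in the stack trace.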