jamietre / CsQuery

CsQuery is a complete CSS selector engine, HTML parser, and jQuery port for C# and .NET 4.
Other
1.16k stars 250 forks source link

Occasional System.ArgumentOutOfRangeException .Render() #164

Open wjchristenson2 opened 9 years ago

wjchristenson2 commented 9 years ago

I'm getting the below error occasionally. It's random which is strange. It may work 95% of the time and then fail out of nowhere...

var divPricing = this.MyCQ["div.info div.pricing"].FirstOrDefault();
if (divPricing != null)
{
    CQ divPricingCQ = divPricing.Render();
}

Details: System.ArgumentOutOfRangeException: Index was out of range. Must be non-negative and less than the size of the collection. Parameter name: index at CsQuery.Implementation.DomElement.get_ClassName() at CsQuery.Implementation.DomElement.d__38.MoveNext() at CsQuery.Output.FormatDefault.RenderElementInternal(IDomObject element, TextWriter writer, Boolean includeChildren) at CsQuery.Output.FormatDefault.RenderStack(TextWriter writer) at CsQuery.Output.FormatDefault.Render(IDomObject node)

jamietre commented 9 years ago

Can you try using the prerelease package on Nuget if you are not now? This appears to be an obscure bug that was fixed some time ago but never made it to release. I will make an effort to finally get the current package into release soon. On Oct 6, 2014 10:14 AM, "wjchristenson2" notifications@github.com wrote:

I'm getting the below error occasionally. It's random which is strange. It may work 95% of the time and then fail out of nowhere...

var divPricing = this.MyCQ["div.info div.pricing"].FirstOrDefault(); if (divPricing != null) { CQ divPricingCQ = divPricing.Render(); }

Details: System.ArgumentOutOfRangeException: Index was out of range. Must be non-negative and less than the size of the collection. Parameter name: index at CsQuery.Implementation.DomElement.get_ClassName() at CsQuery.Implementation.DomElement.d__38.MoveNext() at CsQuery.Output.FormatDefault.RenderElementInternal(IDomObject element, TextWriter writer, Boolean includeChildren) at CsQuery.Output.FormatDefault.RenderStack(TextWriter writer) at CsQuery.Output.FormatDefault.Render(IDomObject node)

— Reply to this email directly or view it on GitHub https://github.com/jamietre/CsQuery/issues/164.

wjchristenson2 commented 9 years ago

Thanks. I'll use 1.3.4 for now and will wait for 1.3.5 to be released. It's good to know I can bounce to the pre-release version which should fix the issue if it is escalated.

wjchristenson2 commented 9 years ago

FYI - This just occurred again tonight running on CsQuery 1.3.5-beta5 (Prerelease).

Details: System.ArgumentOutOfRangeException: Index was out of range. Must be non-negative and less than the size of the collection. Parameter name: index at CsQuery.Implementation.DomElement.get_ClassName() at CsQuery.Implementation.DomElement.d__35.MoveNext() at CsQuery.Output.FormatDefault.RenderElementInternal(IDomObject element, TextWriter writer, Boolean includeChildren) at CsQuery.Output.FormatDefault.RenderStack(TextWriter writer) at CsQuery.Output.FormatDefault.Render(IDomObject node)

jamietre commented 9 years ago

My bad - I didn't look at the stack trace carefully when you posted this, this is not the same old bug I had been thinking of that was fixed in the prerelease. I will look into this today.

wjchristenson2 commented 9 years ago

Have you had a chance to look into this? We are getting this error daily several times a day. Let me know if you need anything from me.

Thanks!

jamietre commented 9 years ago

Sorry I didn't look at it yet... I am looking at the relevant code right now and see nothing obvious, in normal use, though I do see that this method could be susceptible to thread safety issues. That is, if the CSS classes of an element were modified at the same time as it was being rendered, this error could occur. Is this scenario possible in the way you're using it?

If this is a thread safety problem try this change to Implementation/DomElement.cs

public override string ClassName
    {
        get
        {
            if (HasClasses)
            {
                string className = "";
                lock (((ICollection)_Classes).SyncRoot)
                {
                    foreach (ushort clsId in _Classes)
                    {
                        className += (className == "" ? "" : " ") + HtmlData.TokenName(clsId);
                    }
                }
                return className;
            }
            else
            {
                return "";
            }
        }
        set
        {
            SetClassName(value);
        }
    }

.. it will probably be hard for me to repro your error so the best way to figure out this is for you to compile your own -- if this presents any problem for you I can send you a DLL

wjchristenson2 commented 9 years ago

We are definitely using CsQuery on multiple threads, although each instance of CsQuery is created and used on one thread. So, if there are static variables/etc that are not thread safe then that could be a cause.

We are not altering any of the DOM. We are simply extracting relevant information (read-only).

If this error occurs, then we take the HTML page in question and requeue it for processing later and then it typically goes through fine w/o exception a subsequent time. Its very intermittent, but definitely happens frequently.

Not to confuse the matter, but here's another error we just got.

Details: System.NullReferenceException: Object reference not set to an instance of an object. at CsQuery.Implementation.AttributeCollection.d0.MoveNext() at CsQuery.Implementation.DomElement.d35.MoveNext() at CsQuery.Output.FormatDefault.RenderElementInternal(IDomObject element, TextWriter writer, Boolean includeChildren) at CsQuery.Output.FormatDefault.RenderStack(TextWriter writer) at CsQuery.Implementation.DomElement.get_InnerHTML()

If you wanted us to test changes on our systems for awhile, sending the DLL would probably work best.

Thanks for your help.

jamietre commented 9 years ago

CsQuery is fine for use in a multithreaded environment generally - all static data is managed using threadsafe types - but individual instances of a CQ object are definitely not. Both of these errors are occurring while iterating over internal collections of their respective objects, which is why I suspect some kind of threading problem.

This is a long shot, but are you using weak references to CQ objects in your app?

We can try a build with some thread protection on the low level access methods (though there could be a lot of them) to see if it changes things but want to be sure there's nothing unusual architecturally.

wjchristenson2 commented 9 years ago

We are not using weak references to CQ objects. Sounds good on the thread protection.

wjchristenson2 commented 9 years ago

Here's another exception that works fine 99% of the time then just blew up a minute ago:

        var ulDetails = myCQ["ul.details"];
        if (ulDetails != null)
        {
            CQ ulDetailsCQ = ulDetails.RenderSelection();

            var span = ulDetailsCQ["li.stockNumber span.value"].FirstOrDefault();
        }

Details: System.NullReferenceException: Object reference not set to an instance of an object. at CsQuery.Implementation.AttributeCollection.d0.MoveNext() at CsQuery.Implementation.DomElement.d35.MoveNext() at CsQuery.Output.FormatDefault.RenderElementInternal(IDomObject element, TextWriter writer, Boolean includeChildren) at CsQuery.Output.FormatDefault.RenderStack(TextWriter writer) at CsQuery.CQ.RenderSelection(IOutputFormatter outputFormatter, StringWriter writer)

Here's the HTML:

<ul class="details">
    <li class="bodyStyle">
        <strong class="title">Bodystyle:</strong>
        <span class="value">SUV</span>
    </li>
    <li class="engine">
        <strong class="title">Engine:</strong>
        <span class="value">5.6LL V-8 cyl</span>
    </li>
    <li class="transmission">
        <strong class="title">Transmission:</strong>
        <span class="value">7-Speed Automatic</span>
    </li>
    <li class="driveLine">
        <strong class="title">Drive Line:</strong>
        <span class="value">4x4</span>
    </li>
    <li class="exteriorColor">
        <strong class="title">Exterior Color:</strong>
        <span class="value">Black Obsidian</span>
    </li>
    <li class="interiorColor">
        <strong class="title">Interior Color:</strong>
        <span class="value">Graphite w/Leather-Appointed Seat Trim w/Tuscan Bu</span>
    </li>
    <li class="stockNumber">
        <strong class="title">Stock #:</strong>
        <span class="value">A46045</span>
    </li>
</ul>
jamietre commented 9 years ago

There is something a little bit weird about that code (though nothing that should make it blow up) -

CQ ulDetailsCQ = ulDetails.RenderSelection()

RenderSelection returns a string - and assigning a string to CQ object invokes an implicit static overload of the = operator to parse the string as HTML, this is really the same as

CQ ulDetailsCQ = CQ.Create(ulDetails.RenderSelection())

It would be a lot more efficient to do:

CQ ulDetailsCQ = ulDetails.Clone();

.. and if all you are trying to do is do a sub-query then you really don't even need to do that..

var span = ulDetails.Find("li.stockNumber span.value").FirstOrDefault();

would get you the same thing without having to create a new CQ.

This is just a code review, though, I really can't think why this would be in any way responsible for the problem you are seeing.

wjchristenson2 commented 9 years ago

Thanks for the code review! Always helpful to see new approaches to getting at the same result. If the problem is with creating a new CQ object, I can try to avoid such and use the Find() method instead.

jamietre commented 9 years ago

Well, I doubt this is the problem, but it might at least shed some light on what the problem is by trying different things out.

wjchristenson2 commented 9 years ago

Update: I've changed all code with CsQuery to use lock {} to ensure that any CsQuery interaction is single-threaded. The quantity of errors has gone down slightly, but they still occur. For whatever reason, retrying the web page at a later time works. On a web page that had the error occur, I've taken the HTML which CsQuery errored on and taken it offline and ran the exact same code on it in a loop of 1,000 times and it did not error. At this point, I am out things to try.

My only thought is it may be something with memory (garbage collection) or timing.

wjchristenson2 commented 9 years ago

Another Update: After reviewing all exceptions, it appears they randomly occur with using the IDomObject.Render() method or calling the IDomObject.InnerHTML property.

We get random NullReferenceExceptions or Index out of range exceptions. I've began refactoring our code to avoid them.