umar-qureshi2 / fizzler

Automatically exported from code.google.com/p/fizzler
GNU General Public License v3.0

HTML that doesn't parse correctly (but doesn't fail either) #45

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
I've been using Fizzler with great success, but today I came across some HTML 
that silently failed to parse correctly.

I was selecting all of the <a> elements and noticed that one was being ignored. 
Here are the repro steps:

1. Load the HTML from http://pastebin.com/T1Lsr6w6 (this is the "View Source" 
for http://www.diapers.com/product/productdetail.aspx?productid=16913)
2. Try to query the selector "#pdp"
3. Run this example code (assuming the string variable html holds the HTML above):

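using HtmlAgilityPack;                  // HtmlDocument, HtmlNode
using Fizzler.Systems.HtmlAgilityPack;  // QuerySelector extension method

var doc = new HtmlDocument();
doc.LoadHtml(html);
var dom = doc.DocumentNode;
// Expected to return the <a id="pdp"> element; returns null instead
var pdpElement = dom.QuerySelector("#pdp");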

What is the expected output? What do you see instead?
Expect pdpElement to be an HtmlNode of <a 
href="http://c1.diapers.com/images/products/p/pg/pg-256_1z.jpg" 
class="MagicZoomPlus" id="pdp" title="Pampers Sensitive Thick Baby Wipes Refill 
360ct." target="_blank">

Instead, it doesn't find a match.

What version of the product are you using? On what operating system?
Fizzler 0.9

Please provide any additional information below.

Original issue reported on code.google.com by portman....@gmail.com on 6 Apr 2011 at 7:36

GoogleCodeExporter commented 9 years ago
I narrowed down the error slightly.

Using VisualFizzler (neat tool!) I can see that everything up to line 282 is 
selectable (for example "#siteNav").

But from line 283 onward, I can't select anything (for example "div.topToolBox").

So the issue appears to be related to very long lines, such as line 283 of that 
pastebin example.

Original comment by portman....@gmail.com on 6 Apr 2011 at 7:59

GoogleCodeExporter commented 9 years ago
Sure enough, when I remove this line (#283) from the HTML, everything works 
perfectly. It's pathologically long (51,553 characters in fact!!) so this is 
probably a defect in one of the underlying framework classes that Fizzler is 
using.

In the meantime, I've changed my code to chop long lines at 1024 characters 
before handing off to Fizzler, and everything is working again. But you still 
might want to investigate what precisely is going wrong on that long line, so 
I'll keep the issue open.
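
For anyone who wants to try the same workaround, here is a minimal sketch of one way to chop long lines. The comment above does not say exactly how the lines were split, so this is an assumption: it breaks a line by turning the last space or tab before the limit into a newline, which avoids cutting through a tag or attribute name. ChopLongLines and LongLineChopper are hypothetical names, not part of Fizzler or HtmlAgilityPack.

```csharp
using System.Text;

// Hypothetical helper, not part of Fizzler or HtmlAgilityPack.
static class LongLineChopper
{
    // Returns a copy of html in which lines longer than maxLength are broken
    // by replacing the most recent space or tab on that line with '\n'.
    // A line with no whitespace at all is left as it is.
    public static string ChopLongLines(string html, int maxLength = 1024)
    {
        var sb = new StringBuilder(html);
        int lineStart = 0;   // index of the first character of the current line
        int lastSpace = -1;  // index of the latest space/tab on the current line

        for (int i = 0; i < sb.Length; i++)
        {
            char c = sb[i];
            if (c == '\n')
            {
                // An existing line break starts a new line.
                lineStart = i + 1;
                lastSpace = -1;
            }
            else if (c == ' ' || c == '\t')
            {
                lastSpace = i;
            }

            // Current line is too long and has a usable break point.
            if (i - lineStart + 1 > maxLength && lastSpace >= lineStart)
            {
                sb[lastSpace] = '\n';
                lineStart = lastSpace + 1;
                lastSpace = -1;
            }
        }
        return sb.ToString();
    }
}
```

It would be called as doc.LoadHtml(LongLineChopper.ChopLongLines(html)). Note that a space inside an attribute value or text node can also become a newline, so this is only suitable when that whitespace change is acceptable.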

Original comment by portman....@gmail.com on 6 Apr 2011 at 8:08

GoogleCodeExporter commented 9 years ago
We're using HtmlAgilityPack so it's probably an issue there, but it should be 
fairly trivial to swap out HtmlAgilityPack for another parser. It could also be 
that this issue has been fixed by a more recent version of HtmlAgilityPack than 
the one in the download.

Original comment by info%colinramsay.co.uk@gtempaccount.com on 7 Apr 2011 at 1:48