rezakho / ganon

Automatically exported from code.google.com/p/ganon

Strange MS conditionals break parsing #68

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
What will reproduce the problem?

I have included a small htmltest.php to demonstrate the issue. Based on what I 
read on the web, I believe it is caused by webpages that contain html which has 
been copy/pasted from MS Word and similar (bad practice, but there are some of 
them out there). I have seen this on some of the websites I have tried to parse 
with ganon.

The pasted markup contains some strange conditionals that are actually comments, 
probably interpreted by older IE browsers.

I have added a thorough explanation below, and a "fixed" version is also 
attached.

What is the expected output? What do you see instead?

The DOM created by the parser only holds the code up to the first occurrence of 
a "strange" conditional tag.

Which version are you using?
78

Please provide any additional information below.

Thanks for the good work done to create this HTML parser! It has worked well 
for me on many occasions.

Then I came across some websites where the results were not as expected. 
Parts of the page were missing in the parsed result.

Investigation showed that this is caused by some strange conditional comments, 
apparently inserted to pass code to Internet Explorer (probably older versions) 
and hide it from other browsers.

Ganon HTML-parser does handle conditionals as described in the standard:

<!--[if IE]> ......<![endif]--> to hide code from standard browsers and

<![if !IE]> .......<![endif]> to show code only in standard browsers.

But some web-pages have code like:

<!--[if !ListSupported]-->......<!--[endif]-->

While most of us can agree that this is bogus code that shouldn't really be 
there, it breaks the parsing because ganon sees the "<!--[if" and correctly 
assumes this is a conditional. It then fails to find the "]>" that ends this. 
As a result the rest of the file is skipped.
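
The failure can be illustrated with a simplified sketch. This is not ganon's actual parser code; `find_conditional_end()` is a hypothetical helper that only mimics the "search for `]>`" step described above, and the two sample strings are made up for the illustration.

```php
<?php
// Simplified illustration only; this is NOT ganon's actual parser code.
// Hypothetical helper: returns the offset of the first "]>" at or after
// $pos, or false if the document contains none.
function find_conditional_end(string $doc, int $pos)
{
    return strpos($doc, ']>', $pos);
}

// A conditional in the form ganon already handles.
$standard = '<!--[if IE]><p>IE only</p><![endif]-->';
// Word-style markup: the opening "tag" is closed by "-->", never by "]>".
$word = '<!--[if !ListSupported]--><p>item</p><!--[endif]-->';

var_dump(find_conditional_end($standard, 0)); // int(10)
var_dump(find_conditional_end($word, 0));     // bool(false)
```

Because the second search finds no terminator, a parser that insists on one has nothing sensible to resume from, which matches the truncated DOM described above.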

I have considered various hacks to make this parse correctly; none of them are 
very pretty.

The new ganon.php I have included implements a "look-ahead" function called 
if_conditional() in HTML_Parser_Base (line 581).

The function is used in parse_tag() of the same class (line 527). When it has 
been determined that the tag starts with "<!--[if" it also calls the new 
function which has to return true for the element to be parsed as a conditional.

The function looks ahead from the current position to find the next ']' and 
then the next '>'. It then checks whether the two characters before the '>' are 
'--'. If they are, the function returns false.

As a result the tag is parsed as a comment, NOT as a conditional. For my 
purposes this works!
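
A minimal stand-alone sketch of that look-ahead, for readers without the attachment. It is an assumption-laden reconstruction from the description above, not the attached code: the function name `looks_like_conditional()`, its `($doc, $pos)` signature, and the choice to return false when no ']' or '>' is found are all hypothetical. The real `if_conditional()` is a method of HTML_Parser_Base and works on the parser's own state.

```php
<?php
/**
 * Hypothetical stand-alone version of the look-ahead check.
 *
 * $doc is the full HTML source; $pos is the offset of the '<' that
 * starts "<!--[if". Returns true if the tag should be parsed as a
 * conditional, false if it should be treated as an ordinary comment.
 */
function looks_like_conditional(string $doc, int $pos): bool
{
    // Find the next ']' after the current position.
    $bracket = strpos($doc, ']', $pos);
    if ($bracket === false) {
        return false; // assumption: no ']' means "not a conditional"
    }

    // Find the next '>' after that ']'.
    $gt = strpos($doc, '>', $bracket);
    if ($gt === false) {
        return false; // assumption: unterminated, fall back to comment
    }

    // If the '>' is preceded by "--", the tag ends in "-->" and is
    // therefore a plain comment rather than a conditional.
    if ($gt >= 2 && substr($doc, $gt - 2, 2) === '--') {
        return false;
    }
    return true;
}

var_dump(looks_like_conditional('<!--[if IE]><p>x</p><![endif]-->', 0));
var_dump(looks_like_conditional('<!--[if !ListSupported]-->x<!--[endif]-->', 0));
```

The first call should report a conditional and the second a comment, which is the distinction the patched parse_tag() relies on.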

I have included a small test that illustrates the issue and checks that the 
original uses of the conditional tag are still parsed correctly.

There may be better and more elegant ways of achieving this result, but this 
has worked for me.

Kind regards

Torben from Denmark 

Original issue reported on code.google.com by TorbenEl...@gmail.com on 12 Jul 2015 at 8:52

Attachments: