What will reproduce the problem?
I have included a small htmltest.php to demonstrate the issue. Based on what I
read on the web, I believe it is caused by webpages that contain html which has
been copy/pasted from MS Word and similar (bad practice, but there are some of
them out there). I have seen this on some of the websites I have tried to parse
with ganon.
The pastes contain some strange conditionals that are actually comments,
probably interpreted by the older IE browsers.
I have added a thorough explanation below and a "fixed" version is also
attached.
What is the expected output? What do you see instead?
The DOM created by the parser only holds the code up to the first occurrence of
a "strange" conditional tag.
Which version are you using?
78
Please provide any additional information below.
Thanks for the good work done to create this HTML-parser! It has worked well
for me on many occasions.
Then I came across some websites where the results were not as expected.
Parts of the page were missing in the parsed result.
Investigations showed that it is caused by some strange conditional comments
apparently inserted to pass and hide code for Internet Explorer - probably older
versions.
Ganon HTML-parser does handle conditionals as described in the standard:
<!--[if IE]> ......<![endif]--> to hide code from standard browsers and
<![if !IE]> .......<![endif]> to show code only in standard browsers.
But some web-pages have code like:
<!--[if !ListSupported]-->......<!--[endif]-->
While most of us can agree that this is bogus code that shouldn't really be
there, it breaks the parsing because ganon sees the "<!--[if" and correctly
assumes this is a conditional. It then fails to find the "]>" that ends this.
As a result the rest of the file is skipped.
I have considered various hacks to make this parse correctly; none of them are
very pretty.
The new ganon.php I have included implements a "look-ahead" function called
if_conditional() in HTML_Parser_Base (line 581).
The function is used in parse_tag() of the same class (line 527). When it has
been determined that the tag starts with "<!--[if" it also calls the new
function which has to return true for the element to be parsed as a conditional.
The function looks ahead from the current position to find the next ']' and
then the next '>'. It then checks whether the two characters before the '>' are
'--'. If they are, the function returns false.
As a result the tag is parsed as a comment, NOT as a conditional. For my
purposes this works!
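
To make the look-ahead easier to follow, here is a minimal standalone sketch of
the check described above. It is not the code from the attached ganon.php: the
function name looks_like_conditional() and its parameters are made up for this
illustration, and returning false when no ']' or '>' is found is my own
assumption about the most sensible fallback.

```php
<?php
// Hypothetical helper, NOT the actual if_conditional() from the attachment.
// $doc is the full HTML source, $pos is the offset where "<!--[if" starts.
function looks_like_conditional(string $doc, int $pos): bool
{
    // Find the next ']' after the current position.
    $bracket = strpos($doc, ']', $pos);
    if ($bracket === false) {
        return false; // assumption: no closing bracket, treat as a comment
    }

    // Find the next '>' after that bracket.
    $gt = strpos($doc, '>', $bracket);
    if ($gt === false) {
        return false; // assumption: no terminator, treat as a comment
    }

    // If the '>' is preceded by "--", the tag ends like a plain comment
    // ("-->"), so it should not be parsed as a conditional.
    $endsLikeComment = $gt >= 2 && substr($doc, $gt - 2, 2) === '--';
    return !$endsLikeComment;
}

// A regular downlevel-hidden conditional: parsed as a conditional.
var_dump(looks_like_conditional('<!--[if IE]><p>x</p><![endif]-->', 0));

// The Word-style variant: parsed as a comment.
var_dump(looks_like_conditional('<!--[if !ListSupported]-->x<!--[endif]-->', 0));
```

The first call prints bool(true) and the second prints bool(false). The check
only runs once the parser has already seen "<!--[if", so the other conditional
form ("<![if !IE]>") is not affected by it.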
I have included a small test that illustrates the issue and checks that the
original uses of the conditional tag are still parsed correctly.
There may be better and more elegant ways of achieving this result, but this
has worked for me.
Kind regards
Torben from Denmark
Original issue reported on code.google.com by TorbenEl...@gmail.com on 12 Jul 2015 at 8:52