jjlee / mechanize

Stateful programmatic web browsing in Python, after Andy Lester's Perl module WWW::Mechanize.
http://wwwsearch.sourceforge.net/mechanize/

pullparse is mis-reading token #66

Open idella opened 12 years ago

idella commented 12 years ago

Traceback (most recent call last):
  File "/mnt/gen2/TmpDir/portage/dev-python/mechanize-0.2.5/work/mechanize-0.2.5/test/test_pullparser.py", line 274, in test_tokens
    self._test_tokens(pc, tolerant)
  File "/mnt/gen2/TmpDir/portage/dev-python/mechanize-0.2.5/work/mechanize-0.2.5/test/test_pullparser.py", line 290, in _test_tokens
    self.assertEquals(token.type, expected_token_types[i])
AssertionError: 'comment' != 'decl'

On the very first line the parser reads <!DOCTYPE and correctly evaluates it to 'decl'. The failure comes when p.get_token() reads <!rheum> and evaluates it to 'comment' instead of 'decl'. The only difference between the two is that the first character after <! is lowercase, and the parser does not seem to distinguish that case from <!--.

Xarthisius commented 12 years ago

According to the HTML standard, <!foo> should be treated as a "bogus comment" [1, 2]. That was fixed in Python 2.7 recently [3].

[1] http://www.w3.org/TR/html5/tokenization.html#markup-declaration-open-state
[2] http://www.w3.org/TR/html5/tokenization.html#bogus-comment-state
[3] http://bugs.python.org/issue13960
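
For anyone who wants to check the "bogus comment" behaviour locally, here is a minimal sketch. It assumes Python 3 and uses the standard library's html.parser rather than mechanize's pullparser, so it only illustrates how the underlying parser classifies these tokens. The names TokenRecorder and classify are hypothetical helpers made up for this example.

```python
from html.parser import HTMLParser


class TokenRecorder(HTMLParser):
    """Hypothetical helper: records only declarations and comments."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_decl(self, decl):
        # Called for <!DOCTYPE ...> declarations.
        self.events.append(("decl", decl))

    def handle_comment(self, data):
        # Called for <!-- ... --> and for "bogus comments" such as <!foo>.
        self.events.append(("comment", data))


def classify(markup):
    """Return the (type, data) pairs the parser reports for the markup."""
    parser = TokenRecorder()
    parser.feed(markup)
    parser.close()
    return parser.events


if __name__ == "__main__":
    for event in classify("<!DOCTYPE html><!rheum><!-- real comment -->"):
        print(event)
```

On an interpreter that includes the fix from [3], <!DOCTYPE html> should come back as a 'decl' while <!rheum> comes back as a 'comment', which matches the assertion failure in the traceback. If that is the case, the expectation in test_pullparser.py would need updating rather than the parser.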