jjlee / mechanize

Stateful programmatic web browsing in Python, after Andy Lester's Perl module WWW::Mechanize .
http://wwwsearch.sourceforge.net/mechanize/

Error when parsing an HTML page with character references without a semicolon #83

Open carloborsoi opened 11 years ago

carloborsoi commented 11 years ago

Hi,

I am new to this kind of forum. I found a problem when using br.links(), and I think I have a solution.

Where can I publish and discuss this solution? The problem can be fixed directly in _sgmllib_copy.py, but a workaround in _html.py is also possible.

The problem is: in some HTML pages, _sgmllib_copy.py assumes that some character references (e.g. &#39) are hexadecimal because they appear to end with a letter in A-F. They are not hexadecimal, because there is no 'x' at the beginning (e.g. Gustaf&#39Aldo).
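To illustrate, here is a minimal sketch in plain Python, assuming the parser hands the reference body "39A" (taken from "&#39Aldo") to the conversion step. That value is my reading of the report, not something I traced through the mechanize source:

```python
# Assumed reference body extracted from "Gustaf&#39Aldo": the trailing "A"
# belongs to the following word, but it looks like a hex digit.
name = "39A"

# Parsed as decimal (no leading "x"), the conversion fails outright.
try:
    decimal_code = int(name, 10)
except ValueError:
    decimal_code = None

# Parsed as hexadecimal, it succeeds but gives a different character
# than the intended apostrophe (code point 39).
hex_code = int(name, 16)

print(decimal_code, hex_code)
```

Neither reading gives the intended code point 39, which is why the trailing letter has to be dropped before the conversion.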

Solution: to avoid changing _sgmllib_copy.py, it is possible to change line 315 of _html.py from:

if name.startswith("x"): name, base= name[1:], 16

to:

if name.startswith("x"): name, base= name[1:], 16
else: name = filter(lambda x: x.isdigit(), name)
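The same idea as a standalone sketch, for anyone who wants to try it outside mechanize. The function name parse_charref is hypothetical and is not part of mechanize. It uses a join instead of filter() because on Python 3 filter() returns an iterator, which int() would not accept:

```python
import string


def parse_charref(name):
    """Return the code point for a character reference body.

    Hypothetical helper mirroring the proposed workaround: 'x27' is
    hexadecimal, anything else is decimal with non-digits dropped.
    """
    base = 10
    if name.startswith("x"):
        # Explicit hexadecimal reference such as "&#x27;".
        name, base = name[1:], 16
    else:
        # Decimal reference: drop stray letters picked up from the
        # following text, e.g. the "A" in "&#39Aldo".
        name = "".join(ch for ch in name if ch in string.digits)
    return int(name, base)


print(parse_charref("39A"))
print(parse_charref("x27"))
```

Both calls should give the apostrophe's code point. Note that, like the one-line change above, this keeps every digit in the string and not only the leading ones, and it still raises ValueError if no digits remain.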